SYSTEMS AND METHODS FOR GENETIC ANALYSIS

The invention relates to using a graph database in genetic analyses to link mutation data to extrinsic data. Entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and relationships between entities are also individually represented and stored. Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any entity without disrupting the existing data. Systems and methods of the invention may be used for obtaining data representing a mutation in an individual and using a node in a graph database to store a description of the mutation. The node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant. The graph database can be queried to provide a report of the clinical significance of the mutation.

Description
CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to, and the benefit of, U.S. Provisional Patent Application No. 62/037,861, filed Aug. 15, 2014, the contents of which are incorporated by reference.

TECHNICAL FIELD

The invention relates to medical genetics.

BACKGROUND

Before having children, a person may turn to genetic screening to find out if he or she is a carrier of a genetic condition. Genetic carrier screening can be done using next-generation sequencing (NGS), which produces millions of “base-calls” read from the person's genome. Typically, those base calls are then compared to a reference genome to determine their clinical significance. While all 3.2 billion base-pairs of the human genome are available for use as a reference (e.g., as hg18), knowing the clinical significance of features in the person's genome requires turning to medical literature or specialized databases of mutations. For example, the Online Mendelian Inheritance in Man (OMIM) database contains information on genetic disorders in over 12,000 human genes.

The volumes of data that must be stored, compared, and understood are a significant obstacle to realizing the full potential of NGS as a carrier screening tool. Generally, the time required for analysis and reporting is proportional to the amount of data in the databases. The structure of those databases requires exhaustive index table lookups for each comparison. Also, since database designs must be locked in prior to use, a clinician's use of the data system is limited to what the database designer foresaw as the likely qualities of the data. A clinician who discovers a new phenomenon, such as a novel combination of mutations associated with an unexpected disease, may be faced with a data system that does not even provide a means for entering or describing this information.

SUMMARY

The invention provides systems and methods for genetic analysis in which entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and in which relationships between entities are also individually represented and stored. Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any novel entity without disrupting the information already represented in the system. By forsaking the traditional database schema of indexed tables, the run time for queries need not be proportional to the amount of data in the tables. Instead, queries that start with a certain node can find the relevant related nodes in time proportional only to the number of nodes in the results that match the query. Moreover, novel entities and relationships can be inserted into the data system upon discovery with no disruption to the data or operation of the system. Thus, novel mutations can be added or related to disease phenotypes or appropriate literature references as that new information is discovered and observed. The time required for a query of—for example—relationships between a patient and disease-associated alleles in that patient's genome will be proportional to the number of results that are found for inclusion in a report for that patient. Where sequencing uncovers novel mutations or genotype/phenotype associations, those entities and relationships can be brought into the system and included in the reporting without requiring any changes or re-design to the underlying system architecture. In methods and systems of the invention, NGS results, patient information, and medical information can be stored in a graph database and analyzed using graph processing approaches and languages. This provides for very rapid querying and report generation, independent of the size of the underlying data store.

Since report generation is rapid and not linked to the underlying volume of data, and since systems of the invention may easily accommodate the volumes of data associated with NGS sequencing and human genome based analyses, systems and methods of the invention may be employed for NGS-based carrier screening and provide meaningful results to patients.

Additionally, the invention includes the insight that the clinical significance of mutations, or “variants” (e.g., as documented in NGS results such as Variant Call Format (VCF) files), can be shown by relating the mutation to a particular allele of a gene and showing where in the literature the variant is reported as pathogenic or benign while connecting this information back to a patient and lab sample for reporting purposes. Sequencing by existing NGS technologies may provide abundant high-quality raw data in the form of sequence files such as FASTA, FASTQ, Sequence Alignment Map (SAM), Binary Alignment Map (BAM), or VCF files. Systems and methods of the invention can be used to extract relevant data from those files into the described nodes to support the rapid querying and report generation useful for NGS carrier screening. For example, systems of the invention may include an Application Programming Interface (API) that takes as input VCF files and creates a network of nodes representing patients, samples, VCF files, VCF records, variants, alleles, and literature reports with relationships connecting adjacent pairs of those nodes according to their natural relationships. The system supports a genomics analysis clinical pipeline even as it changes and can accommodate the loading in of external data. The system can be implemented using a graph database and related software. Systems of the invention support a variety of analyses and use cases. For example, with NGS-based carrier screening implemented using the described graph database structure for analysis and reporting, it becomes easy to query and report such phenomena as allele frequencies.
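As an illustration of the kind of extraction described above, the following Python sketch reads VCF data lines and builds patient, sample, file, record, and variant nodes joined by relationships. The `Graph` class, the `load_vcf` function, and the relationship labels (`PROVIDED`, `SEQUENCED_TO`, `CONTAINS`, `CALLS`) are hypothetical names chosen for this sketch; they are not taken from any particular graph database product, and the toy in-memory graph merely stands in for a real graph database.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Graph:
    """Toy in-memory property graph (an assumption; not a real graph database)."""
    nodes: dict = field(default_factory=dict)  # node id -> (label, properties)
    edges: list = field(default_factory=list)  # (source id, relationship, target id)
    _ids: count = field(default_factory=count)

    def add_node(self, label, **props):
        node_id = next(self._ids)
        self.nodes[node_id] = (label, props)
        return node_id

    def relate(self, src, rel, dst):
        self.edges.append((src, rel, dst))


def load_vcf(graph, patient_name, sample_name, file_name, lines):
    """Create nodes and relationships for one VCF file given as text lines."""
    patient = graph.add_node("Patient", name=patient_name)
    sample = graph.add_node("Sample", name=sample_name)
    vcf_file = graph.add_node("VcfFile", name=file_name)
    graph.relate(patient, "PROVIDED", sample)
    graph.relate(sample, "SEQUENCED_TO", vcf_file)

    variants = {}  # (chrom, pos, ref, alt) -> variant node id, so each variant is stored once
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information lines, and the column header
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        record = graph.add_node("VcfRecord", chrom=chrom, pos=int(pos), ref=ref, alt=alt)
        graph.relate(vcf_file, "CONTAINS", record)
        for allele in alt.split(","):  # a record may list several alternate alleles
            key = (chrom, int(pos), ref, allele)
            if key not in variants:
                variants[key] = graph.add_node(
                    "Variant", chrom=chrom, pos=int(pos), ref=ref, alt=allele)
            graph.relate(record, "CALLS", variants[key])
    return graph
```

In this sketch, two records that describe the same change share one variant node, so literature or allele nodes attached to that variant later would be reachable from every record that calls it.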

Importantly, systems and methods of the invention support the curation of variants. Curating variants includes identifying an individual variant in sequencing results, researching medical literature for information about the variant, classifying the variant (e.g., pathogenic, benign, somewhere in between), and accessioning that information into the database for use in subsequent reports on patient samples in which that variant is implicated. Using the nodes and relationships provided by the invention, variants can be connected to alleles, literature references, medical information, or combinations thereof. If changes are subsequently made (e.g., a missense mutation is re-classified as a nonsense mutation), other features of the system infrastructure are not disrupted. Thus the active curation of variants is accommodated and improves the system.

In certain aspects, the invention provides a method for analyzing mutations. The method includes obtaining data representing a mutation in a genome of an individual and using a node in a graph database to store a description of the mutation. The node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant. The method includes querying the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
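The steps of this method can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Node` class, the relationship labels `HAS_VARIANT` and `HAS_SIGNIFICANCE`, and the property names are hypothetical, and Python object references stand in for the stored pointers.

```python
class Node:
    """A node that stores direct references to its adjacent nodes."""

    def __init__(self, label, **props):
        self.label = label
        self.props = props
        self.adjacent = []  # list of (relationship label, Node)

    def link(self, rel, other):
        self.adjacent.append((rel, other))


def report_significance(patient):
    """Follow patient -> variant -> clinical significance and collect report rows."""
    rows = []
    for rel, variant in patient.adjacent:
        if rel != "HAS_VARIANT":
            continue
        for rel2, info in variant.adjacent:
            if rel2 == "HAS_SIGNIFICANCE":
                rows.append((variant.props["description"],
                             info.props["classification"],
                             info.props["source"]))
    return rows


# Made-up example data, for illustration only.
patient = Node("Patient", name="patient-1")
variant = Node("Variant", description="chr1:12345 A>G")
significance = Node("ClinicalSignificance",
                    classification="pathogenic",
                    source="hypothetical literature reference")
patient.link("HAS_VARIANT", variant)
variant.link("HAS_SIGNIFICANCE", significance)
report = report_significance(patient)
```

The query starts at the patient node and touches only the nodes reachable from it, so unrelated nodes elsewhere in the graph do not enter into the work done for this report.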

The data representing the mutation may be obtained by obtaining a sample that includes a nucleic acid from the individual and sequencing the nucleic acid to obtain a sequence read file that includes the data. The sample may be represented in the graph database using a sample node and the sample node may be connected via a pointer to a read file node representing the sequence read file. The graph database may include nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants as well as edges defining relationships between pairs of the nodes.

In some embodiments, the data representing a mutation is obtained as part of a file such as a variant call file (VCF), a sequence alignment map (SAM) file, a binary alignment map (BAM) file, a FASTA file, or a FASTQ file. The file may be represented in the graph database (e.g., using a file node) and a pointer to the file node may be stored in the mutation node.

In certain embodiments, the data representing a mutation comprises a description of the mutation as a variant of a reference human genome. The description of the mutation may be provided as a VCF record in a VCF file. The method may include obtaining sequencing data that represents a plurality of mutations in the genome of the individual—each of the plurality of mutations being represented as variant calls relative to a human genome reference. For each of the plurality of mutations, a corresponding variant node in the graph database is used to store a description of that mutation.

Aspects of the invention provide a system for describing genetic information. The system includes at least one computer comprising memory coupled to a processor. The system has at least a portion of a graph database stored therein. The system is operable to obtain data representing a mutation in a genome of an individual, use a variant node in the graph database to store a description of the mutation, and store—within the variant node—a pointer to an adjacent node that provides information about a clinical significance of the mutation. The system may be used to query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual. As discussed above, the data representing a mutation may be obtained as part of a file such as a VCF file. The system may represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node.

The data representing the mutation may be provided as a sequence read file that includes that data. In certain embodiments, the system is operable to use the graph database to represent a biological sample from the individual with a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

The system may be operated to obtain sequencing data representing a plurality of mutations in the genome of the individual (e.g., as variant calls relative to a human genome reference) and use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation. The system links the individual to an allele node based on the plurality of mutations.
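One plausible reading of linking an individual to an allele node "based on the plurality of mutations" is that an allele is defined by a set of variants and the individual is linked when all of them are observed. The sketch below illustrates only that reading; the function name `match_alleles` and the variant and allele identifiers are hypothetical, and the actual matching rule used by a given embodiment may differ.

```python
def match_alleles(observed_variants, allele_definitions):
    """Return the names of alleles whose defining variants were all observed.

    observed_variants: iterable of variant identifiers found in the individual.
    allele_definitions: dict mapping allele name -> iterable of required variant identifiers.
    """
    observed = set(observed_variants)
    return sorted(name for name, required in allele_definitions.items()
                  if set(required) <= observed)


# Made-up identifiers, for illustration only.
definitions = {
    "alleleA": ["v1", "v2"],
    "alleleB": ["v3", "v4"],
    "alleleC": ["v3"],
}
linked = match_alleles(["v1", "v2", "v3"], definitions)
```

Each name returned here would correspond to one relationship created between the individual's node and an allele node.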

In a preferred aspect, the invention provides: a system for describing genetic information, the system comprising: at least one computer comprising memory coupled to a processor, the system having at least a portion of a graph database stored therein, wherein the system is operable to: obtain data representing a mutation in a genome of an individual; use a node in the graph database to store a description of the mutation; store, in the node, a pointer to an adjacent node that provides information about a clinical significance of the mutation; and query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual. Preferably a pointer identifies a physical location in the memory at which the adjacent node is stored. Thus each node may be stored at a specific physical location in the memory. Each such specific physical location is referenced by a pointer (which itself optionally may be stored within a node at a physical location that is referenced, in turn, by another pointer). Preferably, each pointer identifies a physical location in the memory subsystem at which the adjacent object is stored. In the preferred embodiments, the pointer or native pointer is manipulatable as a memory address in that it points to a physical location in the memory, and dereferencing the pointer accesses the intended data. That is, a pointer is a reference to a datum stored somewhere in memory; to obtain that datum is to dereference the pointer. The feature that separates pointers from other kinds of reference is that a pointer's value is interpreted as a memory address, at a low-level or hardware level. The speed and efficiency of the described low-level, or hardware level, memory referencing allows for rapid graph traversals, which means that data content can scale up unbounded but reporting actionable medical genetic information will not require amounts of time that scale up with the data content.
Use of hardware level references, or index-free adjacency, uncouples the time requirements for medical genetics reporting from data content volume.
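Index-free adjacency with fixed-size records can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the storage format of any particular database: the record layouts, the `Store` class, and its method names are hypothetical, and byte offsets into a `bytearray` stand in for physical memory addresses.

```python
import struct

# Fixed-size records (layouts are assumptions made for this sketch).
NODE = struct.Struct("<24si")  # 24-byte name, offset of first relationship (-1 if none)
REL = struct.Struct("<ii")     # offset of target node, offset of next relationship (-1 if none)


class Store:
    """Append-only byte buffer in which offsets act as pointers."""

    def __init__(self):
        self.buf = bytearray()

    def add_node(self, name):
        offset = len(self.buf)
        self.buf += NODE.pack(name.encode(), -1)
        return offset

    def add_rel(self, src, dst):
        name, first = NODE.unpack_from(self.buf, src)
        offset = len(self.buf)
        self.buf += REL.pack(dst, first)             # new relationship becomes head of the chain
        NODE.pack_into(self.buf, src, name, offset)  # source node now points at it
        return offset

    def name(self, node):
        return NODE.unpack_from(self.buf, node)[0].rstrip(b"\0").decode()

    def neighbors(self, node):
        """Yield adjacent node offsets by following pointers, with no index lookup."""
        rel = NODE.unpack_from(self.buf, node)[1]
        while rel != -1:
            dst, rel = REL.unpack_from(self.buf, rel)
            yield dst
```

In this model, `neighbors` reads one node record and then one relationship record per adjacent node, so the work depends on the number of relationships of the starting node rather than on how many other records the buffer holds.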

In a first embodiment of the preferred aspect, the system is operable to obtain the data representing the mutation by receiving at least one sequence read file that includes the data. Preferably the system of the first embodiment is further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

In a second embodiment of the preferred aspect, the data representing the mutation is obtained as part of a file. In the second embodiment, the file may have a format selected from the group consisting of variant call format; sequence alignment map; binary alignment map; FASTA; and FASTQ. Preferably in the second embodiment the system is operable to represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node. Optionally, the system is further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

In a third embodiment of the preferred aspect, the data representing the mutation comprises a description of the mutation as a variant of a reference human genome. In the third embodiment, the description of the mutation may optionally be obtained from a VCF record in a VCF file. Additionally, the system of the third embodiment may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

In a fourth embodiment of the preferred aspect, the system is further operable to: obtain sequencing data representing a plurality of mutations in the genome of the individual, the plurality of mutations being represented as variant calls relative to a human genome reference; use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation; and link the individual to an allele node based on the plurality of mutations. In the fourth embodiment, the graph database may include: nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and edges defining relationships between pairs of the nodes. The system of the fourth embodiment may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

In a fifth embodiment of the preferred aspect, the graph database comprises: nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and edges defining relationships between pairs of the nodes. In the fifth embodiment, the system may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an exemplary NGS workflow for carrier screening.

FIG. 2 gives a sample of an exemplary VCF file.

FIG. 3 diagrams a method for analyzing mutations.

FIG. 4 gives a flow chart for a VCF file parser.

FIG. 5 presents a model of data received from parsing a VCF file.

FIG. 6 shows an entity relationship diagram (ERD) of the data modeled by FIG. 5.

FIG. 7 diagrams a high-level architecture of a system of the invention.

FIG. 8 illustrates a structure for nodes and relationships on disk.

FIG. 9 illustrates the use of a variant node to store a description of a mutation.

FIG. 10 shows an allele node showing that an allele includes a certain mutation.

FIG. 11 shows a variant node connected to two different literature reference nodes.

FIG. 12 illustrates updating information about a mutation.

FIG. 13 presents an example database that may be queried for allele frequency.

FIG. 14 diagrams a system for performing methods of the invention.

DETAILED DESCRIPTION

The invention relates to using a graph database in genetic analyses to link mutation data to extrinsic data. Entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and relationships between entities are also individually represented and stored. Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any entity without disrupting the existing data. Systems and methods of the invention may be used for obtaining data representing a mutation in an individual and using a variant node in a graph database to store a description of the mutation. The variant node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant. The graph database can be queried to provide a report of the clinical significance of the mutation. In certain embodiments, systems and methods of the invention operate within the context of a carrier screening workflow and provide a querying and reporting tool for carrier screening.

FIG. 1 illustrates an exemplary NGS workflow for carrier screening. The workflow combines automated, optimized molecular inversion probe target capture 109 with molecular barcoding to maximize the sample throughput of an NGS machine and employs assembly and alignment methods that allow accurate identification of both substitution and insertion/deletion lesions. The workflow is applicable to, for example, genes in which loss-of-function mutations cause recessive Mendelian disorders often included as part of routine carrier screening. A screening or analysis may begin with obtaining nucleic acid from a sample.

Nucleic acid in a sample can be any nucleic acid, including, for example, genomic DNA in a tissue sample, cDNA amplified from a particular target in a laboratory sample, or mixed DNA from multiple organisms. In some embodiments, the sample includes homozygous DNA from a haploid or diploid organism. For example, a sample can include genomic DNA from a patient who is homozygous for a rare recessive allele. In other embodiments, the sample includes heterozygous genetic material from a diploid or polyploid organism with a somatic mutation such that two related nucleic acids are present in allele frequencies other than 50 or 100%, e.g., 20%, 5%, 1%, 0.1%, or any other allele frequency.

In one embodiment, nucleic acid template molecules (e.g., DNA or RNA) are isolated from a biological sample containing a variety of other components, such as proteins, lipids, and non-template nucleic acids. Nucleic acid template molecules can be obtained from any cellular material, obtained from animal, plant, bacterium, fungus, or any other cellular organism. Biological samples for use in the present invention also include viral particles or preparations. Nucleic acid template molecules can be obtained directly from an organism or from a biological sample obtained from an organism, e.g., from blood, urine, cerebrospinal fluid, seminal fluid, saliva, sputum, stool, and tissue. Any tissue or body fluid specimen (e.g., a human tissue or bodily fluid specimen) may be used as a source for nucleic acid to use in the invention. Nucleic acid template molecules can also be isolated from cultured cells, such as a primary cell culture or cell line. The cells or tissues from which template nucleic acids are obtained can be infected with a virus or other intracellular pathogen. A sample can also be total RNA extracted from a biological specimen, a cDNA library, viral, or genomic DNA. A sample may also be isolated DNA from a non-cellular origin, e.g. amplified/isolated DNA from the freezer.

Generally, nucleic acid can be extracted, isolated, amplified, or analyzed by a variety of techniques such as those described by Green and Sambrook, Molecular Cloning: A Laboratory Manual (Fourth Edition), Cold Spring Harbor Laboratory Press, Woodbury, N.Y. 2,028 pages (2012); or as described in U.S. Pat. No. 7,957,913; U.S. Pat. No. 7,776,616; U.S. Pat. No. 5,234,809; U.S. Pub. 2010/0285578; and U.S. Pub. 2002/0190663.

Nucleic acid from a sample may optionally be fragmented or sheared to a desired length, using a variety of mechanical, chemical, and/or enzymatic methods. DNA may be randomly sheared via sonication using, for example, an ultrasonicator sold by Covaris (Woburn, Mass.), brief exposure to a DNase, or using a mixture of one or more restriction enzymes, or a transposase or nicking enzyme. RNA may be fragmented by brief exposure to an RNase, heat plus magnesium, or by shearing. The RNA may be converted to cDNA. If fragmentation is employed, the RNA may be converted to cDNA before or after fragmentation. In one embodiment, nucleic acid is fragmented by sonication. In another embodiment, nucleic acid is fragmented by a hydroshear instrument. Generally, individual nucleic acid template molecules can be from about 2 kb to about 40 kb. In a particular embodiment, nucleic acids are about 6 kb-10 kb fragments. Nucleic acid molecules may be single-stranded, double-stranded, or double stranded with single-stranded regions (for example, stem- and loop-structures).

A biological sample may be lysed, homogenized, or fractionated in the presence of a detergent or surfactant as needed. Suitable detergents may include an ionic detergent (e.g., sodium dodecyl sulfate or N-lauroylsarcosine) or a nonionic detergent (such as the polysorbate 80 sold under the trademark TWEEN by Uniqema Americas (Paterson, N.J.) or C14H22O(C2H4O)n, known as TRITON X-100).

In certain embodiments, genomic DNA samples are input to a molecular inversion probe capture 109 reaction. Molecular inversion probes may be designed to capture the coding regions as well as well-characterized noncoding regions of genes. Such probes may include 5′ and 3′ targeting arms (extension and ligation, respectively) of, for example, a total of about 40 nucleotides, designed to flank 130-bp target regions. Each target is captured 109 by multiple probes that anneal to non-overlapping genomic intervals. PCR is performed 121 using primers containing patient-specific barcodes, yielding barcode libraries. Genomic DNA may be subjected to multiplex target capture using molecular inversion probes. Captured product may be subjected to PCR to attach molecular barcodes in a manner that allows sequencing from either end of the captured region.

PCR may be used as described or any other amplification reaction may be performed. Amplification refers to production of additional copies of a nucleic acid sequence and is generally carried out using polymerase chain reaction (PCR) or other technologies known in the art. The amplification reaction may be any amplification reaction known in the art that amplifies nucleic acid molecules such as PCR (e.g., nested PCR, PCR-single strand conformation polymorphism, ligase chain reaction, strand displacement amplification and restriction fragment length polymorphism, transcription based amplification system, rolling circle amplification, and hyper-branched rolling circle amplification, quantitative PCR, quantitative fluorescent PCR (QF-PCR), multiplex fluorescent PCR (MF-PCR), real time PCR (RTPCR), restriction fragment length polymorphism PCR). See U.S. Pat. No. 5,242,794; U.S. Pat. No. 5,494,810; U.S. Pat. No. 4,988,617; U.S. Pat. No. 6,582,938; U.S. Pat. No. 4,683,195; and U.S. Pat. No. 4,683,202, hereby incorporated by reference. Primers for PCR, sequencing, and other methods can be prepared by cloning, direct chemical synthesis, and other methods known in the art. Primers can also be obtained from commercial sources such as Eurofins MWG Operon (Huntsville, Ala.) or Life Technologies (Carlsbad, Calif.).

Amplification adapters may be attached to the fragmented nucleic acid. Adapters may be commercially obtained, such as from Integrated DNA Technologies (Coralville, Iowa). In certain embodiments, the adapter sequences are attached to the template nucleic acid molecule with an enzyme. The enzyme may be a ligase or a polymerase. The ligase may be any enzyme capable of ligating an oligonucleotide (RNA or DNA) to the template nucleic acid molecule. Suitable ligases include T4 DNA ligase and T4 RNA ligase, available commercially from New England Biolabs (Ipswich, Mass.). Methods for using ligases are well known in the art. The polymerase may be any enzyme capable of adding nucleotides to the 3′ and the 5′ terminus of template nucleic acid molecules.

Embodiments of the invention involve attaching the bar code sequences to the template nucleic acids, e.g., for barcode PCR 121. In certain embodiments, a bar code is attached to each fragment. In other embodiments, a plurality of bar codes, e.g., two bar codes, are attached to each fragment. A bar code sequence generally includes certain features that make the sequence useful in sequencing reactions. For example, the bar code sequences are designed to have minimal or no homo-polymer regions, i.e., 2 or more of the same base in a row such as AA or CCC, within the bar code sequence. The bar code sequences are also designed so that they are at least one edit distance away from the base addition order when performing base-by-base sequencing, ensuring that the first and last base do not match the expected bases of the sequence.
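The homo-polymer constraint described above can be checked mechanically. The following Python sketch is illustrative only: the function names `has_homopolymer` and `candidate_barcodes` are hypothetical, and the sketch covers just the no-repeated-base rule, not the edit-distance rule or any other design criterion.

```python
from itertools import product


def has_homopolymer(seq, run=2):
    """Return True if seq contains `run` or more identical bases in a row."""
    streak = 1
    for prev, cur in zip(seq, seq[1:]):
        streak = streak + 1 if prev == cur else 1
        if streak >= run:
            return True
    return False


def candidate_barcodes(length):
    """List every A/C/G/T string of the given length with no adjacent repeated base."""
    candidates = ("".join(bases) for bases in product("ACGT", repeat=length))
    return [seq for seq in candidates if not has_homopolymer(seq)]
```

Because the first base can be any of four and each later base any of the three that differ from its predecessor, a length of 5 leaves 4 × 3^4 = 324 candidates, which further design rules would then narrow down.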

The bar code sequences are designed such that each sequence is correlated to a particular portion of nucleic acid, allowing sequence reads to be correlated back to the portion from which they came. Methods of designing sets of bar code sequences are shown for example in U.S. Pat. No. 6,235,475, the contents of which are incorporated by reference herein in their entirety. In certain embodiments, the bar code sequences range from about 5 nucleotides to about 15 nucleotides. In a particular embodiment, the bar code sequences range from about 4 nucleotides to about 7 nucleotides. Since the bar code sequence is sequenced along with the template nucleic acid, the oligonucleotide length should be of minimal length so as to permit the longest read from the template nucleic acid attached. Generally, the bar code sequences are spaced from the template nucleic acid molecule by at least one base (which minimizes homo-polymeric combinations). In certain embodiments, the bar code sequences are attached to the template nucleic acid molecule, e.g., with an enzyme. The enzyme may be a ligase or a polymerase, as discussed above. Attaching bar code sequences to nucleic acid templates is shown in U.S. Pub. 2008/0081330 and U.S. Pub. 2011/0301042, the contents of which are incorporated by reference herein in their entirety. Methods for designing sets of bar code sequences and other methods for attaching bar code sequences are shown in U.S. Pat. Nos. 7,544,473; 7,537,897; 7,393,665; 6,352,828; 6,172,218; 6,172,214; 6,150,516; 6,138,077; 5,863,722; 5,846,719; 5,695,934; and 5,604,097, each incorporated by reference.

After any processing steps (e.g., obtaining, isolating, fragmenting, amplification, or barcoding), nucleic acid can be sequenced 129.

Sequencing 129 may be by any method known in the art. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing. Sequencing of separated molecules has more recently been demonstrated by sequential or single extension reactions using polymerases or ligases as well as by single or sequential differential hybridizations with libraries of probes.

A sequencing technique that can be used includes, for example, use of sequencing-by-synthesis systems sold under the trademarks GS JUNIOR, GS FLX+ and 454 SEQUENCING by 454 Life Sciences, a Roche company (Branford, Conn.), and described by Margulies, M. et al., Genome sequencing in micro-fabricated high-density picotiter reactors, Nature, 437:376-380 (2005); U.S. Pat. No. 5,583,024; U.S. Pat. No. 5,674,713; and U.S. Pat. No. 5,700,673, the contents of which are incorporated by reference herein in their entirety. 454 sequencing involves two steps. In the first step of those systems, DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains a 5′-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi) which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is detected and analyzed.
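Because the signal strength is proportional to the number of nucleotides incorporated, a series of per-flow signals can in principle be converted into bases. The Python sketch below shows the idea under simplifying assumptions: the function name `decode_flowgram` is hypothetical, the flow order and signal values in the example are made up, signals are assumed to be already normalized so that 1.0 corresponds to one incorporated nucleotide, and real base callers also handle noise and signal decay that this sketch ignores.

```python
def decode_flowgram(flow_order, signals):
    """Convert normalized per-flow light signals into a base sequence.

    flow_order: string of bases giving the repeating order of nucleotide flows.
    signals: one non-negative number per flow, where about 1.0 means one
             nucleotide incorporated, about 2.0 means two, and about 0.0 means none.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        incorporated = int(signal + 0.5)  # round to the nearest whole number of bases
        bases.append(base * incorporated)
    return "".join(bases)


# Made-up signals for two cycles of an assumed T, A, C, G flow order.
read = decode_flowgram("TACG", [1.02, 0.05, 2.1, 0.0, 0.0, 0.97, 0.1, 3.0])
```

With these example values the decoded read is `TCCAGGG`: a signal near 2 on a C flow yields two C bases, and a signal near 3 on a G flow yields three G bases.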

Another example of a DNA sequencing technique that can be used is SOLiD technology by Applied Biosystems from Life Technologies Corporation (Carlsbad, Calif.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to the 5′ and 3′ ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5′ and 3′ ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5′ and 3′ ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3′ modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is removed and the process is then repeated.

Another example of a DNA sequencing technique that can be used is ion semiconductor sequencing using, for example, a system sold under the trademark ION TORRENT by Ion Torrent by Life Technologies (South San Francisco, Calif.). Ion semiconductor sequencing is described, for example, in Rothberg, et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature 475:348-352 (2011); U.S. Pub. 2010/0304982; U.S. Pub. 2010/0301398; U.S. Pub. 2010/0300895; U.S. Pub. 2010/0300559; and U.S. Pub. 2009/0026082, the contents of each of which are incorporated by reference in their entirety.

Another example of a sequencing 129 technology that can be used is Illumina sequencing. Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented, and adapters are added to the 5′ and 3′ ends of the fragments. DNA fragments that are attached to the surface of flow cell channels are extended and bridge amplified. The fragments become double stranded, and the double stranded molecules are denatured. Multiple cycles of the solid-phase amplification followed by denaturation can create several million clusters of approximately 1,000 copies of single-stranded DNA molecules of the same template in each channel of the flow cell. Primers, DNA polymerase and four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. The 3′ terminators and fluorophores from each incorporated base are removed and the incorporation, detection and identification steps are repeated. Sequencing according to this technology is described in U.S. Pat. No. 7,960,120; U.S. Pat. No. 7,835,871; U.S. Pat. No. 7,232,656; U.S. Pat. No. 7,598,035; U.S. Pat. No. 6,911,345; U.S. Pat. No. 6,833,246; U.S. Pat. No. 6,828,100; U.S. Pat. No. 6,306,597; U.S. Pat. No. 6,210,891; U.S. Pub. 2011/0009278; U.S. Pub. 2007/0114362; U.S. Pub. 2006/0292611; and U.S. Pub. 2006/0024681, each of which is incorporated by reference in its entirety.

Another example of a sequencing technology that can be used includes the single molecule, real-time (SMRT) technology of Pacific Biosciences (Menlo Park, Calif.). In SMRT, each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked. A single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.

Another example of a sequencing technique that can be used is nanopore sequencing (Soni & Meller, 2007, Progress toward ultrafast DNA sequencing using solid-state nanopores, Clin Chem 53(11):1996-2001). A nanopore is a small hole, of the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it results in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore represents a reading of the DNA sequence.

Another example of a sequencing technique that can be used involves using a chemical-sensitive field effect transistor (chemFET) array to sequence DNA (for example, as described in U.S. Pub. 2009/0026082). In one example of the technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be detected by a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, and the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.

Another example of a sequencing technique that can be used involves using an electron microscope as described, for example, by Moudrianakis, E. N. and Beer M., in Base sequence determination in nucleic acids with the electron microscope, III. Chemistry and microscopy of guanine-labeled DNA, PNAS 53:564-71 (1965). In one example of the technique, individual DNA molecules are labeled using metallic labels that are distinguishable using an electron microscope. These molecules are then stretched on a flat surface and imaged using an electron microscope to measure sequences.

Sequencing according to embodiments of the invention generates a plurality of reads. Reads according to the invention generally include sequences of nucleotide data less than about 5000 bases in length, or less than about 150 bases in length. In certain embodiments, reads are between about 80 and about 90 bases, e.g., about 85 bases in length. In some embodiments, methods of the invention are applied to very short reads, i.e., less than about 50 or about 30 bases in length. Sequence read data can include the sequence data as well as meta information. Sequence read data can be stored in any suitable file format including, for example, VCF files, FASTA files or FASTQ files, as are known to those of skill in the art. In some embodiments, PCR product is pooled and sequenced (e.g., on an Illumina HiSeq 2000). Raw .bcl files are converted to qseq files using bclConverter (Illumina). FASTQ files are generated by “de-barcoding” genomic reads using the associated barcode reads; reads for which barcodes yield no exact match to an expected barcode, or contain one or more low-quality base calls, may be discarded. Reads may be stored in any suitable format such as, for example, FASTA or FASTQ format.

FASTA is originally a computer program for searching sequence databases and the name FASTA has come to also refer to a standard file format. See Pearson & Lipman, 1988, Improved tools for biological sequence comparison, PNAS 85:2444-2448. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the “>” and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence.
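By way of illustration only, the FASTA layout just described can be read with a short routine. The following is a minimal sketch in Python; the function name parse_fasta and the example records are hypothetical and are not part of any particular embodiment.

```python
from typing import Dict, Iterable, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each sequence identifier to (description, sequence)."""
    records: Dict[str, Tuple[str, str]] = {}
    identifier, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line ends the previous sequence, if there was one.
            if identifier is not None:
                records[identifier] = (description, "".join(chunks))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence data may span several lines; concatenate them.
            chunks.append(line)
    if identifier is not None:
        records[identifier] = (description, "".join(chunks))
    return records


# Made-up example input, for illustration only.
example = """>seq1 hypothetical example read
ACGTACGT
ACGT
>seq2
TTGACA
"""
fasta_records = parse_fasta(example.splitlines())
```

In this sketch the word after the ">" symbol becomes the key, and the remainder of the description line is kept alongside the concatenated sequence lines.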

The FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. It is similar to the FASTA format but with quality scores following the sequence data. Both the sequence letter and quality score are encoded with a single ASCII character for brevity. The FASTQ format is a de facto standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer. Cock et al., 2009, The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Res 38(6):1767-1771.
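As a hedged illustration of the single-character quality encoding, the sketch below reads FASTQ records and decodes each quality character to an integer score. It assumes unwrapped four-line records and the Sanger-style offset of 33, neither of which is stated above; the function name parse_fastq and the example read are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: Sanger-style encoding, where the score is ord(character) - 33.
PHRED_OFFSET = 33


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, quality scores) for each four-line record."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # One ASCII character encodes one quality score.
        scores = [ord(ch) - PHRED_OFFSET for ch in quality]
        yield header[1:], sequence, scores


# Made-up example read, for illustration only.
fastq_records = list(parse_fastq(["@read1", "GATT", "+", "II5!"]))
```

Other FASTQ variants use a different offset, in which case only the PHRED_OFFSET constant in this sketch would change.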

For FASTA and FASTQ files, meta information includes the description line and not the lines of sequence data. In some embodiments, for FASTQ files, the meta information includes the quality scores. For FASTA and FASTQ files, the sequence data begins after the description line and is typically presented using some subset of IUPAC ambiguity codes, optionally with “-”. In a preferred embodiment, the sequence data will use the A, T, C, G, and N characters, optionally including “-” or U as needed (e.g., to represent gaps or uracil).

Following sequencing, reads are preferably mapped 135 to a reference using assembly and alignment techniques known in the art or developed for use in the workflow. Various strategies for the alignment and assembly of sequence reads, including the assembly of sequence reads into contigs, are described in detail in U.S. Pat. No. 8,209,130, incorporated herein by reference. Strategies may include (i) assembling reads into contigs and aligning the contigs to a reference; (ii) aligning individual reads to the reference; (iii) assembling reads into contigs, aligning the contigs to a reference, and aligning the individual reads to the contigs; or (iv) other strategies known in the art or yet to be developed. Mapping 135, it can be seen, may employ assembly steps, alignment steps, or both. Assembly can be implemented, for example, by the program ‘The Short Sequence Assembly by k-mer search and 3′ read Extension’ (SSAKE), from Canada's Michael Smith Genome Sciences Centre (Vancouver, B.C., CA) (see, e.g., Warren et al., 2007, Assembling millions of short DNA sequences using SSAKE, Bioinformatics, 23:500-501). SSAKE cycles through a table of reads and searches a prefix tree for the longest possible overlap between any two sequences. SSAKE clusters reads into contigs.

Another read assembly program is Forge Genome Assembler, written by Darren Platt and Dirk Evers and available through the SourceForge web site maintained by Geeknet (Fairfax, Va.) (see, e.g., DiGuistini et al., 2009, De novo sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data, Genome Biology, 10:R94). Forge distributes its computational and memory consumption to multiple nodes, if available, and has therefore the potential to assemble large sets of reads. Forge was written in C++ using the parallel MPI library. Forge can handle mixtures of reads, e.g., Sanger, 454, and Illumina reads.

Assembly through multiple sequence alignment can be performed, for example, by the program Clustal Omega, (Sievers et al., 2011, Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega, Mol Syst Biol 7:539), ClustalW, or ClustalX (Larkin et al., 2007, Clustal W and Clustal X version 2.0, Bioinformatics, 23(21):2947-2948) available from University College Dublin (Dublin, Ireland).

Another exemplary read assembly program known in the art is Velvet, available through the web site of the European Bioinformatics Institute (Hinxton, UK) (Zerbino & Birney, 2008, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Genome Research 18(5):821-829). Velvet implements an approach based on de Bruijn graphs, uses information from read pairs, and implements various error correction steps.

Read assembly can be performed with the programs from the package SOAP, available through the website of Beijing Genomics Institute (Beijing, CN) or BGI Americas Corporation (Cambridge, Mass.). For example, the SOAPdenovo program implements a de Bruijn graph approach. SOAP3/GPU aligns short reads to a reference sequence.

Another read assembly program is ABySS, from Canada's Michael Smith Genome Sciences Centre (Vancouver, B.C., CA) (Simpson et al., 2009, ABySS: A parallel assembler for short read sequence data, Genome Res., 19(6):1117-23). ABySS uses the de Bruijn graph approach and runs in a parallel environment.

Read assembly can also be done by Roche's GS De Novo Assembler, known as gsAssembler or Newbler (NEW assemBLER), which is designed to assemble reads from the Roche 454 sequencer (described, e.g., in Kumar & Blaxter, 2010, Comparing de novo assemblers for 454 transcriptome data, BMC Genomics 11:571 and Margulies 2005). Newbler accepts 454 Flx Standard reads and 454 Titanium reads as well as single and paired-end reads and optionally Sanger reads. Newbler is run on Linux, in either 32 bit or 64 bit versions. Newbler can be accessed via a command-line or a Java-based GUI. Additional discussion of read assembly may be found in Li et al., 2009, The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics 25:2078; Lin et al., 2008, ZOOM! Zillions Of Oligos Mapped, Bioinformatics 24:2431; Li & Durbin, 2009, Fast and accurate short read alignment with Burrows-Wheeler Transform, Bioinformatics 25:1754; and Li, 2011, Improving SNP discovery by base alignment quality, Bioinformatics 27:1157. Assembled sequence reads may preferably be aligned to a reference.

Methods for alignment are known in the art and may make use of a computer program that performs alignment, such as Burrows-Wheeler Aligner.

In certain embodiments, reads are aligned to hg18 on a per-sample basis using Burrows-Wheeler Aligner version 0.5.7 for short alignments, and genotype calls are made using Genome Analysis Toolkit. See McKenna et al., 2010, The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res 20(9):1297-1303. High-confidence genotype calls may be defined as having depth ≧50 and strand bias score ≦0. Clinical significance of variant calls is an important question in carrier screening and will be addressed below. Other computer programs for assembling reads are known in the art. Such assembly programs can run on a single general-purpose computer, on a cluster or network of computers, or on specialized computing devices dedicated to sequence analysis.
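The depth and strand-bias thresholds given above for high-confidence genotype calls can be expressed as a simple predicate. The following Python sketch is illustrative only; the GenotypeCall record and its field names are hypothetical and do not correspond to the output format of any particular genotyping program.

```python
from dataclasses import dataclass

MIN_DEPTH = 50          # depth must be at least 50
MAX_STRAND_BIAS = 0.0   # strand bias score must be at most 0


@dataclass
class GenotypeCall:
    """Hypothetical container for one genotype call."""
    position: int
    genotype: str
    depth: int
    strand_bias: float


def is_high_confidence(call: GenotypeCall) -> bool:
    """Apply both thresholds; a call must satisfy each one."""
    return call.depth >= MIN_DEPTH and call.strand_bias <= MAX_STRAND_BIAS


# Made-up calls, for illustration only.
calls = [
    GenotypeCall(position=100, genotype="A/G", depth=72, strand_bias=-1.5),
    GenotypeCall(position=200, genotype="C/C", depth=31, strand_bias=-0.2),
    GenotypeCall(position=300, genotype="T/T", depth=90, strand_bias=0.4),
]
high_confidence = [c for c in calls if is_high_confidence(c)]
```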

In some embodiments, de-barcoded fastq files are obtained as described above and partitioned by capture region (exon) using the target arm sequence as a unique key. Reads are assembled in parallel by exon using SSAKE version 3.7 with parameters “-m 30 -o 15”. The resulting contiguous sequences (contigs) can be aligned to hg18 (e.g., using BWA version 0.5.7 for long alignments with parameter “-r 1”). In some embodiments, short-read alignment is performed as described above, except that sample contigs (rather than hg18) are used as the input reference sequence. Software may be developed in Java to accurately transfer coordinate and variant data (gaps) from local sample space to global reference space for every BAM-formatted alignment. Genotyping and base-quality recalibration may be performed on the coordinate-translated BAM files using the GATK program.

In some embodiments, any or all of the steps of the invention are automated. For example, a Perl script or shell script can be written to invoke any of the various programs discussed above (see, e.g., Tisdall, Mastering Perl for Bioinformatics, O'Reilly & Associates, Inc., Sebastopol, CA 2003; Michael, R., Mastering Unix Shell Scripting, Wiley Publishing, Inc., Indianapolis, Ind. 2003). Alternatively, methods of the invention may be embodied wholly or partially in one or more dedicated programs, for example, each optionally written in a compiled language such as C++ then compiled and distributed as a binary. Methods of the invention may be implemented wholly or in part as modules within, or by invoking functionality within, existing sequence analysis platforms. In certain embodiments, methods of the invention include a number of steps that are all invoked automatically responsive to a single starting cue (e.g., one or a combination of triggering events sourced from human activity, another computer program, or a machine). Thus, the invention provides methods in which any of the steps or any combination of the steps can occur automatically responsive to a cue. Automatically generally means without intervening human input, influence, or interaction (i.e., responsive only to original or pre-cue human activity).

Mapping 135 sequence reads to a reference, by whatever strategy, may produce output such as a text file or an XML file containing sequence data such as a sequence of the nucleic acid aligned to a sequence of the reference genome. In certain embodiments (e.g., see FIG. 1) mapping 135 reads to a reference produces results stored in SAM or BAM file 179 and such results may contain coordinates or a string describing one or more mutations in the subject nucleic acid relative to the reference genome. Alignment strings known in the art include Simple UnGapped Alignment Report (SUGAR), Verbose Useful Labeled Gapped Alignment Report (VULGAR), and Compact Idiosyncratic Gapped Alignment Report (CIGAR) (Ning, Z., et al., Genome Research 11(10):1725-9 (2001)). These strings are implemented, for example, in the Exonerate sequence alignment software from the European Bioinformatics Institute (Hinxton, UK).

In some embodiments, a sequence alignment is produced—such as, for example, a sequence alignment map (SAM) or binary alignment map (BAM) file—comprising a CIGAR string (the SAM format is described, e.g., in Li, et al., The Sequence Alignment/Map format and SAMtools, Bioinformatics, 2009, 25(16):2078-9). In some embodiments, CIGAR displays or includes gapped alignments one-per-line. CIGAR is a compressed pairwise alignment format reported as a CIGAR string. A CIGAR string is useful for representing long (e.g. genomic) pairwise alignments. A CIGAR string is used in SAM format to represent alignments of reads to a reference genome sequence.

A CIGAR string follows an established motif. Each character is preceded by a number, giving the base counts of the event. Characters used can include M, I, D, N, and S (M=match; I=insertion; D=deletion; N=gap; S=substitution). The CIGAR string defines the sequence of matches/mismatches and deletions (or gaps). For example, the CIGAR string 2MD3M2D2M will mean that the alignment contains 2 matches, 1 deletion (number 1 is omitted in order to save some space), 3 matches, 2 deletions and 2 matches. In general, for carrier screening or other assays such as the NGS workflow depicted in FIG. 1, sequencing results will be used in genotyping 141.
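The reading of the example string above can be reproduced with a short tokenizer. The following Python sketch is illustrative only; it follows the convention used in the example, in which an omitted count is taken to be 1, and it accepts only the characters listed above. The function name parse_cigar is hypothetical.

```python
import re
from typing import List, Tuple

# An optional count followed by one of the event characters listed in the text.
CIGAR_TOKEN = re.compile(r"(\d*)([MIDNS])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (count, event) pairs."""
    events: List[Tuple[int, str]] = []
    position = 0
    for match in CIGAR_TOKEN.finditer(cigar):
        if match.start() != position:
            raise ValueError(f"unexpected text at offset {position} in {cigar!r}")
        # Per the example in the text, a missing count means 1.
        count = int(match.group(1)) if match.group(1) else 1
        events.append((count, match.group(2)))
        position = match.end()
    if position != len(cigar):
        raise ValueError(f"unexpected text at offset {position} in {cigar!r}")
    return events


events = parse_cigar("2MD3M2D2M")
deleted_bases = sum(count for count, event in events if event == "D")
```

For the example string, the sketch yields 2 matches, 1 deletion, 3 matches, 2 deletions, and 2 matches, for 3 deleted bases in total.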

Output from mapping 135 may be stored in a SAM or BAM file 179, in a variant call format (VCF) file 183, or other format. In an illustrative embodiment, output is stored in a VCF file, although methods described herein are applicable to other file formats such as SAM or BAM files, as will be readily apparent to one of skill in the art.

FIG. 2 gives a sample of an exemplary VCF file 183. A typical VCF file 183 will include a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters ‘##’, and a TAB delimited field definition line starting with a single ‘#’ character. The field definition line names eight mandatory columns and the body section contains lines of data populating the columns defined by the field definition line. The VCF format is described in Danecek et al., 2011, The variant call format and VCFtools, Bioinformatics 27(15):2156-2158.
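The header and data sections described above can be separated with a few lines of code. The following Python sketch is illustrative only; the eight column names are those of the VCF format described by Danecek, the function name parse_vcf is hypothetical, and the single example record is made up and is not taken from FIG. 2.

```python
from typing import Dict, List, Tuple

MANDATORY_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(text: str) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Return (meta-information lines, column names, data records)."""
    meta: List[str] = []
    columns: List[str] = []
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            # Meta-information line.
            meta.append(line[2:])
        elif line.startswith("#"):
            # TAB-delimited field definition line.
            columns = line[1:].split("\t")
            if columns[:8] != MANDATORY_COLUMNS:
                raise ValueError("field definition line lacks the mandatory columns")
        else:
            if not columns:
                raise ValueError("data line found before the field definition line")
            # Each data line populates the columns named by the definition line.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Made-up example content, for illustration only.
example_vcf = "\n".join([
    "##fileformat=VCFv4.1",
    "#" + "\t".join(MANDATORY_COLUMNS),
    "\t".join(["1", "12345", ".", "A", "G", "50", "PASS", "DP=60"]),
])
meta, columns, records = parse_vcf(example_vcf)
```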

The data contained in a VCF file 183, as shown for example in FIG. 2, represent the variants, or mutations, that are found in the nucleic acid that was obtained from the sample from the patient and sequenced. In its original sense, mutation refers to a change in genetic information, and the term has come to refer also to the present genotype that results from such a change. As is known in the art, mutations include different types such as substitutions, insertions or deletions (INDELs), translocations, inversions, chromosomal abnormalities, and others. By convention in some contexts where two or more versions of genetic information or alleles are known, the one thought to have the predominant frequency in the population is denoted the wild type and the other(s) are referred to as mutation(s). In general, an absolute allele frequency is not determined (i.e., not every human on the planet is genotyped); instead, allele frequency refers to a calculated probable allele frequency based on sampling and known statistical methods, and an allele frequency is often reported in terms of a certain population such as humans of a certain ethnicity. Variant can be taken to be roughly synonymous with mutation, but refers to a genotype described in comparison with a reference genotype or genome. For example, as used in bioinformatics, variant describes a genotype feature in comparison to a reference such as the human genome (e.g., hg18 or hg19, which may be taken as a wild type). An NGS workflow with genotyping 141 generates data representing one or more mutations in a genome of an individual that are generally reported as variants, or “variant calls”, in, for example, a VCF file 183.

With continuing reference to FIG. 2, a VCF file 183 includes data representing one or more mutations. Those data may be analyzed by methods of the invention to provide a report of the clinical significance of the mutations in the genome of the individual.

FIG. 3 diagrams a method 301 for analyzing mutations according to the invention. One benefit of a method 301 is an ability to provide information about the clinical significance of mutations in a patient's genome from data such as that provided by sequencing, e.g., in FASTA/FASTQ files, SAM/BAM files, or VCF files. Methods include obtaining 305 data representing a mutation in a genome of an individual by, for example, the sampling, sequencing, and mapping methods described above. A variant node in a graph database is used 311 to store a description of the mutation. A pointer is stored 317 in the variant node and the pointer points to an adjacent node that provides information about a clinical significance of the variant. Method 301 includes querying 323 the graph database to obtain information reporting the clinical significance of the mutation in the genome of the individual.
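The storing 311, pointer 317, and querying 323 steps of method 301 can be sketched in miniature. The following Python sketch is illustrative only and is not an implementation of any particular graph database; the class names, node labels, property keys, and placeholder strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    properties: Dict[str, str]
    # Direct references ("pointers") to adjacent nodes.
    adjacent: List["Node"] = field(default_factory=list)


class TinyGraph:
    """Hypothetical in-memory stand-in for a graph database."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def add_node(self, label: str, **properties: str) -> Node:
        node = Node(label, dict(properties))
        self.nodes.append(node)
        return node

    def clinical_significance(self, description: str) -> List[str]:
        """Query: follow pointers from the matching variant node."""
        reports: List[str] = []
        for node in self.nodes:
            if node.label == "Variant" and node.properties.get("description") == description:
                for neighbor in node.adjacent:
                    if neighbor.label == "ClinicalSignificance":
                        reports.append(neighbor.properties["summary"])
        return reports


db = TinyGraph()
# Use a variant node to store a description of the mutation.
variant = db.add_node("Variant", description="placeholder variant X")
# Store a pointer in the variant node to an adjacent significance node.
significance = db.add_node("ClinicalSignificance", summary="placeholder significance text")
variant.adjacent.append(significance)
# Query the graph for the clinical significance of the mutation.
report = db.clinical_significance("placeholder variant X")
```

The variant node is located here by a simple scan for brevity; the point of the sketch is that, once the node is found, the significance is reached by following a stored reference and not by a second lookup.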

To illustrate operation of the invention, the following discusses obtaining mutation data in a VCF file, although one of skill in the art will readily see that the discussion is extensible to other formats. Using a workflow such as the NGS workflow illustrated in FIG. 1, a VCF file containing mutation data is obtained 305. The VCF file may be parsed to isolate its component pieces of information and to consider each piece of information for its own significance. There exist programs or application programming interfaces (APIs) for parsing VCF files 183 or a program may be written that parses data from the VCF file.

FIG. 4 gives a flow chart for a VCF parser. The flow chart shown in FIG. 4 represents the conceptual steps that may go into parsing a VCF file and extracting component information. Since the various action blocks and loops are defined according to the format of the VCF file as standardized (e.g., in Danecek, 2011, Bioinformatics 27:2156), each character of information that is extracted is treated for what it is. Thus, using VCF file 183 from FIG. 2 for reference, the “A” that appears on line 16, character 7 (counting 1 tab as 1 character) is treated as a nucleotide in the reference and the “A” that appears in line 17, character 17 is simply part of the word “PASS” in the FILTER column. It is further recognized that line 16 (and any subsequent line) is a single VCF record within a VCF file. Each record from the VCF file represents something found by sequencing the nucleic acid from the sample from the patient. Each patient, having numerous genes in their genome, has numerous alleles. Thus where carrier screening is performed for a patient, the VCF run (e.g., all the VCF files produced by the NGS sequencing) ultimately documents and shows the various alleles in the patient's genome that were probed for by the probes used.

FIG. 5 presents a model of data received from parsing a VCF. As just discussed, one run from the sequencing instruments can produce a plurality of VCF files. Each VCF file typically contains a plurality of VCF records. Those records ultimately relate back to the samples from which they were derived, and the samples can each contain a plurality of alleles. The relationships just described can also be captured in an entity relationship diagram, or ERD.

FIG. 6 shows an entity relationship diagram (ERD) 601 of the data modeled by FIG. 5. An insight of the invention is that the ERD 601 satisfies the definition of a graph as used in graph theory within mathematics and computer science. Graph theory provides a well-known mathematical tool for representing systems. Graph theory is the mathematical study of properties of formal mathematical structures called graphs. In that context, a graph is a finite set of points, termed vertices or nodes, connected by links termed edges or arcs. A graph thus generally defines a set of vertices and a set of pairs of vertices, which are the edges of the graph. There are several types of graphs in graph theory. The type of a particular graph largely depends upon the features of its components, namely the attributes of its vertices and edges. For example, when the set of pairs includes only distinct elements, the graph is called a simple graph; when one or more pairs are connected by multiple edges the graph is called a multi-graph; when one or more vertices are connected to themselves the graph is called a pseudo-graph; when the edges are assigned with directions the graph is called a directed graph or a digraph; and when the pairs of vertices are unordered the graph is called undirected. Additional illustrative background on graph theory may be found in U.S. Pat. No. 8,463,895 to Arora; U.S. Pat. No. 8,462,161 to Barber; U.S. Pat. No. 7,523,117 to Zhang; U.S. Pat. No. 6,360,235 to Tilt; U.S. Pub. 2013/0222388 to McDonald; and U.S. Pub. 2007/0244675 to Shai, the contents of each of which are incorporated by reference.
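The graph types named above can be distinguished mechanically from a list of vertex pairs. The following Python sketch is illustrative only; the function name classify_graph and the vertex names are hypothetical, and a graph with self-connected vertices is reported as a pseudo-graph regardless of whether it also has multiple edges.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def classify_graph(edges: Iterable[Tuple[Hashable, Hashable]], directed: bool = False) -> str:
    """Name the graph type implied by a list of vertex pairs."""
    edge_list = list(edges)
    if any(u == v for u, v in edge_list):
        # At least one vertex is connected to itself.
        kind = "pseudo-graph"
    else:
        # Ordered pairs for directed graphs, unordered pairs otherwise.
        keys = [(u, v) if directed else frozenset((u, v)) for u, v in edge_list]
        repeated = any(count > 1 for count in Counter(keys).values())
        kind = "multi-graph" if repeated else "simple graph"
    return ("directed " if directed else "undirected ") + kind


simple = classify_graph([("a", "b"), ("b", "c")])
multi = classify_graph([("a", "b"), ("b", "a")])
digraph = classify_graph([("a", "b"), ("b", "a")], directed=True)
pseudo = classify_graph([("a", "a"), ("a", "b")])
```

Note that the same two pairs form a multi-graph when unordered but a simple directed graph when ordered, since the two edges then run in opposite directions.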

It can be observed that ERD 601 presents a graph—a collection of vertices and edges—or another description would be a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships. This general-purpose, expressive structure allows graphs to model all kinds of phenomena such as NGS sequence files and their relationships to the source biological samples and genetic concepts like certain alleles. There are various dominant graph data models such as the property graph, Resource Description Framework (RDF) triples, and hypergraphs. In certain embodiments, a graph database used in the invention uses the property graph model.

A property graph has characteristics such as containing nodes and relationships (which are illustrated by ERD 601 in FIG. 6). The nodes contain properties (key-value pairs). Relationships are named and directed, and have a start and end node; and relationships can also contain properties. A graph database management system (henceforth, a graph database) is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases according to the invention may be described or characterized according to the underlying storage, the processing engine, or both.
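The property graph characteristics and CRUD methods just described can be sketched as a small in-memory structure. The following Python sketch is illustrative only; the class names, the relationship name HAS_ALLELE, and the property keys and values are hypothetical and are not drawn from any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)  # key-value pairs


@dataclass
class Relationship:
    name: str     # relationships are named
    start: Node   # and directed, from a start node
    end: Node     # to an end node
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.relationships: List[Relationship] = []
        self._next_id = 0

    def create_node(self, **properties: Any) -> Node:
        node = Node(self._next_id, dict(properties))
        self.nodes[node.node_id] = node
        self._next_id += 1
        return node

    def create_relationship(self, name: str, start: Node, end: Node,
                            **properties: Any) -> Relationship:
        relationship = Relationship(name, start, end, dict(properties))
        self.relationships.append(relationship)
        return relationship

    def read_node(self, node_id: int) -> Node:
        return self.nodes[node_id]

    def update_node(self, node_id: int, **properties: Any) -> None:
        self.nodes[node_id].properties.update(properties)

    def delete_node(self, node_id: int) -> None:
        node = self.nodes.pop(node_id)
        # Remove relationships that start or end at the deleted node.
        self.relationships = [r for r in self.relationships
                              if r.start is not node and r.end is not node]


graph = PropertyGraph()
sample = graph.create_node(kind="Sample", sample_id="S1")
allele = graph.create_node(kind="Allele", gene="hypothetical gene")
rel = graph.create_relationship("HAS_ALLELE", sample, allele, zygosity="heterozygous")
graph.update_node(allele.node_id, reviewed=True)
```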

Regarding the underlying storage, some graph databases use native graph storage that is optimized and designed for storing and managing graphs. Some databases serialize the graph data into a relational database, an object-oriented database, or some other general-purpose data store and present graph database functionality on top of that.

Regarding the processing engine, some graph databases use index-free adjacency, meaning that connected nodes physically “point” to each other in the database. More broadly, any database that from the user's perspective behaves like a graph database (i.e., exposes a graph data model through CRUD operations) qualifies as a graph database. In certain embodiments, however, the invention provides the significant performance advantages of index-free adjacency. Native graph processing may describe graph databases that use index-free adjacency.

A benefit of native graph storage is that it is engineered for performance and scalability. A benefit of non-native graph storage is that it typically depends on a mature non-graph backend (such as MySQL) whose production characteristics are well understood by operations teams. Native graph processing (index-free adjacency) benefits traversal performance.

In the graph data model, relationships are included as entities that themselves are stored as objects, whereas other database management systems require connections between entities to be inferred using contrived properties such as foreign keys, or out-of-band processing like map-reduce. By assembling the simple abstractions of nodes and relationships into connected structures, graph databases provide arbitrarily sophisticated models that map closely to the problem domain (e.g., FIG. 5). The resulting models are simpler and at the same time more expressive than those produced using traditional relational databases and the other NoSQL stores.

Any suitable graph database can be used to implement the systems and methods described herein. Exemplary graph databases may include InfiniteGraph, Titan, OrientDB, Neo4j, DEX, AllegroGraph from Franz Inc., and HyperGraphDB. Preferably, systems and methods of the invention employ a graph compute engine.

A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets. Graph compute engines are designed to do things like identify clusters in the data, or answer questions about how entities are connected, and particularly to trace across a series of linked ideas (e.g., SNP to allele to genetic condition to a literature reference providing a clinical significance of the allele containing the SNP).

A variety of different types of graph compute engines exist. Most notably there are in-memory/single machine graph compute engines like Cassovary, and distributed graph compute engines like Pegasus or Giraph. A distributed graph compute engine may be structured as described in Malewicz, et al., 2010, Pregel: a system for large-scale graph processing, Proceedings ACM SIGMOD Int Conf Management Data 135-146. Also see Rodriguez and Neubauer, 2010, Constructions from Dots and Lines, Bulletin Am Soc Inf Sci Tech 36(6):35-41.

In preferred embodiments, systems and methods of the invention store mutation descriptions using a graph database and analyze mutations in graph space.

To achieve the benefits potentially offered by using a graph database, a genetic analysis pipeline and methodology according to the invention uses nodes as well as named and directed relationships, with both the nodes and relationships serving as containers for properties. With continuing reference to FIG. 6, nodes and relationships are illustrated and index-free adjacency is discussed.

A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. Each node thus acts as a micro-index of other nearby nodes, which is much cheaper than using global indexes. It means that query times are independent of the total size of the graph, and are instead simply proportional to the amount of the graph searched.

A non-native graph database engine, in contrast, uses (global) indexes to link nodes together. These indexes add a layer of indirection to each traversal, thereby incurring greater computational cost. Proponents of native graph processing argue that index-free adjacency is crucial for fast, efficient graph traversals. To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.
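The cost comparison above can be made concrete by counting operations for the same traversal performed both ways. The following Python sketch is illustrative only; it assumes a sorted global index searched by binary search as one possible O(log n) index, and the ring of 1,024 nodes and all names are hypothetical.

```python
from typing import List, Optional, Tuple


def binary_search(sorted_keys: List[int], target: int) -> Tuple[int, int]:
    """Return (position, number of comparisons) for target in a sorted index."""
    lo, hi, comparisons = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_keys[mid] == target:
            return mid, comparisons
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    raise KeyError(target)


def traverse_with_index(sorted_keys: List[int], next_key: List[int],
                        start: int, steps: int) -> Tuple[int, int]:
    """Follow `steps` hops, resolving every hop through the global index."""
    current, total = start, 0
    for _ in range(steps):
        position, comparisons = binary_search(sorted_keys, current)
        total += comparisons
        current = next_key[position]
    return current, total


class Node:
    def __init__(self, key: int) -> None:
        self.key = key
        self.next: Optional["Node"] = None  # direct reference to the adjacent node


def traverse_index_free(start: Node, steps: int) -> Tuple[Node, int]:
    """Follow `steps` hops by dereferencing stored references."""
    current, dereferences = start, 0
    for _ in range(steps):
        current = current.next
        dereferences += 1
    return current, dereferences


n, m = 1024, 5
keys = list(range(n))
next_key = [(k + 1) % n for k in keys]          # each node points to the next key
indexed_end, indexed_cost = traverse_with_index(keys, next_key, 0, m)

nodes = [Node(k) for k in keys]
for k in keys:
    nodes[k].next = nodes[(k + 1) % n]
direct_end, direct_cost = traverse_index_free(nodes[0], m)
```

Both traversals arrive at the same node, but the indexed version spends up to 11 comparisons per hop for n = 1,024, while the index-free version spends one dereference per hop regardless of n.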

Index-free adjacency provides lower-cost “joins.” With index-free adjacency, bidirectional joins are effectively pre-computed and stored in the database as relationships. In contrast, when using indexes to fake connections between records, there is no actual relationship stored in the database. This becomes problematic for traversals in the “opposite” direction from the one for which the index was constructed. Such traversals require a brute-force search through the index, which is an O(n) operation, and joins like this are simply too costly to be of any practical use. Index-free adjacency provides surprising benefits in the context of reporting clinical significance of the results of NGS-based carrier screening, in that the concepts involved naturally lend themselves to representation using the pre-computed bidirectional joins that index-free adjacency offers.

For at least these reasons, systems and methods of certain embodiments of the invention use index-free adjacency to ensure high-performance traversals. FIG. 6 shows how relationships eliminate the need for index lookups. A graph database can use relationships, not indexes, for fast traversals.

In a general-purpose graph database, relationships can be traversed in either direction (tail to head, or head to tail) extremely cheaply. Starting from a given VcfRun or a given allele, a graph processing engine can find the other, related one of those two at a very low computational cost.

In certain embodiments, systems and methods of the invention use native graph storage. If index-free adjacency is the key to high-performance traversals, queries, and writes, then one key aspect of the design of a graph database is the way in which graphs are stored. An efficient, native graph storage format supports extremely rapid traversals for arbitrary graph algorithms, which is an important reason for using graphs.

A graph database such as Neo4j stores graph data in a number of different store files. Each store file may contain the data for a specific part of the graph (e.g., nodes, relationships, properties). The division of storage responsibilities (particularly the separation of graph structure from property data) facilitates performant graph traversals, even though it means the user's view of their graph and the actual records on disk are structurally dissimilar. FIGS. 7-10 illustrate a node and relationship storage structure as implemented by a graph database of the invention.

FIG. 7 diagrams a high-level architecture 701 of systems of certain embodiments of the invention. From the bottom-up, systems may operate using files on disk 733. Record files 739 provide a basic level of storage to support the file system cache 741. The object cache 747 is kept at a high level for rapid access as discussed herein. Additionally, the disks 733 can store a transaction log 725, which is written to by a transaction management module 721. A graph database such as Neo4j includes or provides a traversal API 755, core API 705, and a query language 713 such as Cypher.

FIG. 8 illustrates the structure of nodes 801 and relationships 809 on disk as may be deployed within a physical structure of systems of the invention. The node store file stores node records. Every node created in the user-level graph ends up in the node store. Preferably, the node store is a fixed-size record store. While the precise values or traits may be varied as necessary or best-suited to the invention, in the illustrated embodiment, each node record 801 is nine bytes in length. Fixed-size records enable fast lookups for nodes in the store file. To illustrate, if a node has id 100, then it can be known that its record begins 900 bytes into the file. Based on this format, the database can directly compute a record's location, at cost O(1), rather than performing a search, which would be cost O(log n). It is noted that fixed-size record stores provide an improvement to a computer in the sense that information storage efficiently exploits the physical storage device for very fast retrieval and very fast look-ups. Thus, genetic queries according to methods and systems of the invention actually proceed faster at a hardware level than prior art approaches; the computer itself is sped up by the implementations described.

The first byte of a node 801 record is the in-use flag. This tells the database whether the record is currently being used to store a node. The next four bytes represent the ID of the first relationship connected to the node, and the last four bytes represent the ID of the first property for the node. The node record is lightweight and contains just pointers to lists of relationships and properties.
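The nine-byte node record and its offset arithmetic can be sketched with Python's struct module. This is a minimal illustration, not the on-disk format of any particular product: the little-endian byte order and the helper names (NODE_RECORD, node_offset, write_node, read_node) are assumptions made for the example.

```python
import struct

# Layout taken from the description: 1-byte in-use flag, 4-byte ID of the
# first relationship, 4-byte ID of the first property.
# The little-endian byte order ("<") is an assumption for this sketch.
NODE_RECORD = struct.Struct("<BII")  # 1 + 4 + 4 = 9 bytes

def node_offset(node_id):
    """Byte offset of a node record: ID times the fixed record size, O(1)."""
    return node_id * NODE_RECORD.size

def write_node(store, node_id, in_use, first_rel, first_prop):
    """Write one node record in place at its computed offset."""
    NODE_RECORD.pack_into(store, node_offset(node_id),
                          int(in_use), first_rel, first_prop)

def read_node(store, node_id):
    """Read one node record directly from its computed offset (no search)."""
    in_use, first_rel, first_prop = NODE_RECORD.unpack_from(
        store, node_offset(node_id))
    return {"in_use": bool(in_use),
            "first_rel": first_rel,
            "first_prop": first_prop}

# A hypothetical in-memory stand-in for the node store file (room for 200 nodes).
store = bytearray(NODE_RECORD.size * 200)
write_node(store, 100, True, first_rel=42, first_prop=7)
```

Here the record for node 100 starts at byte 900, matching the arithmetic given above, and reading it back requires no index.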

Correspondingly, relationships are stored in a relationship store file. Like the node store, the relationship store consists of fixed-sized records; in this case each relationship record 809 is 33 bytes long. Each relationship record 809 contains the IDs of the nodes at the start and end of the relationship, a pointer to the relationship type (which is stored in the relationship type store), and pointers for the next and previous relationship records for each of the start and end nodes. These last pointers are part of what is often called the relationship chain.

The node and relationship stores are concerned only with the structure of the graph, not its property data. Both stores use fixed-sized records so that any individual record's location within a store file can be rapidly computed given its ID. The significance can hardly be overstated: the described structure improves the operation of the hardware itself.

Using the described structures, given the way that the various store files are stored on disk, graph processing operations are low-cost. Each of the node records contains a pointer to that node's first property and first relationship in a relationship chain. To read a node's properties, one may follow the singly linked list structure beginning with the pointer to the first property. To find a relationship for a node, one may follow that node's relationship pointer to its first relationship and then follow the doubly linked list of relationships for that particular node (that is, either the start node doubly linked list, or the end node doubly linked list) until the relationship of interest is found.

Having found the record for the relationship of interest, that relationship's properties can be read (if there are any) using the same singly linked list structure as is used for node properties, or the node records can be examined for the two nodes the relationship connects using its start node and end node IDs. These IDs, multiplied by the node record size, give the immediate offset of each node in the node store file.

In some embodiments, systems and methods of the invention use doubly-linked lists in the relationship store. It is noted that a relationship record 809 can be thought of as “belonging” to two nodes: the start node and the end node of the relationship. To avoid storing two relationship records and to make the relationship record belong to both the start node and the end node, there are pointers (aka record IDs) for two doubly linked lists: one is the list of relationships visible from the start node; the other is the list of relationships visible from the end node. This provides rapid iteration through either list in either direction, and efficient insertion or deletion of relationships.

Choosing to follow a different relationship involves iterating through a linked list of relationships until a candidate matching the correct type or having some matching property value is found. The found relationship gives a new ID. The new ID is multiplied by record size as a new pointer and the traversal continues. With fixed-sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as would be required if faking a graph in a non-graph native database). First, from a given node record, the first record in the relationship chain is located by computing its offset into the relationship store—that is, by multiplying its ID by the fixed relationship record size (e.g., 33 bytes). This gets to the right record in the relationship store. Then, from the relationship record, look in the second node field to find the ID of the second node. Multiply that ID by the node record size (e.g., nine bytes) to locate the correct node record in the store.
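The pointer-chasing traversal described above can be sketched against two in-memory byte arrays standing in for the store files. Several details are assumptions made only so the example is concrete: the little-endian byte order, the in-use flag and property pointer in the relationship record (which bring it to 33 bytes), the NO_ID sentinel, and the numeric type ID for HAS_VARIANT. For brevity the sketch leaves the "previous" pointers unset and does not handle a relationship whose start and end are the same node.

```python
import struct

NODE = struct.Struct("<BII")   # in-use, first relationship ID, first property ID (9 bytes)
REL = struct.Struct("<B8I")    # in-use, start, end, type, start prev/next,
                               # end prev/next, first property (33 bytes, assumed layout)
NO_ID = 0xFFFFFFFF             # assumed sentinel meaning "no record"

node_store = bytearray(NODE.size * 3)
rel_store = bytearray(REL.size * 2)

def put_node(node_id, first_rel):
    NODE.pack_into(node_store, node_id * NODE.size, 1, first_rel, NO_ID)

def put_rel(rel_id, start, end, rel_type, start_next=NO_ID, end_next=NO_ID):
    # "Previous" pointers and the property pointer are left unset in this sketch.
    REL.pack_into(rel_store, rel_id * REL.size, 1, start, end, rel_type,
                  NO_ID, start_next, NO_ID, end_next, NO_ID)

def relationships_of(node_id):
    """Yield (rel_id, start, end, type) by chasing the node's relationship chain."""
    _, rel_id, _ = NODE.unpack_from(node_store, node_id * NODE.size)
    while rel_id != NO_ID:
        # Offset = relationship ID times the fixed relationship record size.
        (_, start, end, rel_type, _sp, start_next,
         _ep, end_next, _prop) = REL.unpack_from(rel_store, rel_id * REL.size)
        yield rel_id, start, end, rel_type
        # Follow the chain that belongs to this node (start-node or end-node list).
        rel_id = start_next if start == node_id else end_next

def neighbors(node_id, wanted_type):
    """Nodes reachable from node_id over relationships of the wanted type."""
    for _, start, end, rel_type in relationships_of(node_id):
        if rel_type == wanted_type:
            yield end if start == node_id else start

HAS_VARIANT = 1  # hypothetical relationship type ID

# Node 0 (an allele) HAS_VARIANT node 1 and HAS_VARIANT node 2.
put_rel(0, start=0, end=1, rel_type=HAS_VARIANT, start_next=1)
put_rel(1, start=0, end=2, rel_type=HAS_VARIANT)
put_node(0, first_rel=0)
put_node(1, first_rel=0)
put_node(2, first_rel=1)
```

Because each relationship record sits in both its start node's chain and its end node's chain, the same records serve a traversal from the allele to its variants and from a variant back to the allele.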

In addition to the node and relationship stores, which contain the graph structure, systems include the property store files. These store the user's key-value pairs. Properties may be attached to both nodes and relationships. The property stores, therefore, are referenced from both node and relationship records. Records in the property store are physically stored in a file. As with the node and relationship stores, property records are of a fixed size. Each property record consists of four property blocks and the ID of the next property in the property chain. Properties are held as a singly linked list on disk as compared to the doubly linked list used in relationship chains. Each property occupies between one and four property blocks, so a property record can hold at most four properties. A property record holds the property type and a pointer to the property index file, which is where the property name is stored. For each property's value, the record contains either a pointer into a dynamic store record or an inlined value. The dynamic stores allow for storing large property values. A graph database may optimize storage by inlining some properties into the property store file directly. This happens when property data can be encoded to fit in one or more of a record's four property blocks. In practice this means that data like variant calls can be inlined in the property store file directly, rather than being pushed out to the dynamic stores. This results in reduced I/O operations and improved throughput, because only a single file access is required.

In addition to in-lining certain compatible property values, a graph database can also reference long values as property names (e.g., complete journal article titles and citations). In such cases, property names are indirectly referenced from the property store through the property index file. The property index allows all properties with the same name to share a single record, and thus for repetitive graphs achieves considerable space and I/O savings.

To improve the performance characteristics of mechanical/electronic mass storage devices, many graph databases use in-memory caching to provide probabilistic low latency access to the graph. Neo4j uses a two-tiered caching architecture to provide this functionality.

The lowest tier in the Neo4j caching stack is the file system cache 741. The file system cache 741 is a page-affined cache, meaning the cache divides each store into discrete regions, and then holds a fixed number of regions per store file. The actual amount of memory to be used to cache the pages for each store file can be fine-tuned, though in the absence of input from the user, Neo4j will use sensible default values based on the capacity of the underlying hardware. Pages are evicted from the cache based on a least-frequently-used (LFU) cache policy.
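A page cache with least-frequently-used eviction can be sketched as follows. This is a toy model intended only to illustrate the policy named above; it is not Neo4j's implementation, and the class name LFUPageCache, the load_page callback, and the page size are hypothetical choices for the example.

```python
from collections import Counter

class LFUPageCache:
    """Toy page cache: holds at most `capacity` pages of a store and evicts
    the least frequently used page when a new page must be loaded."""

    def __init__(self, capacity, page_size, load_page):
        self.capacity = capacity
        self.page_size = page_size
        self.load_page = load_page   # callable: page number -> bytes (hypothetical)
        self.pages = {}              # page number -> cached page contents
        self.hits = Counter()        # page number -> access count

    def read(self, offset):
        """Return the byte at `offset`, loading its page on a cache miss."""
        page_no = offset // self.page_size
        if page_no not in self.pages:
            if len(self.pages) >= self.capacity:
                # Evict the cached page with the fewest recorded accesses.
                victim = min(self.pages, key=lambda p: self.hits[p])
                del self.pages[victim]
                del self.hits[victim]
            self.pages[page_no] = self.load_page(page_no)
        self.hits[page_no] += 1
        return self.pages[page_no][offset % self.page_size]

# Hypothetical backing "store file" of 1024 bytes, divided into 256-byte pages.
backing = bytes(range(256)) * 4
loads = []  # records which pages were fetched from the backing store

def load_page(page_no):
    loads.append(page_no)
    return backing[page_no * 256:(page_no + 1) * 256]

cache = LFUPageCache(capacity=2, page_size=256, load_page=load_page)
first = [cache.read(0), cache.read(1), cache.read(300), cache.read(600)]
again = cache.read(5)  # page 0 was used most often, so it is still cached
```

In this run page 1 is evicted to make room for page 2, because page 0 had been read twice and page 1 only once.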

The file system cache 741 is particularly beneficial when related parts of the graph are modified at the same time such that they occupy the same page. This is a common pattern for writes, where whole sub-graphs (such as a patient's NGS results and associated carrier screening report) are written to disk in a single operation, rather than discrete nodes and relationships.

A graph database may be manipulated through a query language, which can be either imperative or declarative. One such language is the Cypher query language. Cypher is a declarative graph query language for Neo4j that allows for expressive and efficient querying and updating of the graph store. Cypher contains a variety of clauses, some of the most common of which include MATCH and WHERE. These clauses differ slightly from their SQL counterparts. MATCH is used for describing the structure of the pattern searched for, primarily based on relationships, and WHERE is used to add additional constraints to patterns. Cypher additionally contains clauses for writing, updating, and deleting data. CREATE and DELETE are used to create and delete nodes and relationships. SET and REMOVE are used to set and remove property values and labels on nodes.

Systems and methods of the invention provide very rapid transactions, idiomatic queries, and an excellent ability to “scale up” with very large data sizes. The topic of scale has become more important as data volumes have grown. Graph databases don't suffer the same latency problems as traditional relational databases, where the more data that exists in tables—and in indexes—the longer the join operations. With a graph database, most queries follow a pattern whereby an index is used simply to find a starting node (or nodes). The remainder of the traversal then uses a combination of pointer chasing and pattern matching to search the data store. What this means is that, unlike relational databases, performance does not depend on the total size of the dataset, but only on the data being queried. This leads to performance times that are nearly constant (i.e., are related to the size of the result set), even as the size of the dataset grows. Throughput, speed, and scalability of graph databases make them suited to genetic analysis and reporting. Given the input/output-intensive nature of such sequencing, variant-calling, genotyping, and clinical reporting, a typical operation reads and writes a set of related data. In other words, the application performs multiple operations on a logical sub-graph within the overall dataset. With a graph database such multiple operations can be rolled up into larger, more cohesive operations. Further, with a graph-native store, executing each operation takes less computational effort than the equivalent relational operation. Graphs scale by doing less work for the same outcome.

FIG. 9 illustrates the use of a variant node 901 in a graph database to store a description of a mutation. The first byte of the variant node 901 record is set to show that node 901 is in use. The next four bytes of node 901 represent the ID of the first relationship connected to the node. Through the ID of that first relationship, node 901 thus includes a pointer to an adjacent node (adjacent by definition, since the relationship is identified by the four bytes in node 901). The last four bytes of node 901 represent the ID of the first property for the node.

To read the first property for node 901, one may follow the singly linked list structure to the appropriate property record in the property store. Property records in the property store are of a fixed size and each property record consists of four property blocks and the ID of the next property in the chain. The property record holds the property type (here, “variant”) and a pointer to the property index file, which is where the property name is stored. For each property's value, the record either points to a dynamic store or an inline record. Here, the parser operating via the logic mapped in FIG. 4 produces a record of a mutation (by parsing that record from the VCF file) and can store that mutation in the property index file. Thus the property index file for a variant node preferably includes a description of a mutation.

A description of a mutation may be provided according to a systematic nomenclature. For example, a variant can be described by a systematic comparison to a specified reference which is assumed to be unchanging and identified by a unique label such as a name or accession number. For a given gene, coding region, or open reading frame, the A of the ATG start codon is denoted nucleotide +1 and the nucleotide 5′ to +1 is −1 (there is no zero). A lowercase g, c, or m prefix, set off by a period, indicates genomic DNA, cDNA, or mitochondrial DNA, respectively.

A systematic name can be used to describe a number of variant types including, for example, substitutions, deletions, insertions, and variable copy numbers. A substitution name starts with a number followed by a “from to” markup. Thus, 199A>G shows that at position 199 of the reference sequence, A is replaced by a G. A deletion is shown by “del” after the number. Thus 223delT shows the deletion of T at nt 223 and 997-999del shows the deletion of three nucleotides (alternatively, this mutation can be denoted as 997-999delTTC). In short tandem repeats, the 3′ nt is arbitrarily assigned; e.g. a TG deletion is designated 1997-1998delTG or 1997-1998del (where 1997 is the first T before C). Insertions are shown by ins after an interval. Thus 200-201insT denotes that T was inserted between nts 200 and 201. Variable short repeats appear as 997(GT)N−N′. Here, 997 is the first nucleotide of the dinucleotide GT, which is repeated N to N′ times in the population.
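A small parser for the simpler name forms above can be sketched with a regular expression. It is a hedged illustration, not a complete nomenclature parser: it handles only substitutions, deletions, and insertions with an optional g., c., or m. prefix, it does not cover intronic or variable-repeat names, and the function name parse_variant and the returned field names are hypothetical.

```python
import re

# Optional prefix (g., c., or m.), a position or interval, then either a
# "from>to" substitution or a del/ins keyword with optional bases.
PATTERN = re.compile(
    r"^(?:(?P<prefix>[gcm])\.)?"
    r"(?P<start>\d+)(?:-(?P<end>\d+))?"
    r"(?:(?P<ref>[ACGT])>(?P<alt>[ACGT])"
    r"|(?P<kind>del|ins)(?P<bases>[ACGT]*))$"
)

def parse_variant(name):
    """Parse a simple systematic variant name into a dictionary of fields."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unsupported variant name: {name!r}")
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    if m["ref"]:
        return {"prefix": m["prefix"], "type": "substitution",
                "start": start, "end": end,
                "ref": m["ref"], "alt": m["alt"]}
    kind = "deletion" if m["kind"] == "del" else "insertion"
    return {"prefix": m["prefix"], "type": kind,
            "start": start, "end": end,
            "bases": m["bases"] or None}

examples = {name: parse_variant(name)
            for name in ("199A>G", "223delT", "997-999del", "200-201insT")}
```

A parsed record of this kind is the sort of structured description that could be stored as a property of a variant node.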

Variants in introns can use the intron number with a positive number indicating a distance from the G of the invariant donor GU or a negative number indicating a distance from an invariant G of the acceptor site AG. Thus, IVS3+1C>T shows a C to T substitution at nt+1 of intron 3. In any case, cDNA nucleotide numbering may be used to show the location of the mutation, for example, in an intron. Thus, c.1999+1C>T denotes the C to T substitution at nt+1 after nucleotide 1999 of the cDNA. Similarly, c.1997-2A>C shows the A to C substitution at nt-2 upstream of nucleotide 1997 of the cDNA. When the full length genomic sequence is known, the mutation can also be designated by the nt number of the reference sequence.

Relative to a reference, a patient's genome may vary by more than one mutation, or by a complex mutation that is describable by more than one character string or systematic name. The invention further provides systems and methods for describing more than one variant using a systematic name. For example, two mutations in the same allele can be listed within brackets as follows: [1997G>T; 2001A>C]. Systematic nomenclature is discussed in den Dunnen & Antonarakis, 2003, Mutation Nomenclature, Curr Prot Hum Genet 7.13.1-7.13.8 as well as in Antonarakis and the Nomenclature Working Group, 1998, Recommendations for a nomenclature system for human gene mutations, Human Mutation 11:1-3. By such means, a mutation can be described in the property index file of a variant node.

While described here with reference to FIG. 9 as a “variant node”, it will be appreciated that node 901 can be instantiated or used as any type, with the type being stored in the property store.

FIG. 10 illustrates a simple example in which an allele node is used to show that an allele includes a certain mutation by representing the mutation using a variant node and representing a relationship between the allele node and the variant node with a “HAS_VARIANT” type relationship. This illustrates the simplicity of connecting alleles to variants using relationships. After the variant is created, literature references can be added to the variant.

FIG. 11 shows elements of a graph database in which a variant has been connected to two nodes, each for a literature reference. From this setup emerges one of the powerful applications of a graph database in processing results from NGS sequencing data. If variant changes are made, those variant changes can be tracked within systems of the invention without requiring upsetting the structure of the existing database.

To illustrate the invention by an example, a patient sample could be sequenced via NGS technologies and the sequencing results could include, in a VCF file, a description of a mutation in that patient's mitochondrial genome. A variant node is used and a property of that node (e.g., in a property index file) is used to describe that mutation as m.593T>C. A relationship is created to show that the mutation is described in a literature reference. The relationship is a pointer to a LitRef node and the LitRef node points to a property index file that holds information about the literature reference. The property index file contains Zhang et al., 2011, Is mitochondrial tRNAphe variant m.593T>C a synergistically pathogenic mutation in Chinese LHON families with m.11778G>A?, PLoS ONE 6(10):e26511. Based on the synergistic pathogenesis alluded to by the literature reference, a geneticist or curator may deem it important to flag instances in which a patient has both m.593T>C and m.11778G>A in their genome. This example illustrates the real power of a graph database and index-free adjacency. A query can be initiated that starts at the LitRef node just described and traverses to the variant node. That query can traverse to the sample node for that patient and even to a node for the patient. That query can then, by its own terms, traverse from the patient or sample node to check for the presence of a second variant node representing m.11778G>A. The query can be programmed to, in the absence of said second variant node, classify the mutation as benign. The query can be programmed to, in the presence of said second variant node, classify the mutation as pathogenic. Intermediate labels or other categories can also be used. Since the query is traversing across a graph database, a comprehensive index-based look-up is not required as would be required in prior art RDBMSs.
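The traversal just described can be sketched over a small in-memory graph. This is a hedged illustration only: the Graph class, the relationship names DESCRIBED_IN and HAS_VARIANT as used here, the node identifiers, and the two-way benign/pathogenic rule are assumptions chosen to mirror the example, not a clinical classification procedure.

```python
from collections import defaultdict

class Graph:
    """Minimal property graph: nodes with properties, typed directed relationships
    stored on both endpoints so they can be followed in either direction."""

    def __init__(self):
        self.props = {}
        self.out = defaultdict(list)   # node -> [(relationship type, end node)]
        self.inc = defaultdict(list)   # node -> [(relationship type, start node)]

    def add_node(self, node_id, **props):
        self.props[node_id] = props

    def relate(self, start, rel_type, end):
        self.out[start].append((rel_type, end))
        self.inc[end].append((rel_type, start))

    def outgoing(self, node_id, rel_type):
        return [n for t, n in self.out[node_id] if t == rel_type]

    def incoming(self, node_id, rel_type):
        return [n for t, n in self.inc[node_id] if t == rel_type]

def classify_from_litref(g, litref, variant_name, partner_name):
    """Start at a LitRef node, walk to the variant it describes, then to each
    sample carrying that variant, and check that sample for the partner variant."""
    results = {}
    for variant in g.incoming(litref, "DESCRIBED_IN"):
        if g.props[variant]["name"] != variant_name:
            continue
        for sample in g.incoming(variant, "HAS_VARIANT"):
            carried = {g.props[v]["name"]
                       for v in g.outgoing(sample, "HAS_VARIANT")}
            results[sample] = ("pathogenic" if partner_name in carried
                               else "benign")
    return results

g = Graph()
g.add_node("litref1", citation="Zhang et al., 2011, PLoS ONE 6(10):e26511")
g.add_node("v593", name="m.593T>C")
g.add_node("v11778", name="m.11778G>A")
g.add_node("sample1")
g.add_node("sample2")
g.relate("v593", "DESCRIBED_IN", "litref1")
g.relate("sample1", "HAS_VARIANT", "v593")
g.relate("sample1", "HAS_VARIANT", "v11778")
g.relate("sample2", "HAS_VARIANT", "v593")

labels = classify_from_litref(g, "litref1", "m.593T>C", "m.11778G>A")
```

Every step follows relationships held by the node at hand, so the work done depends on the neighborhood visited rather than on the total number of samples or variants stored.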

It is important to note that the “graph” of the described graph databases follows the counter-intuitive path of connecting things of un-related categories. Although it is not the primary structure or purpose described herein, one may imagine embodiments in which a graph has a horizontal structure connecting entities that are essentially similar in nature so that the database maps a natural phenomenon. For example, a graph database could represent protein interactions using the edges (aka pointers or relationships) to represent interactions between proteins and thus influxes of data would expand the graph “horizontally”. However, the invention is unlike the protein interaction example in that the graph expands “vertically” outside of a set of natural phenomena. Since a sample can have a node, the graph can reach to laboratory management systems and receive information from or provide information to, for example, sample chain of custody modules. With NGS results from that sample, the graph can leap vertically to a genetic plane and represent human mutations that are being discovered. For an NGS carrier screening application, the graph can reach vertically into a different category to represent medical literature, and can go on to be used in patient reports. The power of this novel vertical structure is shown by the illustration of use of the invention for reporting carrier screening results.

FIG. 12 illustrates a graph database in which a variant has been connected to two nodes, each for a literature reference and in which updated information about the variant has been introduced in two changes. For example, node 17451 may represent a specific mutation such as a SNP (e.g., G at a certain position). Node 17454 could be created when A is observed at that position.

Systems and methods of the invention support a plurality of different use cases and applications. For example, if a graph database is used in support of NGS carrier screening, one capability that will emerge is support for evaluating and reporting allele frequency.

For example, where a practitioner wants to know the frequency of a certain allele across all included data for which research consent has been given, the graph database can easily be queried for that.

FIG. 13 presents an example database that may be queried for allele frequency.

Using—for example, in Cypher—the following (pseudo) code produces the desired result.

MATCH (a:Allele)<--(sd:SampleData)-->(s:Sample)-->(p:Patient) RETURN a, count(DISTINCT p)
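The aggregation that the query performs, counting distinct patients per allele along the SampleData, Sample, and Patient path, can also be sketched in plain Python. The identifiers and the toy relationship data below are hypothetical, and the function name patients_per_allele is introduced only for this illustration.

```python
from collections import defaultdict

# Hypothetical relationships, following the pattern of the query above:
# SampleData -> Allele, SampleData -> Sample, Sample -> Patient.
sampledata_allele = [("sd1", "A1"), ("sd2", "A1"), ("sd3", "A2"), ("sd4", "A1")]
sampledata_sample = {"sd1": "s1", "sd2": "s2", "sd3": "s2", "sd4": "s3"}
sample_patient = {"s1": "p1", "s2": "p2", "s3": "p1"}

def patients_per_allele(sd_allele, sd_sample, s_patient):
    """Count distinct patients reachable from each allele."""
    patients = defaultdict(set)
    for sample_data, allele in sd_allele:
        sample = sd_sample.get(sample_data)
        patient = s_patient.get(sample)
        if patient is not None:
            # A set keeps each patient once, like count(DISTINCT p).
            patients[allele].add(patient)
    return {allele: len(found) for allele, found in patients.items()}

counts = patients_per_allele(sampledata_allele, sampledata_sample, sample_patient)
```

In this toy data, patient p1 contributes two samples carrying allele A1 but is counted once, which is the purpose of counting distinct patients rather than samples.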

Another illustrative use case for application of a graph database is the curation of variants, as was illustrated by FIGS. 10-12. The curation of variants involves taking variants (i.e., genetic mutations) that have been picked up through a sequencing platform and then looking through the literature for references to evaluate how common the variant is and whether it is identified as pathogenic, benign, or somewhere in-between. This can be supported and modeled by tracking three things: the connection of an allele to a variant; the variant and variant changes; and literature references per variant. To illustrate, a geneticist may review a patient's NGS sequencing results and observe the presence of a poly-T variant. The geneticist may connect this variant to an allele of the cystic fibrosis transmembrane conductance regulator (CFTR) gene located on the long arm of chromosome 7 (e.g., as shown in FIG. 10). The geneticist may further observe that this variant is described in the literature and connect the variant object to two different LitRef objects such as one for each of Rowntree and Harris, The phenotypic consequences of CFTR mutations, Ann Hum Gen 67:471-485 (2003) and Kreindler, Cystic fibrosis: exploiting its genetic basis in the hunt for new therapies, Pharmacol Ther 125(2):219-229 (2010) (e.g., according to the diagram of FIG. 11). Moreover, the geneticist may observe that the mutation (the poly-T) is a novel poly-T variant in the acceptor splice site of intron 8 of CFTR in cis with R117H (i.e., c.350G>A based on GenBank cDNA reference sequence NM_000492.3). In this instance, the geneticist may want to update the graph for this patient by connecting the poly-T mutation to a variant object for c.350G>A (e.g., as seen in FIG. 12). To further illustrate, the chain of updated variants may reveal that the patient has an allele with the T5 poly-T variant, which evidence suggests plays a role in pathogenic alternate splicing or exon skipping.
Moreover, the geneticist may further consider the data and determine that, in fact, the patient's allele includes a T6 form of the poly-T variant and may update the variant nodes to so reflect. Here, with the addition of a T6 node, other content need not be modified. The geneticist may add a LitRef node for Huang, et al., Comparative analysis of common CFTR polymorphisms poly-T, TG-repeats and M470V in a healthy Chinese population, World J Gastroenterol 14(12):1925-30 (2008). Thus if the NGS screening gave results indicating an R117H with T6 variant, methods and systems of the invention can be used to relate these clinical results to the existing infrastructure of medical information on one level and back to the patient via the sample (through the VCF files and instrument run) on another level. Since a graph database, preferably with index-free adjacency, is used, those connections can be traversed to provide a report to the patient's attending physician, where the report shows the patient to be R117H T6 and gives the relevant literature with information about treatment and outcomes. Since a graph database is used, the traversals are very fast and traversal times do not increase with increasing volumes of database contents, as query times must in the context of prior art relational databases.

As one skilled in the art would recognize as necessary or best-suited for performance of the methods of the invention, a computer system or machines of the invention include one or more processors (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory and a static memory, which communicate with each other via a bus.

FIG. 14 diagrams a system 1500 suitable for performing methods of the invention. As shown in FIG. 14, system 1500 may include one or more of a server computer 1513, a terminal 1567, a sequencer 1501, a sequencer computer 1533, a computer 1549, or any combination thereof. Each such computer device may communicate via network 1509. Sequencer 1501 may optionally include or be operably coupled to its own, e.g., dedicated, sequencer computer 1533 (including any input/output mechanisms (I/O), processor, and memory). Additionally or alternatively, sequencer 1501 may be operably coupled to a server 1513 or computer 1549 (e.g., laptop, desktop, or tablet) via network 1509. Computer 1549 includes one or more processors, memory, and I/O. Where methods of the invention employ a client/server architecture, any steps of methods of the invention may be performed using server 1513, which includes one or more of processor, memory, and I/O, capable of obtaining data, instructions, etc., or providing results via an interface module or providing results as a file. Server 1513 may be engaged over network 1509 through computer 1549 or terminal 1567, or server 1513 may be directly connected to terminal 1567. Terminal 1567 is preferably a computer device. A computer according to the invention preferably includes one or more processors coupled to an I/O mechanism and memory.

A processor may be provided by one or more processors including, for example, one or more of a single core or multi-core processor (e.g., AMD Phenom II X2, Intel Core Duo, AMD Phenom II X4, Intel Core i5, Intel Core i7 Extreme Edition 980X, or Intel Xeon E7-2820).

An I/O mechanism may include a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a disk drive unit, a signal generation device (e.g., a speaker), an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device (e.g., a network interface card (NIC), Wi-Fi card, cellular modem, data jack, Ethernet port, modem jack, HDMI port, mini-HDMI port, USB port), touchscreen (e.g., CRT, LCD, LED, AMOLED, Super AMOLED), pointing device, trackpad, light (e.g., LED), light/image projection device, or a combination thereof.

Memory according to the invention refers to a non-transitory memory which is provided by one or more tangible devices which preferably include one or more machine-readable medium on which is stored one or more sets of instructions (e.g., software) embodying any one or more of the methodologies or functions described herein. The software may also reside, completely or at least partially, within the main memory, processor, or both during execution thereof by a computer within system 1500, the main memory and the processor also constituting machine-readable media. The software may further be transmitted or received over a network via the network interface device.

While the machine-readable medium can in an exemplary embodiment be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention. Memory may be, for example, one or more of a hard disk drive, solid state drive (SSD), an optical disc, flash memory, zip disk, tape drive, “cloud” storage location, or a combination thereof. In certain embodiments, a device of the invention includes a tangible, non-transitory computer readable medium for memory. Exemplary devices for use as memory include semiconductor memory devices (e.g., EPROM, EEPROM, solid state drive (SSD), and flash memory devices such as SD, micro SD, SDXC, SDIO, and SDHC cards); magnetic disks (e.g., internal hard disks or removable disks); and optical disks (e.g., CD and DVD disks).

Components of system 1500 may be under the control of a carrier screening service provider and may be operated to obtain data representing a mutation in a genome of an individual, use a variant node in a graph database to store a description of the mutation (while storing, in the variant node, a pointer to an adjacent node that provides information about a clinical significance of the variant), and query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual. Functionality of server computer 1513 may be provided by an outside vendor such as Amazon Web Services or Amazon's EC2. In fact, the carrier screening entity that is analyzing the mutations from the sample may not and need not have actual knowledge of the physical location and type of computers that provide server computer(s) 1513. It is enough that the entity have access to, and the ability to control, at least a portion of each of the one or more server computers 1513. In some embodiments, a sequencing instrument 1501 is employed (e.g., an Illumina HiSeq 2000), which itself includes a sequencer computer 1533. The sample from the patient may be received from an outside source (e.g., from a phlebotomy facility down the hall) or may be sent by courier (e.g., in an Eppendorf tube). Generally, the service provider will have access to and use a computer 1549 for coordinating methods of the invention. It is important to note that any given computer is optional, but typically at least one of the depicted computers (sequencer computer 1533, local computer 1549, or server computer 1513) will be used to perform steps of the methods of the invention. In some embodiments, sequencer 1501 is operated by an outside service provider in support of or on order of the carrier screening entity. Thus, generally, the carrier screening professional has access to or control over components of the system.
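By way of illustration only, the operations described above (storing a mutation in a variant node, storing in that node a pointer to an adjacent node describing clinical significance, and querying for a report) may be sketched in Python as follows. This is a minimal in-memory sketch and not the implementation of any particular graph database product. The class names (Node, Relationship, GraphDB), the relationship types (HAS_VARIANT, HAS_SIGNIFICANCE), and all property values are hypothetical and are chosen solely for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    node_id: int
    label: str                         # e.g. "Individual", "Variant"
    properties: Dict[str, str] = field(default_factory=dict)
    first_rel: Optional[int] = None    # pointer to the node's first relationship


@dataclass
class Relationship:
    rel_id: int
    rel_type: str
    start: int                         # id of the node holding the pointer
    end: int                           # id of the adjacent node
    next_rel: Optional[int] = None     # next relationship of the start node


class GraphDB:
    """Hypothetical store: nodes and relationships are individual records."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.rels: List[Relationship] = []

    def add_node(self, label: str, **properties: str) -> int:
        node = Node(len(self.nodes), label, dict(properties))
        self.nodes.append(node)
        return node.node_id

    def link(self, start: int, end: int, rel_type: str) -> int:
        # Prepend the new relationship to the start node's relationship chain.
        rel = Relationship(len(self.rels), rel_type, start, end,
                           next_rel=self.nodes[start].first_rel)
        self.rels.append(rel)
        self.nodes[start].first_rel = rel.rel_id
        return rel.rel_id

    def neighbors(self, node_id: int, rel_type: str) -> Iterator[Node]:
        # Follow stored pointers from the node to its adjacent nodes.
        rel_id = self.nodes[node_id].first_rel
        while rel_id is not None:
            rel = self.rels[rel_id]
            if rel.rel_type == rel_type:
                yield self.nodes[rel.end]
            rel_id = rel.next_rel

    def report(self, individual_id: int) -> List[str]:
        # Query: individual -> variant nodes -> clinical significance nodes.
        lines = []
        for variant in self.neighbors(individual_id, "HAS_VARIANT"):
            for sig in self.neighbors(variant.node_id, "HAS_SIGNIFICANCE"):
                lines.append(f"{variant.properties['description']}: "
                             f"{sig.properties['classification']}")
        return lines


# Usage with placeholder (not real) data.
db = GraphDB()
patient = db.add_node("Individual", name="hypothetical patient")
variant = db.add_node("Variant", description="hypothetical variant A")
significance = db.add_node("ClinicalSignificance",
                           classification="hypothetical pathogenic")
db.link(patient, variant, "HAS_VARIANT")
db.link(variant, significance, "HAS_SIGNIFICANCE")
print(db.report(patient))
```

In this sketch the report is produced by following pointers stored in the records themselves, so the query visits only the nodes adjacent to the individual and does not scan unrelated records.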

INCORPORATION BY REFERENCE

References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.

EQUIVALENTS

Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.

Claims

1. A system for describing genetic information, the system comprising:

at least one computer comprising memory coupled to a processor, the system having at least a portion of a graph database stored therein, wherein the system is operable to: obtain data representing a mutation in a genome of an individual; use a node in the graph database to store a description of the mutation; store, in the node, a pointer to an adjacent node that provides information about a clinical significance of the mutation; and query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.

2. The system of claim 1, wherein the system is operable to obtain the data representing the mutation by receiving at least one sequence read file that includes the data.

3. The system of claim 2, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

4. The system of claim 1, wherein the data representing the mutation is obtained as part of a file.

5. The system of claim 4, wherein the file has a format selected from the group consisting of variant call format; sequence alignment map; binary alignment map; FASTA; and FASTQ.

6. The system of claim 4, operable to represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node.

7. The system of claim 6, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

8. The system of claim 1, wherein the data representing the mutation comprises a description of the mutation as a variant of a reference human genome.

9. The system of claim 8, wherein the description of the mutation is obtained from a VCF record in a VCF file.

10. The system of claim 9, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

11. The system of claim 1, further operable to:

obtain sequencing data representing a plurality of mutations in the genome of the individual, the plurality of mutations being represented as variant calls relative to a human genome reference;
use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation; and
link the individual to an allele node based on the plurality of mutations.

12. The system of claim 11, wherein the graph database comprises:

nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and
edges defining relationships between pairs of the nodes.

13. The system of claim 12, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

14. The system of claim 1, wherein the graph database comprises:

nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and
edges defining relationships between pairs of the nodes.

15. The system of claim 14, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.

16. A method for analyzing mutations, the method comprising:

obtaining data representing a mutation in a genome of an individual;
using a node in a graph database to store a description of the mutation;
storing, in the node, a pointer to an adjacent node that provides information about a clinical significance of the mutation; and
querying the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.

17. The method of claim 16, wherein obtaining the data representing the mutation comprises:

obtaining a sample that includes a nucleic acid from the individual; and
sequencing the nucleic acid to obtain a sequence read file that includes the data.

18. The method of claim 17, further comprising representing the sample in the graph database using a sample node and connecting the sample node via a pointer to a read file node representing the sequence read file and metadata associated with the data.

19. The method of claim 16, wherein the data representing a mutation is obtained as part of a file.

20. The method of claim 19, wherein the file has a format selected from the group consisting of variant call format; sequence alignment map; binary alignment map; FASTA; and FASTQ.

21. The method of claim 19, further comprising representing the file as a file node in the graph database and storing in the mutation node a pointer to the file node.

22. The method of claim 16, wherein the data representing a mutation comprises a description of the mutation as a variant of a reference human genome.

23. The method of claim 22, wherein the description of the mutation is provided as a VCF record in a VCF file.

24. The method of claim 16, further comprising:

obtaining sequencing data representing a plurality of mutations in the genome of the individual, each of the plurality of mutations being represented as variant calls relative to a human genome reference; and
using, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation.

25. The method of claim 16, wherein the graph database comprises:

nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and
edges defining relationships between pairs of the nodes.
Patent History
Publication number: 20160048608
Type: Application
Filed: Aug 14, 2015
Publication Date: Feb 18, 2016
Inventors: Alexander Frieden (Somerville, MA), Caleb J. Kennedy (Arlington, MA), Xavier S. Haurie (Belmont, MA)
Application Number: 14/826,595
Classifications
International Classification: G06F 17/30 (20060101); G06F 19/28 (20060101);