METHODS AND SYSTEMS FOR PROCESSING GENOMIC DATA
Methods and systems for processing biological sequence data, such as genomic data, are disclosed. In one implementation, a set of biological sequences, such as DNA sequences, may be evaluated and one or more reference sequences may be determined or selected based on the set of biological sequences, which may include iterative determination or selection. The set of biological sequences may be compressed using the one or more reference sequences. Processing may include generating delta representations associated with the biological sequences, as well as generating dictionary information, which may be used for further compression. Metadata may be included in the compressed data. The compressed data may be stored in a compressed data file, which may be in a compressed genomic database or other data storage medium.
This application is a continuation of application Ser. No. 12/828,234, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 30, 2010, which claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 25, 2010, the content of which is incorporated by reference herein in its entirety for all purposes.
DESCRIPTION OF THE TEXT FILE SUBMITTED ELECTRONICALLY
The contents of the text file submitted electronically herewith are incorporated herein by reference in their entirety: A computer readable format copy of the Sequence Listing (filename: ANNA_001_00US_SeqList_ST25.txt, date recorded: Oct. 15, 2010, file size 2 kilobytes).
Please insert the sequence listing, submitted electronically herewith, after the abstract.
FIELD
This application is directed generally to the processing of genomic and other biological sequence data. More particularly, but not exclusively, the application relates to methods and systems for compressing, storing, processing and transmitting biological sequence data and associated information.
BACKGROUND
Deoxyribonucleic acid (“DNA”) sequencing is the process of determining the ordering of nucleotide bases (adenine (A), guanine (G), cytosine (C) and thymine (T)) in molecular DNA. Knowledge of DNA sequences is invaluable in basic biological research as well as in numerous applied fields such as but not limited to medicine, health, agriculture, livestock, population genetics, social networking, biotechnology, forensic science, security, and other areas of biology and life sciences.
Sequencing has been done since the 1970s, when academic researchers began using laborious methods based on two-dimensional chromatography. Due to the initial difficulties in sequencing in the early 1970s, the cost and speed could be measured in scientist-years per nucleotide base as researchers set out to sequence the first restriction endonuclease site with just a handful of bases.
Thirty years later, the entire 3.2 billion bases of the human genome have been sequenced, with a first complete draft of the human genome done at a cost of about three billion dollars. Since then sequencing costs have rapidly decreased. Today, many expect the cost of sequencing the human genome to be in the hundreds of dollars or less in the near future, with the results available in minutes, much like a routine blood test.
As the cost of sequencing the human genome continues to decrease, the number of individuals having their DNA sequenced for medical, as well as other purposes, will likely explode. Moreover, sequencing of other organisms will likely also increase for research purposes as well as disease analysis. Because of the large size of DNA sequences for many organisms, including humans, this explosion will lead to problems with DNA sequence data storage, transmission and analysis. Accordingly, there is a need in the art to address these problems as well as others.
SUMMARY
This disclosure relates generally to methods and systems for compressing, storing, processing, analyzing and transmitting biological sequence data and associated information.
In one aspect, the disclosure is directed to a computer-implemented method for processing a plurality of biological data sequences, the method comprising evaluating the plurality of data sequences, defining at least a first reference sequence based upon the evaluating, comparing the plurality of data sequences to the first reference sequence and generating, based upon the comparing, a first plurality of delta representations of the plurality of data sequences.
In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of biological data sequences, the method comprising creating, using a reference sequence, a plurality of first delta representations of the plurality of data sequences, defining a modified reference sequence by modifying the reference sequence based upon one or more characteristics of the plurality of first delta representations and generating, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences.
In another aspect, the disclosure is directed to a computer-implemented method for compressing biological sequence data, the method comprising processing the sequence data relative to a reference sequence and generating, based upon the processing, a delta representation of the sequence data wherein the delta representation includes embedded data associated with the sequence data.
In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of biological data sequences contained within a database, the method comprising selecting a reference sequence based upon the plurality of data sequences, compressing the plurality of data sequences using the reference sequence to yield a plurality of first delta representations of the plurality of data sequences, generating a plurality of second delta representations by transforming the plurality of first delta representations using a dictionary procedure and creating a modified reference sequence by modifying the reference sequence based upon the plurality of second delta representations.
In another aspect, the disclosure is directed to a computer-implemented method for compressing a plurality of biological sequences corresponding to a particular species, the method comprising selecting a first reference sequence associated with the particular species, comparing the plurality of biological sequences to the reference sequence and generating, based upon the comparing, a first plurality of delta representations of the plurality of biological sequences. The particular species may be an animal species, plant species, or other species or organism such as bacteria, viruses, fungi, other organisms or agents that use DNA, RNA or biopolymer as genomic material, including, for example, prions. The first reference sequence may be selected from the plurality of biological sequences or may be selected independently of the plurality of biological sequences.
In another aspect, the disclosure is directed to a semiconductor device or devices for performing the computer-implemented methods described above.
In another aspect, the disclosure is directed to a computer system for performing the computer-implemented methods described above.
In another aspect, the disclosure is directed to means for performing the computer-implemented methods described above.
In another aspect, the disclosure is directed to a computer program product including a computer-readable medium containing instructions for execution by a processor to perform the methods described above.
In another aspect, the disclosure is directed to a computer data storage product including genomic sequence data encoded using the computer-implemented methods described above.
Additional aspects are further described below in conjunction with the appended drawings.
The present application may be more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, wherein:
This disclosure relates generally to processing, storage and transmission of biological sequences such as DNA sequences.
In one aspect, the disclosure is directed to a computer-implemented method for processing a plurality of data sequences, the method comprising evaluating the plurality of data sequences, defining at least a first reference sequence based upon the evaluating, comparing the plurality of data sequences to the first reference sequence and generating, based upon the comparing, a first plurality of delta representations of the plurality of data sequences. The delta representations may include one or more operations capable of being performed to regenerate one of the plurality of data sequences based upon the first reference sequence. The operations may be associated with substitution processing and/or insertion/deletion processing.
The method may further include defining a second reference sequence, performing a comparison of the plurality of delta representations to the second reference sequence and generating, based upon the comparison, a second plurality of delta representations of the plurality of data sequences. The second reference sequence may be generated by modifying, based upon an evaluation of the first plurality of delta representations, the first reference sequence. The method may further include generating, using the second reference sequence, a second plurality of delta representations of the plurality of data sequences. The generating of the second reference sequence may include identifying at least one sequence segment common to plural ones of the first plurality of delta representations and modifying the first reference sequence in accordance with the at least one sequence segment. The plurality of data sequences may comprise DNA data sequences, and the at least one sequence segment may comprise a mutation. The plurality of data sequences may be stored in a database, and the database may further include embedded data associated with segments of the sequences.
The method may further include replacing at least a portion of the embedded data with address information associated with a dictionary structure, where the at least a portion of the embedded data is stored within the dictionary structure. The plurality of data sequences may comprise DNA data sequences and the embedded data may comprise correlative information relating to mutations within the DNA data sequences. The correlative information may include pharmacological information, clinical result information and/or other information. The one or more operations of the method may comprise substitution, insertion and/or deletion.
The evaluating may include determining an average distance of each of the plurality of data sequences from other of the plurality of data sequences. The reference sequence may comprise one of the plurality of data sequences that has a minimum average distance from others of the plurality of data sequences. The evaluating may include determining, for sequence positions of the plurality of reference sequences, entries for the sequence positions common to plural of the plurality of data sequences. The defining may include assigning the entries to respective sequence positions of the first reference sequence.
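By way of a non-limiting illustration, the two reference-definition approaches described above (choosing the data sequence with the minimum average distance from the others, and assigning to each position the entry common to most data sequences) may be sketched as follows. The sketch assumes equal-length sequences and uses Hamming distance as the distance measure; the disclosure does not specify a metric, and the function names are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def select_reference(seqs: list[str]) -> str:
    """Return the sequence with the minimum average distance to the others.

    Assumes at least two sequences are supplied.
    """
    def average_distance(i: int) -> float:
        others = [s for j, s in enumerate(seqs) if j != i]
        return sum(hamming(seqs[i], s) for s in others) / len(others)

    return seqs[min(range(len(seqs)), key=average_distance)]


def consensus_reference(seqs: list[str]) -> str:
    """Build a reference from the most common entry at each position."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


# Illustrative data only; not taken from the disclosure.
sequences = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTTCGT"]
chosen = select_reference(sequences)
consensus = consensus_reference(sequences)
```

In this small example both approaches yield the same reference, but in general the consensus need not equal any one of the input sequences.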
In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of data sequences, the method comprising creating, using a reference sequence, a plurality of first delta representations of the plurality of data sequences, defining a modified reference sequence by modifying the reference sequence based upon one or more characteristics of the plurality of first delta representations and generating, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences. The method may further comprise replacing a portion of one of the plurality of first delta representations with address information associated with a dictionary structure, with the portion of the first delta representation being stored within the dictionary structure. The method may further comprise replacing a portion of one of the plurality of second delta representations with address information associated with a dictionary structure, with the portion of the second delta representation being stored within the dictionary structure. The one or more characteristics may relate to a set of delta values associated with one or more predefined regions of the plurality of data sequences. The plurality of sequences may comprise DNA sequences, and the one or more predefined regions may be known to be associated with a disease condition.
The defining may include identifying at least one sequence segment common to plural ones of the plurality of first delta representations and modifying the reference sequence in accordance with the at least one sequence segment. The plurality of data sequences may comprise DNA data sequences and the at least one sequence segment may comprise a mutation. The plurality of data sequences may be stored within a database, and the database may further include embedded data associated with segments of the plurality of data sequences.
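As a hedged illustration of modifying a reference sequence based upon segments common to plural delta representations, the following sketch handles substitutions only. The majority threshold and the function names are assumptions made for the example and are not specified in the disclosure.

```python
from collections import Counter


def substitution_delta(reference: str, sequence: str) -> list[tuple[int, str]]:
    """List (position, base) pairs where `sequence` differs from `reference`."""
    return [(i, b) for i, (r, b) in enumerate(zip(reference, sequence)) if r != b]


def refine_reference(reference: str, deltas: list[list[tuple[int, str]]],
                     threshold: float = 0.5) -> str:
    """Adopt into the reference any substitution shared by more than
    `threshold` of the delta representations (assumed majority rule)."""
    counts = Counter(edit for delta in deltas for edit in delta)
    bases = list(reference)
    for (position, base), n in counts.items():
        if n / len(deltas) > threshold:
            bases[position] = base
    return "".join(bases)


# Illustrative data only.
reference = "ACGTACGT"
sequences = ["ACCTACGT", "ACCTACGA", "ACCTTCGT", "ACGTACGT"]

first_deltas = [substitution_delta(reference, s) for s in sequences]
modified = refine_reference(reference, first_deltas)
second_deltas = [substitution_delta(modified, s) for s in sequences]

first_size = sum(len(d) for d in first_deltas)    # edits against the original
second_size = sum(len(d) for d in second_deltas)  # edits against the modified
```

Here the substitution shared by three of the four sequences is folded into the modified reference, so the second delta representations contain fewer entries in total than the first.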
The method may further include replacing at least a portion of the embedded data with address information associated with a dictionary structure, and the at least a portion of the embedded data may be stored within the dictionary structure. The plurality of data sequences may comprise DNA data sequences and the embedded data may comprise correlative information relating to mutations within the DNA data sequences. The correlative information may include pharmacological information and/or clinical result information and/or other information.
The generating may include determining differences between the plurality of data sequences and the modified reference sequence. The generating may include specifying, for each of the differences, a corresponding position in the modified reference sequence, an operation, and a value associated with the operation. The operation corresponding to one of the differences may be substitution, insertion or deletion.
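A minimal sketch of such a delta representation, in which each difference is recorded as a position in the reference, an operation, and a value, is given below. It relies on Python's standard `difflib.SequenceMatcher` to find the differences, which is an assumption of this example rather than the alignment procedure of the disclosure; `difflib` does not guarantee a minimal edit script. The operation codes "S", "I" and "D" and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(reference: str, sequence: str) -> list[tuple]:
    """Describe `sequence` as (position, operation, value) edits to `reference`.

    "S": substitute the bases in value starting at position;
    "I": insert the bases in value before position;
    "D": delete value (a count) bases starting at position.
    """
    delta = []
    matcher = SequenceMatcher(None, reference, sequence, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        ref_len, seq_len = i2 - i1, j2 - j1
        common = min(ref_len, seq_len)
        if common:
            delta.append((i1, "S", sequence[j1:j1 + common]))
        if ref_len > common:
            delta.append((i1 + common, "D", ref_len - common))
        elif seq_len > common:
            delta.append((i1 + common, "I", sequence[j1 + common:j2]))
    return delta


def apply_delta(reference: str, delta: list[tuple]) -> str:
    """Regenerate the original sequence from the reference and its delta."""
    out, cursor = [], 0  # cursor is the next unread reference position
    for position, operation, value in delta:
        out.append(reference[cursor:position])
        cursor = position
        if operation == "S":
            out.append(value)
            cursor += len(value)
        elif operation == "I":
            out.append(value)
        elif operation == "D":
            cursor += value
        else:
            raise ValueError(f"unknown operation {operation!r}")
    out.append(reference[cursor:])
    return "".join(out)


# Illustrative data only.
reference = "GATTACAGGCTTAACG"
sample = "GATTGCAGGCTAACGT"
restored = apply_delta(reference, make_delta(reference, sample))
```

Because the delta is applied against the same reference that produced it, the round trip is lossless regardless of which particular edit script `difflib` selects.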
In another aspect, the disclosure is directed to a computer-implemented method for compressing sequence data, the method comprising processing the sequence data relative to a reference sequence and generating, based upon the processing, a delta representation of the sequence data wherein the delta representation includes embedded data associated with the sequence data. The method may further comprise replacing a first portion of the delta representation with address information associated with a dictionary structure, with the first portion of the delta representation being stored within the dictionary structure. The embedded data may be included within a second portion of the delta representation, and the method may further comprise replacing the second portion of the delta representation with address information associated with a dictionary structure, the second portion of the delta representation being stored within the dictionary structure.
The processing may include determining differences between the sequence data and the reference sequence. The generating may include specifying, for each of the differences, a corresponding position in the reference sequence, an operation, and a value associated with the operation. The operation corresponding to one of the differences may be substitution, insertion or deletion. The sequence data may comprise a sequence of monomers associated with a polymer. The polymer may be a biopolymer, which may be DNA or RNA.
The method may further include replacing compressed information within the delta representation with address information associated with a dictionary structure, with the compressed information being stored within the dictionary structure. The method may further include replacing at least a portion of the embedded data with address information associated with a dictionary structure, with the at least a portion of the embedded data being stored within the dictionary structure. The method may further include compressing embedded data associated with the sequence data by replacing the embedded data with address information associated with a dictionary structure, with the embedded data being stored within the dictionary structure.
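The dictionary replacement described above may be illustrated, under stated assumptions, by the following sketch. Each delta entry is assumed to carry one item of embedded data as a string; repeated items are stored once in a list that serves as the dictionary structure, and each entry keeps only the item's address (its list index). The record layout, names and annotation strings are hypothetical.

```python
def encode_with_dictionary(records: list[tuple]) -> tuple[list[tuple], list[str]]:
    """Replace the embedded data in each record with a dictionary address."""
    dictionary: list[str] = []        # address -> stored embedded data
    address_of: dict[str, int] = {}   # embedded data -> address
    encoded = []
    for position, operation, value, embedded in records:
        if embedded not in address_of:
            address_of[embedded] = len(dictionary)
            dictionary.append(embedded)
        encoded.append((position, operation, value, address_of[embedded]))
    return encoded, dictionary


def decode_with_dictionary(encoded: list[tuple], dictionary: list[str]) -> list[tuple]:
    """Restore the embedded data by looking up each address."""
    return [(p, o, v, dictionary[a]) for p, o, v, a in encoded]


# Illustrative delta entries with placeholder annotations.
records = [
    (120, "S", "A", "placeholder clinical note 1"),
    (455, "D", 3, "placeholder pharmacological note"),
    (980, "S", "T", "placeholder clinical note 1"),
]
encoded, dictionary = encode_with_dictionary(records)
```

The repeated annotation is stored once, and decoding with the same dictionary restores the original records exactly.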
The sequence data may comprise DNA sequence data with the embedded data comprising correlative information relating to mutations within the DNA sequence data. The correlative information may include pharmacological information and/or clinical result information and/or other information. The delta representation may include information relating to base modifications within the DNA sequence data. The base modifications may include one or more of methylation, carboxylation, and formylation. The base modifications may include one or more of deamination and/or any other base modification or analogs.
In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of data sequences contained within a database, the method comprising selecting a reference sequence based upon the plurality of data sequences, compressing the plurality of data sequences using the reference sequence to yield a plurality of first delta representations of the plurality of data sequences, generating a plurality of second delta representations by transforming the plurality of first delta representations using a dictionary procedure and creating a modified reference sequence by modifying the reference sequence based upon the plurality of second delta representations.
The method may further include compressing the plurality of data sequences using the modified reference sequence. At least one of the plurality of first delta representations may include embedded data associated with a corresponding one of the plurality of delta representations. The dictionary procedure may include replacing at least a portion of the embedded data with address information associated with a dictionary structure, with the at least a portion of the embedded data being stored within the dictionary structure. The creating may include identifying at least one sequence segment common to plural ones of the plurality of first delta representations and modifying the reference sequence in accordance with the at least one sequence segment.
The plurality of data sequences may comprise DNA data sequences, and the at least one sequence segment may include a mutation. The plurality of data sequences may be stored within a database, and the database may further include embedded data associated with segments of the plurality of data sequences. The plurality of data sequences may comprise DNA data sequences and the embedded data may comprise correlative information relating to mutations within the DNA data sequences. The correlative information may include pharmacological information, clinical result information, and/or other data or information.
The compressing may include determining differences between the plurality of data sequences and the reference sequence. The compressing may include specifying, for each of the differences, a corresponding position in the reference sequence, an operation and a value associated with the operation. The operation corresponding to one of the differences may be substitution, insertion or deletion.
In another aspect, the disclosure is directed to a computer-implemented method for compressing a plurality of biological sequences corresponding to a particular species, the method comprising selecting a first reference sequence associated with the particular species, comparing the plurality of biological sequences to the reference sequence, and generating, based upon the comparing, a first plurality of delta representations of the plurality of biological sequences. The particular species may be an animal species, plant species, or other species or organism such as bacteria, viruses, fungi, or other organisms or agents that use DNA, RNA or biopolymer as genomic material, including, for example, prions. The first reference sequence may be selected from the plurality of biological sequences or may be selected independently of the plurality of biological sequences.
In another aspect, the disclosure is directed to a semiconductor device or devices for performing the computer-implemented methods described above.
In another aspect, the disclosure is directed to a computer system for performing the computer-implemented methods described above.
In another aspect, the disclosure is directed to means for performing the computer-implemented methods described above.
In another aspect, the disclosure is directed to a computer program product including a computer-readable medium containing instructions for execution by a processor to perform the methods described above.
In another aspect, the disclosure is directed to a computer data storage product including biological sequence data encoded using the computer-implemented methods described above.
Various aspects of the present invention are described below. It should be apparent that the teachings herein may be embodied in a wide variety of forms and that any specific structure, function, or both being disclosed herein is merely representative and not intended to be limiting. Based on the teachings herein one skilled in the art should appreciate that an aspect disclosed herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus or system may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, such an apparatus or system may be implemented or such a method may be practiced using other structure, functionality, or structure and functionality in addition to or other than one or more of the aspects set forth herein. Furthermore, an aspect may comprise at least one element of a claim.
While the embodiments described below are generally based on sequences of DNA, it will be apparent that embodiments of the present invention may equally be implemented for processing other biological sequences, such as RNA sequences, protein sequences, or other types of biological sequences. Accordingly, the present invention is not in any way limited in application to DNA sequences, but may rather be applied to a wide range of other biological sequences, as well as, in some applications, other types of non-biological data sequences or computer generated or simulated sequences.
Overview of Genomic Sequencing
Genomic sequences are sequences of data describing genomic characteristics of a particular organism. The term “genomic” generally refers both to data that codes (also referred to as “genetic” data) and to data that is non-coding. The term “genome” refers to an organism's entire hereditary information. Genomic sequencing is the process of determining a particular organism's genomic sequence.
The human genome, as well as that of other organisms, is made of four chemical units called nucleotide bases (also referred to herein as “bases” for brevity). These bases are adenine (A), thymine (T), guanine (G) and cytosine (C). Double stranded sequences are made of paired nucleotide bases, where each base in one strand pairs with a base in the other strand according to the Watson-Crick pairing rule: i.e., A pairs with T and C pairs with G (in RNA, thymine is replaced with uracil (U), which pairs with A).
A sequence is a series of bases, ordered as they are arranged in molecular DNA or RNA. For example, a sequence may include a series of bases arranged in a particular order, such as the following example sequence fragment:
The human haploid genome contains approximately 3 billion base pairs, which are distributed across 23 chromosomes. The 23 chromosomes include about 30,000 genes. While each individual's genome sequence is different, there is much redundancy between individuals of a particular genus, and in many cases there is also much redundancy across similar species. For example, in the human genome the sequences of two individuals are about 99.9% equivalent, and are therefore highly redundant. Viewed in another way, the number of differences in bases in sequences of different individuals is correspondingly small. These differences may include differences in the particular nucleotide at a position in the sequence, also known as a single nucleotide polymorphism or SNP, as well as addition or subtraction of nucleotides between individuals' sequences at corresponding positions in the sequences.
Because of the enormous size of the human genome, as well as the genomes of many other organisms, storing and processing genomic sequences (which are typically separate sequences generated from a particular individual or organism, but may also be a sequence fragment, sub-sequence, sequence of a particular gene, etc.) creates problems with processing, memory storage, and data transmission. Consequently, it is typically important to store the sequences in as little space as possible. Moreover, it is typically important that no information is lost in storage and transmission. Accordingly, processing for storage or transmission of whole or partial sequences should include removing redundant information in a sequence in a lossless fashion.
Existing sequence storage techniques use coding for the four nucleotides (A, C, G and T), which may be mapped to characters in a text format. This sequence information may be mapped to binary data. For example, A may be mapped to binary 00, C may be mapped to 01, G to 10 and T to 11 as shown in
Five sequences, sequences 210-250, are shown, along with associated memory mappings of the sequences in memory locations 210M-250M, which may be in a memory device such as DRAM, SRAM, Flash, CAM, PCM, and the like, or may be in a database such as on a disk drive or other storage media such as DVD ROM, Blu-Ray, etc. In a memory or database, the information shown would require 5 times 40 bits, or 200 bits. In this example the sequence size is very small; however, for typical sequences, such as a human sequence, each individual's sequence data would be approximately six billion bits long (i.e., about 6 Gb, or about 0.75 gigabytes (GB)) if coded as shown.
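The two-bit mapping described above (A to 00, C to 01, G to 10 and T to 11) may be sketched as follows. The packing order (first base in the most significant bits) and the function names are assumptions of this example; the sequence length is returned alongside the packed bytes so that padding bits can be ignored on decoding.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index equals the two-bit code


def pack(sequence: str) -> tuple[bytes, int]:
    """Pack a DNA string at two bits per base; return (bytes, base count)."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODE[base]
    n_bytes = (2 * len(sequence) + 7) // 8  # round up to whole bytes
    return value.to_bytes(n_bytes, "big"), len(sequence)


def unpack(data: bytes, length: int) -> str:
    """Recover the DNA string from packed bytes and the base count."""
    value = int.from_bytes(data, "big")
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 0b11] for i in range(length)
    )


# A 20-base sequence occupies 40 bits, i.e. 5 bytes, as in the text above.
packed, n = pack("ACGTACGTACGTACGTACGT")
```

For instance, the four bases ACGT pack into the single byte 00011011.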
Consequently, for a database having a relatively small number of sequence entries (for example, 1024 entries or 1K), the database size would approach one terabyte, which is impractical for storage, movement, processing or analyzing for widespread use with current computing technologies. However, in genomic sequences within species (and in many cases across species) the nucleotide bases are typically very similar between individuals, normally having very small deviations. This characteristic of DNA may be used, as further described subsequently herein, to effect coding for compression of sequence data as well as perform other processing and output data generation and distribution.
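The storage figures quoted above follow from simple arithmetic, reproduced in the sketch below. It assumes two bits per base, roughly 3 billion bases per human sequence, and decimal units (1 GB = 10^9 bytes); the function name is hypothetical.

```python
BITS_PER_BASE = 2


def uncompressed_bytes(bases: int, entries: int = 1) -> float:
    """Bytes needed to store `entries` sequences of `bases` bases each."""
    return bases * BITS_PER_BASE * entries / 8


# Five 20-base example sequences: 5 x 40 bits = 200 bits.
example_bits = 5 * 20 * BITS_PER_BASE

# One human sequence: about 6 billion bits, or 0.75 GB.
one_genome_gb = uncompressed_bytes(3_000_000_000) / 1e9

# A 1024-entry database: about 768 GB, approaching one terabyte.
database_gb = uncompressed_bytes(3_000_000_000, entries=1024) / 1e9
```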
Variations in the DNA sequences of different individuals are a result of deviations (also known as mutations). For example, one type of mutation relates to substitutions of nucleotide bases at common or reference positions in the sequence. A base substitution (also known as a point mutation) is the result of one base in a sequence at a particular position or reference location being replaced with a different one (relative to another sequence, which may be a reference sequence to which other sequences are compared). A base substitution can be either a transition (e.g., between G and A, or C and T) or a transversion (e.g., between G and its paired base C, or A and its paired base T). For example, sequence 1 of
These seemingly simple and minor mutations are not biologically equivalent and can have significant biological implications and consequences. Transition mutations are more commonly observed and generally result in less deleterious effects on cells, while transversions are generally less common and may lead to more severe phenotypic effects.
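The distinction between transitions and transversions may be expressed compactly: a substitution within the purines (A, G) or within the pyrimidines (C, T, or U in RNA) is a transition, and a substitution between the two classes is a transversion. The following is a minimal sketch with a hypothetical function name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_substitution(original: str, replacement: str) -> str:
    """Classify a single-base substitution as a transition or transversion."""
    for base in (original, replacement):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unknown base {base!r}")
    if original == replacement:
        raise ValueError("not a substitution: bases are identical")
    same_class = (original in PURINES) == (replacement in PURINES)
    return "transition" if same_class else "transversion"
```

For example, `classify_substitution("G", "A")` reports a transition, while `classify_substitution("G", "C")` reports a transversion.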
In order to express the message encoded in DNA, an RNA copy of the genetic information corresponding to a single gene is translated into the amino acid sequence of the encoded protein. The RNA copy, called a messenger RNA (mRNA) is read by the ribosome in packets of three nucleotide bases called codons. There are 64 codons, of which 61 can be translated. The remaining 3 codons are not translatable and cause the ribosome to stop and disassemble and reinitiate translation of a new message. The 61 codons code for the 20 different amino acids found in proteins. Of the 61 codons, there are 19 codons that encode 10 different amino acids that can be mutated at the first, second or third position to render that specific codon a non-translatable stop codon with a single base substitution. Of these 19 mutant codons, only 5 (coding for 3 different amino acids) result from transitions while the other 14 are the result of transversions. Table 1 lists the set of codons for which single base substitutions can cause conversion to stop codons.
From Table 1, it may be observed that single base substitutions resulting in termination of translation are caused primarily by transversions. Thus transition mutations leading to a truncated protein product with negative effects are far less likely. An alternative way to consider this is that translation stop codons are important in defining the correct mature C-terminal end of proteins.
However, stop codons can also be mutated to a codon that codes for an amino acid, giving rise to a longer than intended polypeptide that will result in a reduced or null function or a toxic product. Any base change of transversion type at an existing stop codon will result in a codon that encodes an amino acid; this will allow read-through, since the codon becomes translatable (see Table 1). The only base changes to an existing stop codon that result in preserving a stop codon at that position are transition mutations.
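The statements above regarding base changes at existing stop codons can be checked by enumerating every single-base substitution of the three stop codons (UAA, UAG and UGA). The sketch below is illustrative only and its names are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}
PURINES = {"A", "G"}


def substitution_kind(a: str, b: str) -> str:
    """Return 'transition' if both bases are in the same chemical class."""
    return "transition" if (a in PURINES) == (b in PURINES) else "transversion"


def stop_codon_substitutions() -> list[tuple[str, str, str, bool]]:
    """List (stop codon, mutant codon, kind, mutant is still a stop codon)."""
    results = []
    for stop in sorted(STOP_CODONS):
        for position in range(3):
            for base in "UCAG":
                if base == stop[position]:
                    continue
                mutant = stop[:position] + base + stop[position + 1:]
                kind = substitution_kind(stop[position], base)
                results.append((stop, mutant, kind, mutant in STOP_CODONS))
    return results


changes = stop_codon_substitutions()
# Every transversion at a stop codon yields a translatable codon.
transversions_read_through = all(
    not still_stop for _, _, kind, still_stop in changes if kind == "transversion"
)
# Every change that preserves a stop codon is a transition.
preserving_are_transitions = all(
    kind == "transition" for _, _, kind, still_stop in changes if still_stop
)
```

Both checks hold for the standard stop codons. Note that the converse is not implied: some transitions at a stop codon (for example UAA to CAA) also produce a translatable codon.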
There are various types of substitutions. For example, one base at a particular position may be replaced by one of the other bases, e.g., Transition (G<->A or C<->T) and/or Transversion (G/A<->C/T). In a reversion, the mutation reverts to the original base (at the same or a second site, and the function may be regained). In a silent mutation, a single base substitution results in no change in the corresponding amino acid sequence in the protein being expressed. In a mis-sense mutation, a base substitution causes a change at a single amino acid in a protein sequence. In a non-sense mutation, a base substitution changes a codon specifying an amino acid to one of the three stop codons (UAA, UGA or UAG), thus producing a truncated protein.
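The silent, mis-sense and non-sense categories may be distinguished by translating a codon before and after a single-base substitution. The sketch below builds the standard genetic code from a compact 64-letter string (with "*" denoting a stop codon); the category labels returned, including "read-through" for a stop codon mutated to a translatable codon, and the function name are choices made for this example.

```python
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons beginning with U
    "LLLLPPPPHHQQRRRR"  # codons beginning with C
    "IIIMTTTTNNKKSSRR"  # codons beginning with A
    "VVVVAAAADDEEGGGG"  # codons beginning with G
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_point_mutation(codon: str, mutant: str) -> str:
    """Classify the effect of a single-base substitution within one codon."""
    if len(codon) != 3 or len(mutant) != 3:
        raise ValueError("codons must be three bases long")
    if sum(x != y for x, y in zip(codon, mutant)) != 1:
        raise ValueError("codons must differ at exactly one position")
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "silent"        # same amino acid (or stop codon preserved)
    if after == "*":
        return "non-sense"     # amino acid codon becomes a stop codon
    if before == "*":
        return "read-through"  # stop codon becomes translatable
    return "mis-sense"         # one amino acid replaced by another
```

For example, GAA to GAG is silent (both encode glutamic acid), GAG to GUG is mis-sense (glutamic acid to valine), and UGG to UGA is non-sense (tryptophan to stop).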
In addition to substitutions, mutations may include insertions and deletions. It is noted, however, that other conditions, in addition to substitutions, insertions and deletions, can generate disease conditions. For example, re-arrangement of base sequences, addition of foreign sequences, triplet expansions as well as sequence variations and ordering manipulations may also occur and may result in expressed or unexpressed biological variations, disease conditions, and/or other problems. Each of these types of DNA mutations can be acquired and manifested in different ways and may exert their effects in different or similar fashions.
As with substitutions, there are different types of insertions and deletions. Deletions may include single or multiple base deletions, which are generally randomly distributed in a DNA sequence and are a common replication error that may result in a frame-shift mutation if they are not a multiple of three bases. Excision deletions are larger deletions involving removal of a transposable element. They may be integrated viral sequences or other repeat sequences. Excision deletions are generally precise events that are site directed and can lead to fusion proteins.
Insertions may be simple insertions, where single or multiple bases are inserted, usually at DNA replication. These are typically random events. Transformation insertions are insertions of any foreign DNA sequence in to a cell. In particular, conjugation is an integral part of insertions of bacterial DNA sequences into a host genome, and transduction insertions are insertion of viral sequences. Transposition insertions are insertions of a transposable element into a genome, which are capable of amplifying many copies throughout the genome. These are typically not random. Transposition may also include retrotransposons.
Alu family insertions are a 300 base repeat sequence found in various numbers of copies in the human genome and account for about 10 percent of the genome. Insertions in Alu can result in colorectal and breast cancer, hemophilia and other disease conditions. Cross Over insertions are rearrangements at the chromosomal level. These recombinant events can occur between different chromosomes or within pairs. Inversions are recombination events resulting in reversed polarity in a section of the inverted sequence. Splice site mutations result in an alternative splicing event of the mRNA processing. Repeat sequences are base sequences repeated throughout the genome. For example, the CA sequence repeats in humans. These may be used in genotyping. SINEs are short interspersed repetitive elements that are non-reverse transcriptase coded and that may amplify bases of mobile elements. Both SINE and LINE are non-LTR (long term repeat) transposable elements. While both types of transposon are duplicated via an RNA intermediate, only the LINE encode an enzyme that reverse transcribes the RNA transcript to give a DNA copy that is integrated in the host genome. SINEs consist of less than 500 bases, typically of Alul restriction endonuclease recognition sequences. LINEs are long interspersed repetitive elements that encode reverse transcriptase (e.g., RNA reverse transcriptase to DNA). Copy number variations are deletions or duplications of genes that may be associated with particular diseases. Aneuploidy is a sequence having an abnormal number of chromosomes. This may be associated with diseases such as Down's Syndrome.
These define mutation events based on DNA (genomic or mitochondrial) or RNA or proteins.
Compression of Biological Sequence Data
According to aspects of the present invention, compression and associated processing may be used to manage biological sequence data and generate compressed sequence information as is further described herein.
For example, in one aspect, since any two sequences in the human genome are highly similar (i.e., approximately 99.9% the same), a reference sequence may be determined and then used to generate differences or deltas between a particular sequence and the reference sequence by taking advantage of base substitutions. The reference sequence may be determined by, for example, examining a set of available sequences and determining the reference sequence based on a characteristic of the set of sequences. This characteristic may be, for example, a minimal average distance between the set of available sequences and a particular sequence in the available set. The reference sequence may also be based on other sequences, such as a previous set of reference sequences or other forms of reference sequences as are described herein. In some implementations, multiple reference sequences may be used to encode one or more sequences such as may be stored in a database.
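One way the minimal average distance characteristic might be evaluated is sketched below. The sketch assumes that the sequences are already aligned and of equal length, and it uses a simple per-position mismatch count (Hamming distance) as the distance measure; both are assumptions made for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only; assumes aligned, equal-length sequences.
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def select_reference(sequences):
    """Return (index, average distance) of the sequence whose average
    distance to all other sequences in the set is smallest."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    best_index, best_average = None, None
    for i, candidate in enumerate(sequences):
        others = [s for j, s in enumerate(sequences) if j != i]
        average = sum(hamming(candidate, s) for s in others) / len(others)
        if best_average is None or average < best_average:
            best_index, best_average = i, average
    return best_index, best_average
```

With the hypothetical set ["ACGTACGT", "ACGTACGA", "ACGAACGT"], the first sequence differs from each of the others at one position, while the other two differ from each other at two positions, so the first sequence would be selected. The pairwise comparison grows quadratically with the number of sequences, so a practical system would likely sample the set or use a faster similarity measure.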
In another aspect, alternately or in addition to compression using reference sequences, actions may be used to define processing in conjunction with various mutations, such as insertions or deletions, to facilitate compression and/or other processing of biological data sequences.
Attention is now directed to
The uncompressed sequences stored in the database may be in a format such as FASTA, which is a text-based format for representing nucleotide or peptide sequences in which base pairs or amino acids are represented using single-letter codes, with optional sequence names and comments preceding the sequences.
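For illustration, a minimal reader for FASTA-style text is sketched below. It assumes header lines that begin with ">" and treats lines beginning with ";" as comments; the function name parse_fasta is hypothetical and the sketch does not handle every variant of the format.

```python
# Illustrative sketch only; not a complete FASTA implementation.
def parse_fasta(text):
    """Map each sequence name to its sequence, joined across lines."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}
```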
At stage 320, compression may be applied to the set of uncompressed sequences retrieved from database 315. This stage may include converting the uncompressed data from a text format or other format to a binary format, such as is described subsequently herein. The resulting compressed data may be in the form of delta representations, as further described subsequently herein. In addition, embedded data (e.g., metadata), which may be of several types including, but not limited to, deep sequencing data, single nucleotide polymorphisms (SNPs), gene expression data including alternative splicing, microRNA and CpG methylation data, copy number variations from comparative genome hybridization (CGH), transcriptomic, metabolomics and proteomic data in addition to pharmacological data, clinical data, or other data associated with the particular sequence or a part of the sequence may also be provided by a user and/or retrieved from a metadata database. The metadata may optionally be added to the uncompressed sequence data during or after the compression processing.
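As one possible illustration of a text-to-binary conversion, the sketch below packs each of the four DNA bases into two bits. The particular bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions made for this example and are not necessarily the binary format used in any embodiment; ambiguity codes such as N are not handled.

```python
# Illustrative sketch only; the 2-bit code assignment is an assumption.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack(sequence):
    """Return (base count, bytes) with four bases packed per byte."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return len(sequence), bytes(out)


def unpack(count, data):
    """Recover the original text from a packed representation."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(LETTERS[(byte >> shift) & 3])
    return "".join(bases[:count])  # drop padding from the final byte
```

The base count is kept alongside the packed bytes so that padding in the final byte can be discarded on decoding.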
At stage 330, the compressed sequence data representing the set of uncompressed sequences stored in database 315, which may also include associated metadata, may be stored in a compressed sequence database 335. This may be in the same physical database as database 315 or may be in an alternate physical database or other form of memory. The compressed sequence data may then be transmitted to other systems or databases via a communications connection, such as via wired or wireless networks, the Internet, cellular networks and the like.
In order to perform compression such as is shown in
For example, in one embodiment, as shown in
Once the difference vector (or other difference measurement metric) is determined at stage 420, a selection of one or more reference sequences may be made at stage 430. This selection may be made by, for example, selecting a reference sequence or sequences based on a difference magnitude or average. As shown in
For purposes of explanation, sequence 3 may be chosen as the reference sequence. Sequence 3 may then be used to encode the other four sequences by determining deltas or differences with respect to sequences 1, 2, 4 and 5. This may be done by, for example, generating a set of data representing the position of differences between a particular sequence and the reference sequence, along with the corresponding base difference. An example is shown in
Using this approach, the compressed sequences shown in
Other processes may be used in alternate embodiments to generate reference sequences. It is noted that the reference sequence need not be any particular sequence from a database, such as was shown in the example of
Using this approach, bases from each position are evaluated across multiple sequences to determine the most common base associated with a particular position. More specifically, at stage 510, a set of sequences are selected from a database, such as database 315. The base nucleotide values at corresponding positions in each sequence are then compared across the set of sequences at stage 520. The most common or most frequently occurring base value is then selected as the reference value for that position in the sequence at stage 530.
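The per-position selection of stages 510 through 530, followed by delta generation against the resulting reference, may be sketched as follows. The sketch assumes aligned sequences of equal length and 1-based positions; ties between equally common bases are resolved in favor of the base seen first, which is an arbitrary choice made for the example. The function names are hypothetical.

```python
# Illustrative sketch only; assumes aligned, equal-length sequences.
from collections import Counter


def consensus_reference(sequences):
    """Build a reference from the most common base at each position."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


def delta(sequence, reference):
    """List (position, base) pairs where sequence differs from reference."""
    return [
        (i + 1, base)
        for i, (base, ref_base) in enumerate(zip(sequence, reference))
        if base != ref_base
    ]


def restore(deltas, reference):
    """Rebuild a sequence from its deltas and the reference."""
    bases = list(reference)
    for position, base in deltas:
        bases[position - 1] = base
    return "".join(bases)
```

For the hypothetical set ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC"], the consensus is "ACGTAC"; the third sequence is then represented by the single pair (3, "C") and the fourth by (5, "T"), while the first two require no deltas at all.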
As with the embodiment shown in
Although the examples of
In some embodiments, in addition to reference sequence generation and use in encoding sequences, a dictionary may be used to further compress the sequence data. For example, in the reference sequence encoded data shown in table 590, several of the position: base pairs occur repeatedly. The pair 4:C occurs twice, as do the pairs 20:G and 9:T. In actual genomics data based on a larger number of sequences in a set, it is likely that many pairs will repeat across the sequences. This characteristic can be used to further compress the data by selecting or generating a dictionary to further encode the position: base pairs using a dictionary process. This approach may include encoding the most frequently occurring pairs with the fewest number of bits.
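A simplified sketch of such a dictionary process follows. Repeated position: base pairs are ranked by frequency and replaced by their rank, so that the most frequent pairs receive the smallest indices, which a variable-length integer code could then store in the fewest bits. The ranking scheme, the min_count threshold and the function names are assumptions made for illustration; the sample data is hypothetical and merely reuses the pairs named above.

```python
# Illustrative sketch only; the ranking scheme is an assumption.
from collections import Counter


def build_dictionary(delta_sets, min_count=2):
    """Return repeated (position, base) pairs, most frequent first."""
    counts = Counter(pair for deltas in delta_sets for pair in deltas)
    return [pair for pair, n in counts.most_common() if n >= min_count]


def encode(deltas, dictionary):
    """Replace each pair found in the dictionary with its integer index."""
    index = {pair: i for i, pair in enumerate(dictionary)}
    return [index.get(pair, pair) for pair in deltas]


def decode(encoded, dictionary):
    """Expand integer indices back into (position, base) pairs."""
    return [dictionary[e] if isinstance(e, int) else e for e in encoded]


# Hypothetical delta sets for three encoded sequences.
delta_sets = [
    [(4, "C"), (20, "G")],
    [(4, "C"), (9, "T")],
    [(9, "T"), (20, "G"), (15, "A")],
]
dictionary = build_dictionary(delta_sets)
encoded = [encode(d, dictionary) for d in delta_sets]
```

Here the pair (15, "A") occurs only once and is therefore stored literally, while each repeated pair is stored as a small index into the dictionary.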
One example of an embodiment of a dictionary is shown in table 600 of
In some implementations, the above-described processes may be implemented in an iterative fashion. For example, a count of reference entries to dictionary entries may be made to determine if the reference sequence should be updated. This may be repeated to maximize the degree of compression. Determination to update a reference sequence may in part be dependent on the size and sequence similarity or level of homology of the source database entries and the additional cost of storing a new reference sequence.
In many actual sets of genomic sequences, the total sequence size varies from individual sequence to sequence. Consequently, in addition to differences between position values from sequence to sequence (i.e., substitutions), nucleotide bases are sometimes added or removed, from one sequence relative to another sequence, depending on which sequence is considered the reference. This may be due to insertions or deletions as described previously. An example of this type of mutation is shown in
Consequently, the compression embodiments described previously with respect to reference sequences may still be applied to biological data having insertions/deletions, but with additional information added to identify the insertions/deletions (as well as, in some cases, other operations). For example, instead of using the position: base value as a stored delta value, an additional parameter, position:action:value may alternately or additionally be used to describe processing actions. The action may be used to identify processing for decoding the encoded sequence information, with the value being either a base or, in some actions, another parameter such as a size or distance parameter.
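Decoding of position:action:value entries may be sketched as follows. The action names SUB, INS and DEL and their exact semantics are hypothetical stand-ins chosen for this example and do not reproduce any particular coding table. Positions are 1-based and refer to the reference; an insertion places its bases after the given position, and a deletion removes the given number of bases starting at the given position. Entries are assumed not to overlap.

```python
# Illustrative sketch only; action names and semantics are assumptions.
def apply_actions(reference, actions):
    """Rebuild a sequence from a reference and (position, action, value)
    entries. Entries are applied from the highest position downward so
    that earlier reference positions are not shifted by later edits."""
    bases = list(reference)
    ordered = sorted(actions, key=lambda entry: entry[0], reverse=True)
    for position, action, value in ordered:
        if action == "SUB":    # value is the replacement base
            bases[position - 1] = value
        elif action == "INS":  # value is the string of inserted bases
            bases[position:position] = list(value)
        elif action == "DEL":  # value is the number of bases removed
            del bases[position - 1:position - 1 + value]
        else:
            raise ValueError("unknown action: " + str(action))
    return "".join(bases)
```

For example, applying (2, "SUB", "T"), (4, "INS", "GG") and (7, "DEL", 2) to the reference "ACGTACGT" yields "ATGTGGAC".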
One example embodiment of a set of action values and corresponding coding is shown in
In addition, other operations may be included in the actions coding. For example, in typical sequences, particular bases are known to repeat, sometimes for many positions. Moreover, sub-sequences may also repeat. Actions to encode this repetition may be used as shown in table 800. In addition, inversions in sequences may occur (i.e., base pairs may be inverted with respect to their corresponding elements, A with T and C with G). Inversion of a segment of a sequence also refers to that segment being in a reverse orientation or opposing polarity relative to its original state, i.e., a segment of base sequence in a 5′ to 3′ orientation being flipped to a 3′ to 5′ polarity. Example codings for repetitions and inversions are shown in
It will be apparent that, while the codings shown in
An example of this is shown in
Attention is now directed to
At stage 905 the source sequences are accessed for further processing. The database information may then be filtered at stage 910. This stage may include an initial processing step to determine characteristics such as, for example, whether the dataset fits the user's criteria for compression based on threshold of similarity in the dataset, as well as what suitable reference sequences may be used for compression based on delta values.
In one exemplary implementation, a sliding nucleotide window may be used to determine similarity. This could be done using a BLAST (Basic Local Alignment Search Tool) or BLAST-like algorithm, which is an algorithm for comparing primary biological sequence information. A BLAST search enables comparing a query sequence with a library of database sequences so as to identify library sequences that resemble the query sequence above a certain threshold. This can be done in an iterative fashion.
A user criterion for DNA sequence compression may be, for example, a set maximum value for the highest delta value for any one entry in the database against a selected reference sequence. Another example of a user criterion may be size, e.g., if the user is interested in compression of bacterial genomic DNA sequences only, then it would be expected that a sequence would be relatively short, for example, on the order of 10^6 bases as compared to the human genome having about 3 billion bases.
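A much simpler stand-in for a sliding window similarity check is sketched below. It is not BLAST and performs no alignment; it assumes two already aligned sequences of equal length and reports the fraction of identical positions in each window. The function names and the threshold parameter are hypothetical.

```python
# Illustrative sketch only; a simple identity measure, not BLAST.
def window_identity(a, b, window, step=1):
    """Return (start, fraction identical) for each window position."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    results = []
    for start in range(0, len(a) - window + 1, step):
        pairs = zip(a[start:start + window], b[start:start + window])
        matches = sum(x == y for x, y in pairs)
        results.append((start, matches / window))
    return results


def meets_threshold(a, b, window, threshold, step=1):
    """True if every window is at least `threshold` identical."""
    return all(f >= threshold for _, f in window_identity(a, b, window, step))
```

A dataset might be accepted for delta-based compression only if, for example, every window of every entry is at least 75% identical to the candidate reference; the specific threshold would be a user criterion.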
For reference sequence selection, a user or processing system may have some knowledge of the type of data contained within databases being used. As discussed previously, the degree of compression will generally be related to the similarity of the sequences. Therefore, in a database containing a million entries of influenza virus, or alternately a particular human gene (such as, for example, the BRCA1 gene), a known sequence could be preselected as a reference. The delta values determined with this reference may suggest that the reference sequence is sub-optimal for the dataset being compressed. Alternatively, a more suitable reference sequence may be generated or assigned as the source database is pre-processed at stage 910. CAM may be used to facilitate rapid processing by allowing fast and efficient parsing of databases with million deep entries.
As noted previously, one or more reference sequences may be used for encoding. By using more reference sequences, the size of the compressed data may incrementally increase. However, with use of fewer reference sequences and/or sub-optimal reference sequences, the degree of compression and/or size of the delta database could be affected. If more than one reference sequence is used, each entry in the resultant compressed database would typically need to account for the specific reference sequence used for encoding (which would impact the degree of compression and total size).
As sequences from the database are initially imported, they are typically aligned. This may be done using CAM in a high speed data plane. Using CAM, delta value calculations may be achieved in as little as one clock cycle.
At stage 920, one or more reference sequences are generated or selected, which may be based on minimal delta value calculations, such as was described previously herein. Additional reference sequences may be generated or selected, which may be done using calculated delta values and/or biological relevance of the dataset for more suitable compression. For example, the uncompressed data may first be pre-processed to determine if a particular SNP or change in RFLP or a set profile (variation) might be present in a large portion of entries from the database. In this case, the original reference sequence may be changed: if, while preprocessing the source database with the initial reference, it becomes apparent that a large number of entries (for example, greater than 50%) contain a SNP or other element at a certain position, then that reference sequence may be updated to include these properties so as to reduce the differential values.
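The reference update described above may be sketched as follows, under the assumption that each entry has already been encoded as a list of 1-based (position, base) substitution deltas against the initial reference. The function name and the majority parameter are hypothetical.

```python
# Illustrative sketch only; handles substitution deltas, not indels.
from collections import Counter


def update_reference(reference, delta_sets, majority=0.5):
    """Fold into the reference any (position, base) delta that is shared
    by more than `majority` of the entries."""
    counts = Counter(pair for deltas in delta_sets for pair in set(deltas))
    bases = list(reference)
    for (position, base), n in counts.items():
        if n / len(delta_sets) > majority:
            bases[position - 1] = base
    return "".join(bases)
```

With the hypothetical reference "ACGTAC" and four entries of which three carry the delta (3, "C"), the updated reference becomes "ACCTAC". The deltas would then be regenerated against the updated reference, so that the three entries no longer store that pair and only the remaining entry stores a delta at position 3.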
Additional reference sequences may be generated or selected in an application specific manner. For example, if the source database contained tens of thousands or millions of complete human genomes, reference may be selected based on delta values within a certain region of the sequence. This may be based on, for example, disease association or other areas of interest.
At stage 940, delta value determination, along with the type of database(s), is used to profile the references. For example, if the database contains biomarker data from breast cancer patients only, other genes that are known to be associated with all of the different forms of breast cancers, in addition to BRCA1, would be present. For example, in this case the delta database would include large deletions and truncations in BRCA1 that are known to be associated with early disease onset like massive tumors before age 30 or, alternatively, these disease symptoms may be known to be associated with hormonal changes that occur after a first child as well. In this case, the deletion or truncation (as described previously herein) can be applied to the selected reference sequence for further compression.
At stage 950, a specific reference sequence is selected, which may be based on a minimum delta value. The reference sequence may be used at stage 960 to generate a dictionary from the dataset. For example, this may be based on all the known mutation events in BRCA1 (not limited to any one gene) correlated with all known clinical and pharmacological effects. Each mutation event within each entry that results in a phenotypic effect as well as silent mutations that are common in several entries can be placed in a dictionary using this approach for further compression of the data. As a result, the processing may take advantage of specific deltas from the reference that are common to multiple entries. A simplified example would be if in the dataset it was observed that a SNP in a certain splice site causes an alternative splicing event resulting in the inclusion of an exon with a premature termination codon upstream of position 1250 then a favorable outcome could be predicted for drug B as shown in Table 2 below.
At stage 970, a correlation table may be generated. This may include embedding data that provides for application specific compression. For example, mutation events with specific disease association or other phenotype can be coded, embedded and compressed along with the delta values in the database. As one example, methylation levels of CpG regions in or near regulatory and coding regions of certain genes are associated with many types of cancer. Consequently, the hypermethylation of BRCA1 and any other associated breast cancer genes can be bit encoded at the specific methylated cytosine, and information on the methylation of the regions known as CpG Islands can be embedded; this can all be compressed along with source data.
At stage 980, compressed DNA sequence data may be stored, which may be based on selected reference sequence(s), delta values and dictionary code, as well as other embedded data such as clinical or pharmacological data, or other data such as from images, screens, scans or other related data.
In various embodiments, aspects of the present invention may be implemented on a computer system or systems, or may be implemented in specific semiconductor devices such as chips or chipsets, ASICS, customized processors or on programmable devices such as FPGAs or other programmable devices.
Attention is now directed to
System 1000 may further include a CAM device 1050, which is configured for very high speed data location by accessing locations in the memory based on content rather than addresses as is done in traditional memories. In addition, one or more database 1060 may be included to store data such as compressed or uncompressed biological sequences, dictionary information, metadata or other data or information, such as computer files. Database 1060 may be implemented in whole or in part in CAM 1050 or may be in one or more separate physical memory devices.
System 1000 may also include one or more network connections 1040 configured to send or receive biological data from other database or computer systems. The network connection may allow users to receive uncompressed or compressed biological sequences from others as well as send uncompressed or compressed sequences. Network connections may include wired or wireless networks, such as Ethernet networks, T1 networks, 802.11 or 802.15 networks, cellular, LTE or other wireless networks, or other networking technologies known or developed in the art.
Memory space 1070 may be configured to store data as well as instructions for execution on processor(s) 1010 to implement the methods and processes described herein, as well as others within the scope of the invention. In particular, memory space 1070 may include a set of biological sequence processing modules including modules for performing processing functions including reference sequence generation, in module 1080, sequence compression, in module 1082, dictionary processing, in module 1084, metadata receipt, processing, and transmission, in module 1086, data integration, in module 1088, as well as other functions in associated modules (not shown). The various modules shown in system 1000 may include hardware, software, firmware or combinations of these to perform the associated functions. Further, the various modules may be combined or integrated, in whole or in part, in various implementations. In some embodiments, one or more elements shown in
Attention is now directed to
In addition, database 1060 may include a set of reference sequences 1064, which, in some embodiments, may be used to perform compression on the source sequences as described herein. In some implementations, the reference sequences may be selected from or generated based on one or more of the source sequences 1062. Alternately or in addition, the reference sequences may be selected for compression based on particular characteristics, such as representation of optimal sequences for compression for a particular species or other characteristics. In some implementations, no reference sequences may be present in the database 1060, with reference sequences selected or determined by a processor, such as processor 1010, based on methods such as those described herein.
In addition, database 1060 may include a set of metadata 1066 as described elsewhere herein. The metadata may include information associated with or otherwise related to sequence information or sequence locations. For example, the metadata may include experimental, clinical and/or pharmacological data associated with a particular segment of the sequences such as with a gene, a part of a gene or a plurality of genes. Other types of metadata may also be included in metadata 1066. In some implementations, metadata 1066 may not be present in database 1060 but may reside in other systems, such as external systems, and may be retrieved or accessed at the time of sequence processing, such as during compression, to be incorporated with the compressed sequence data.
Database 1060 may also include compressed data 1068. Compressed data 1068 may include compressed versions of source sequences 1062, and may also include associated metadata, reference sequence(s) used in compression, dictionary data, and/or other data or information associated with source sequence compression. The compressed data may include delta representations of source sequence information as described elsewhere herein.
Attention is now directed to
Attention is now directed to
Alternately, if no further reference sequence updating is chosen at stage 1350, processing may continue to stage 1350 where the compressed data may be stored in a memory or database, such as database 315 or another database.
Attention is now directed to
An optional module 1420 may be included to generate additional reference sequences and/or for providing dictionary processing as described previously herein. The dictionary processing may include replacing data with address information or other indexing information. The additional reference sequences may be used to generate additional delta representations, which may be used for further compression. An optional module 1430 may be included, module 1430 configured for storing the delta representations and/or dictionary processed data and/or added metadata in a database, such as database 1435. A processing decision stage 1440 may be included to allow for iterative processing in one or more of the modules, such as defining additional reference sequences at module 1410.
Attention is now directed to
A storage module 1540 may be included for storing the compressed data in a memory or database, such as database 1535, which may correspond with database 335 of
Attention is now directed to
Attention is now directed to
Attention is now directed to
Module 1830 is configured to store the compressed data (which may include delta representations, dictionary processed data such as address information, and/or metadata) in a compressed sequence database 1835, which may correspond to database 335 of
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
In some configurations, systems and apparatus for sequence compression include means for performing various functions as described herein. In one aspect, the aforementioned means may be a processor or processors and associated memory in which embodiments reside, such as is shown in
In one or more exemplary embodiments, the functions, methods, processes and/or aspects described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, they may be stored on or encoded as one or more instructions or code on a computer-readable medium. The computer-readable medium may be part of a computer program product and may contain instructions for causing a computer to perform the methods, processes, functions and/or aspects described herein. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computer or processor system. By way of example, and not limitation, such computer-readable media can comprise RAM, DRAM, SRAM, PCM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, processor or other programmable instruction-based apparatus. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
It is understood that the specific order or hierarchy of steps or stages in the processes and methods disclosed are examples of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged while remaining within the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps or stages of a method, process or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC or other semiconductor device. The ASIC may reside in a user terminal or other user system. In the alternative, the processor and the storage medium may reside as discrete components.
The claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. A phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover: a; b; c; a and b; a and c; b and c; and a, b and c.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. It is intended that the following claims and their equivalents define the scope of the invention.
Claims
1. A computer-implemented method of compressing a plurality of data sequences for use in a system including at least a processor and a memory storing a database of biological sequence data, the method comprising:
- determining, based upon comparison of the plurality of data sequences prior to compression of the plurality of data sequences, a reference sequence wherein the reference sequence includes data from the plurality of data sequences;
- processing the plurality of data sequences relative to the reference sequence in order to determine differences between the plurality of data sequences and the reference sequence;
- creating, using the reference sequence and based upon the differences, a plurality of first delta representations of the plurality of data sequences wherein the creating includes specifying, for each of the differences in ones of the plurality of first delta representations, a corresponding position in the modified reference sequence, an operation, and a value associated with the operation;
- defining a modified reference sequence;
- generating, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences;
- storing the plurality of first delta representations and the plurality of second delta representations in the database, thereby representing the plurality of first delta representations in a compressed format and conserving resources of the memory; and
- transmitting at least one of the plurality of first delta representations and one of the plurality of second delta representations.
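By way of non-limiting illustration, the delta representation recited in claim 1 (a position in a reference sequence, an operation, and a value associated with the operation) may be sketched as follows. This is a minimal sketch under stated assumptions and not the disclosed implementation: the names `Delta`, `make_delta`, and `apply_delta` are hypothetical, the general-purpose `difflib.SequenceMatcher` from the Python standard library stands in for whatever sequence comparison an actual embodiment would use, and a deletion is assumed to store its length as its value.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple


class Delta(NamedTuple):
    """Hypothetical record for one difference against a reference."""
    position: int   # 0-based offset into the reference sequence
    operation: str  # "sub", "ins", or "del"
    value: str      # bases for sub/ins; deleted length (as text) for del


def make_delta(reference: str, sequence: str) -> List[Delta]:
    """List the differences needed to turn reference into sequence."""
    deltas: List[Delta] = []
    matcher = SequenceMatcher(None, reference, sequence, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Equal-length replacement: record as a substitution.
            deltas.append(Delta(i1, "sub", sequence[j1:j2]))
            continue
        # Unequal-length replacement is split into an insertion followed
        # by a deletion at the same reference position.
        if tag in ("replace", "insert"):
            deltas.append(Delta(i1, "ins", sequence[j1:j2]))
        if tag in ("replace", "delete"):
            deltas.append(Delta(i1, "del", str(i2 - i1)))
    return deltas


def apply_delta(reference: str, deltas: List[Delta]) -> str:
    """Rebuild a sequence from the reference and its ordered deltas."""
    out: List[str] = []
    cursor = 0  # next unread position in the reference
    for d in deltas:
        out.append(reference[cursor:d.position])  # copy unchanged bases
        cursor = d.position
        if d.operation == "sub":
            out.append(d.value)
            cursor += len(d.value)
        elif d.operation == "ins":
            out.append(d.value)
        elif d.operation == "del":
            cursor += int(d.value)
    out.append(reference[cursor:])
    return "".join(out)


reference = "ACGTACGTAC"
sample = "ACGTTCGTACG"
deltas = make_delta(reference, sample)
restored = apply_delta(reference, deltas)  # equals sample
```

In this sketch only the deltas need to be stored for each sequence; the sequence is recovered by replaying them against the shared reference. When a sequence differs from the reference in only a few places, the stored list is much shorter than the sequence itself.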
2. The computer-implemented method of claim 1 further including evaluating the plurality of first delta representations wherein the defining includes generating the modified reference sequence by modifying the reference sequence based upon the evaluating.
3. The computer-implemented method of claim 1, further comprising replacing a portion of one of the plurality of first delta representations with address information associated with a dictionary structure, the portion of the first delta representation being stored within the dictionary structure.
4. The computer-implemented method of claim 2, further comprising replacing a portion of one of the plurality of second delta representations with address information associated with a dictionary structure, the portion of the second delta representation being stored within the dictionary structure.
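By way of non-limiting illustration, the dictionary replacement recited in claims 3 and 4 may be sketched as follows. This is a sketch under stated assumptions only: deltas are modeled as plain `(position, operation, value)` tuples, the "address information" is assumed to be an integer index into a list, and the names `build_dictionary`, `encode`, and `decode`, as well as the `min_len` threshold, are hypothetical.

```python
from collections import Counter
from typing import List, Tuple, Union

Delta = Tuple[int, str, str]                # (position, operation, value)
Encoded = Tuple[int, str, Union[str, int]]  # value may be a dictionary index


def build_dictionary(delta_sets: List[List[Delta]], min_len: int = 4) -> List[str]:
    """Collect values that recur across deltas and are long enough to be
    worth replacing with an index (min_len is an assumed threshold)."""
    counts = Counter(value for deltas in delta_sets for _, _, value in deltas)
    return sorted(v for v, n in counts.items() if n > 1 and len(v) >= min_len)


def encode(deltas: List[Delta], dictionary: List[str]) -> List[Encoded]:
    """Replace each value found in the dictionary with its index."""
    index = {value: i for i, value in enumerate(dictionary)}
    return [(pos, op, index.get(value, value)) for pos, op, value in deltas]


def decode(encoded: List[Encoded], dictionary: List[str]) -> List[Delta]:
    """Restore original values by looking indices up in the dictionary."""
    return [
        (pos, op, dictionary[value] if isinstance(value, int) else value)
        for pos, op, value in encoded
    ]


delta_sets = [
    [(3, "ins", "GATTACA")],
    [(9, "ins", "GATTACA"), (1, "sub", "T")],
]
dictionary = build_dictionary(delta_sets)
encoded = [encode(deltas, dictionary) for deltas in delta_sets]
```

The recurring value is stored once in the dictionary, and each delta that carried it holds only a small index in its place.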
5. The computer-implemented method of claim 2, wherein the defining is further based upon one or more characteristics of the plurality of first delta representations, wherein the one or more characteristics relate to a set of delta values associated with one or more predefined regions of the plurality of data sequences.
6. The computer-implemented method of claim 5, wherein the plurality of data sequences comprise DNA sequences and wherein the one or more predefined regions are known to be associated with one or more disease conditions.
7. The computer-implemented method of claim 5, wherein the plurality of data sequences comprise DNA sequences and wherein the one or more predefined regions are predicted to be associated with one or more disease conditions.
8. The computer-implemented method of claim 5, wherein the plurality of data sequences comprise DNA sequences and wherein the one or more predefined regions are not known to be associated with a disease condition.
9. The computer-implemented method of claim 1, wherein the defining includes:
- identifying at least one sequence segment common to plural ones of the plurality of first delta representations; and
- modifying the reference sequence in accordance with the at least one sequence segment.
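By way of non-limiting illustration, the defining step recited in claim 9 (identifying a segment common to plural first delta representations and modifying the reference sequence accordingly) may be sketched as follows. The sketch makes several assumptions that are not drawn from the disclosure: deltas are plain `(position, operation, value)` tuples, only substitutions are folded into the reference, a segment counts as "common" when it appears in more than a chosen share of the delta representations, and the name `modify_reference` and the parameter `min_share` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Delta = Tuple[int, str, str]  # (position, operation, value)


def modify_reference(reference: str,
                     delta_sets: List[List[Delta]],
                     min_share: float = 0.5) -> str:
    """Fold substitutions shared by more than min_share of the delta
    representations into the reference (a simplifying assumption;
    overlapping substitutions at different positions are not reconciled)."""
    # Count each distinct delta once per delta representation.
    counts = Counter(d for deltas in delta_sets for d in set(deltas))
    threshold = min_share * len(delta_sets)
    bases = list(reference)
    for (pos, op, value), n in counts.items():
        if op == "sub" and n > threshold:
            # Same-length replacement keeps all other positions valid.
            bases[pos:pos + len(value)] = value
    return "".join(bases)


reference = "ACGTACGT"
delta_sets = [
    [(3, "sub", "A")],
    [(3, "sub", "A")],
    [(5, "sub", "T")],
]
modified = modify_reference(reference, delta_sets)
```

Here the substitution at position 3 is shared by two of the three delta representations, so it is folded into the modified reference. Second delta representations computed against that modified reference no longer need to record it for those two sequences.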
10. The computer-implemented method of claim 9, wherein the plurality of data sequences comprise DNA data sequences and wherein the at least one sequence segment comprises a mutation.
11. The computer-implemented method of claim 1, wherein the plurality of data sequences are stored within a database, the database further including embedded data associated with segments of the plurality of data sequences.
12. The computer-implemented method of claim 11, wherein the plurality of data sequences comprise DNA data sequences and wherein the segments comprise mutations.
13. The computer-implemented method of claim 11, further including replacing at least a portion of the embedded data with address information associated with a dictionary structure, the at least a portion of the embedded data being stored within the dictionary structure.
14. The computer-implemented method of claim 13, wherein the plurality of data sequences comprise DNA data sequences and wherein the embedded data comprises correlative information relating to mutations within the DNA data sequences.
15. The computer-implemented method of claim 14, wherein the correlative information includes pharmacological information.
16. The computer-implemented method of claim 14, wherein the correlative information includes clinical result information.
17. The computer-implemented method of claim 1, wherein the operation corresponding to one of the differences comprises substitution.
18. The computer-implemented method of claim 1, wherein the operation corresponding to one of the differences comprises insertion.
19. (canceled)
20. A computer program product comprising a non-transitory computer readable medium including codes for causing a computer to:
- determine, based upon comparison of a plurality of data sequences to be compressed, a reference sequence wherein the reference sequence includes data from the plurality of data sequences;
- process the plurality of data sequences relative to the reference sequence in order to determine differences between the plurality of data sequences and the reference sequence;
- create, using the reference sequence and based upon the differences, a plurality of first delta representations of the plurality of data sequences wherein for each of the differences, ones of the plurality of first delta representations specify a corresponding position in the modified reference sequence, an operation, and a value associated with the operation;
- define a modified reference sequence by modifying the reference sequence;
- generate, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences;
- store at least one of the plurality of first delta representations and the plurality of second delta representations in a database contained within a memory, thereby representing the plurality of first delta representations and the plurality of second delta representations in a compressed format and conserving resources of the memory; and
- transmit at least one of the plurality of first delta representations and the plurality of second delta representations from the memory.
21. A computer-implemented method for compressing sequence data in a system including at least a processor and a memory containing a database, the method comprising:
- determining, based upon comparison of a plurality of data sequences prior to compression of the plurality of data sequences, a reference sequence wherein the reference sequence includes data from the plurality of data sequences;
- processing the sequence data relative to the reference sequence wherein the processing includes determining differences between the sequence data and the reference sequence;
- receiving metadata relating to the sequence data;
- generating, based upon the processing, a first delta representation of the sequence data wherein the delta representation includes compressed sequence data representative of the sequence data and embedded data representative of one or more characteristics of the sequence data wherein the embedded data includes the metadata and wherein the generating includes specifying, for each of the differences, a corresponding position in the reference sequence, an operation, and a value associated with the operation;
- defining a modified reference sequence by modifying the reference sequence;
- generating, using the modified reference sequence, a second delta representation of the sequence data;
- storing the first delta representation and the second delta representation in a database, thereby representing the sequence data in a compressed format and conserving resources of the processor and the memory; and
- generating a display on a display device using at least one of the first delta representation and the second delta representation.
22-55. (canceled)
Type: Application
Filed: Mar 27, 2017
Publication Date: Feb 22, 2018
Inventor: Lawrence GANESHALINGAM (Laguna Beach, CA)
Application Number: 15/470,848