METHODS AND SYSTEMS FOR PROCESSING GENOMIC DATA

Info

Publication number: 20180052953
Type: Application
Filed: Mar 27, 2017
Publication Date: Feb 22, 2018
Inventor: Lawrence GANESHALINGAM (Laguna Beach, CA)
Application Number: 15/470,848

Abstract

Methods and systems for processing biological sequence data, such as genomic data, are disclosed. In one implementation, a set of biological sequences, such as DNA sequences, may be evaluated and one or more reference sequences may be determined or selected based on the set of biological sequences, which may include iterative determination or selection. The set of biological sequences may be compressed using the one or more reference sequences. Processing may include generating delta representations associated with the biological sequences, as well as generating dictionary information, which may be used for further compression. Metadata may be included in the compressed data. The compressed data may be stored in a compressed data file, which may be in a compressed genomic database or other data storage medium.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of application Ser. No. 12/828,234, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 30, 2010, which claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 61/358,854, entitled METHODS AND SYSTEMS FOR PROCESSING GENOMICS DATA, filed on Jun. 25, 2010, the content of which is incorporated by reference herein in its entirety for all purposes.

DESCRIPTION OF THE TEXT FILE SUBMITTED ELECTRONICALLY

The contents of the text file submitted electronically herewith are incorporated herein by reference in their entirety: A computer readable format copy of the Sequence Listing (filename: ANNA_001_00US_SeqList_ST25.txt, date recorded: Oct. 15, 2010, file size 2 kilobytes).

Please insert the sequence listing, submitted electronically herewith, after the abstract.

FIELD

This application is directed generally to the processing of genomic and other biological sequence data. More particularly, but not exclusively, the application relates to methods and systems for compressing, storing, processing and transmitting biological sequence data and associated information.

BACKGROUND

Deoxyribonucleic acid (“DNA”) sequencing is the process of determining the ordering of nucleotide bases (adenine (A), guanine (G), cytosine (C) and thymine (T)) in molecular DNA. Knowledge of DNA sequences is invaluable in basic biological research as well as in numerous applied fields such as but not limited to medicine, health, agriculture, livestock, population genetics, social networking, biotechnology, forensic science, security, and other areas of biology and life sciences.

Sequencing has been done since the 1970s, when academic researchers began using laborious methods based on two dimensional chromatography. Due to the initial difficulties in sequencing in the early 1970s, the cost and speed could be measured in scientist years per nucleotide base as researchers set out to sequence the first restriction endonuclease site with just a handful of bases.

Thirty years later, the entire 3.2 billion bases of the human genome have been sequenced, with a first complete draft of the human genome done at a cost of about three billion dollars. Since then sequencing costs have rapidly decreased. Today, many expect the cost of sequencing the human genome to be in the hundreds of dollars or less in the near future, with the results available in minutes, much like a routine blood test.

As the cost of sequencing the human genome continues to decrease, the number of individuals having their DNA sequenced for medical, as well as other purposes, will likely explode. Moreover, sequencing of other organisms will likely also increase for research purposes as well as disease analysis. Because of the large size of DNA sequences for many organisms, including humans, this explosion will lead to problems with DNA sequence data storage, transmission and analysis. Accordingly, there is a need in the art to address these problems as well as others.

SUMMARY

This disclosure relates generally to methods and systems for compressing, storing, processing, analyzing and transmitting biological sequence data and associated information.

In one aspect, the disclosure is directed to a computer-implemented method for processing a plurality of biological data sequences, the method comprising evaluating the plurality of data sequences, defining at least a first reference sequence based upon the evaluating, comparing the plurality of data sequences to the first reference sequence and generating, based upon the comparing, a first plurality of delta representations of the plurality of data sequences.

In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of biological data sequences, the method comprising creating, using a reference sequence, a plurality of first delta representations of the plurality of data sequences, defining a modified reference sequence by modifying the reference sequence based upon one or more characteristics of the plurality of first delta representations and generating, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences.

In another aspect, the disclosure is directed to a computer-implemented method for compressing biological sequence data, the method comprising processing the sequence data relative to a reference sequence and generating, based upon the processing, a delta representation of the sequence data wherein the delta representation includes embedded data associated with the sequence data.

In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of biological data sequences contained within a database, the method comprising selecting a reference sequence based upon the plurality of data sequences, compressing the plurality of data sequences using the reference sequence to yield a plurality of first delta representations of the plurality of data sequences, generating a plurality of second delta representations by transforming the plurality of first delta representations using a dictionary procedure and creating a modified reference sequence by modifying the reference sequence based upon the plurality of second delta representations.

In another aspect, the disclosure is directed to a computer-implemented method for compressing a plurality of biological sequences corresponding to a particular species, the method comprising selecting a first reference sequence associated with the particular species, comparing the plurality of biological sequences to the reference sequence and generating, based upon the comparing, a first plurality of delta representations of the plurality of biological sequences. The particular species may be an animal species, plant species, or other species or organism such as bacteria, viruses, fungi, other organisms or agents that use DNA, RNA or biopolymer as genomic material, including, for example, prions. The first reference sequence may be selected from the plurality of biological sequences or may be selected independently of the plurality of biological sequences.

In another aspect, the disclosure is directed to a semiconductor device or devices for performing the computer-implemented methods described above.

In another aspect, the disclosure is directed to a computer system for performing the computer-implemented methods described above.

In another aspect, the disclosure is directed to means for performing the computer-implemented methods described above.

In another aspect, the disclosure is directed to a computer program product including a computer-readable medium containing instructions for execution by a processor to perform the methods described above.

In another aspect, the disclosure is directed to a computer data storage product including genomic sequence data encoded using the computer-implemented methods described above.

Additional aspects are further described below in conjunction with the appended drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The present application may be more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, wherein:

FIG. 1 illustrates details of an example binary coding scheme for base nucleotides in a DNA sequence;

FIG. 2 illustrates an example of a set of binary encoded DNA sequences (SEQ ID NOS.:1 and 3-6) stored in a memory based on the binary coding of FIG. 1;

FIG. 3 illustrates one embodiment of a process for compressing a set of genomic sequences;

FIG. 4A illustrates one embodiment of a process for selecting a reference sequence for use in sequence compression;

FIG. 4B illustrates one embodiment of a table for use in determining a difference vector and associated averages for the set of sequences shown in FIG. 2;

FIG. 4C illustrates one embodiment of example coding (delta representations) using the process of FIG. 4B with the set of sequences shown in FIG. 2;

FIG. 5A illustrates one embodiment of a process for generating a reference sequence for use in sequence compression;

FIG. 5B illustrates a resulting reference sequence (SEQ ID NO.: 7) using the process of FIG. 5A and the sequences of FIG. 2;

FIG. 5C illustrates one embodiment of example coding (delta representations) using the process of FIG. 5B with the sequences shown in FIG. 2;

FIG. 6A illustrates one embodiment of a dictionary for use in further compressing sequences using dictionary processing;

FIG. 6B illustrates example encoding of the sequence of FIG. 5C with the dictionary data of FIG. 6A;

FIG. 7 illustrates example sequences (SEQ ID NOS.:8 and 9) having insertions/deletions;

FIG. 8 illustrates one embodiment of a table for use in sequence encoding using actions;

FIG. 9 illustrates one embodiment of a process for performing sequence compression using dictionary processing;

FIG. 10 illustrates one embodiment of a computer system for processing genomic information;

FIG. 11 illustrates one embodiment of content of a database such as is shown in FIG. 10;

FIG. 12 illustrates one embodiment of a process for sequence compression;

FIG. 13 illustrates one embodiment of a process for sequence compression;

FIG. 14 illustrates one embodiment of a system for sequence compression;

FIG. 15 illustrates one embodiment of a system for sequence compression;

FIG. 16 illustrates one embodiment of a system for sequence compression;

FIG. 17 illustrates one embodiment of a system for sequence compression;

FIG. 18 illustrates one embodiment of a system for sequence compression.

DETAILED DESCRIPTION OF EMBODIMENTS

This disclosure relates generally to processing, storage and transmission of biological sequences such as DNA sequences.

In one aspect, the disclosure is directed to a computer-implemented method for processing a plurality of data sequences, the method comprising evaluating the plurality of data sequences, defining at least a first reference sequence based upon the evaluating, comparing the plurality of data sequences to the first reference sequence and generating, based upon the comparing, a first plurality of delta representations of the plurality of data sequences. The delta representations may include one or more operations capable of being performed to regenerate one of the plurality of data sequences based upon the first reference sequence. The operations may be associated with substitution processing and/or insertion/deletion processing.

The method may further include defining a second reference sequence, performing a comparison of the plurality of delta representations to the second reference sequence and generating, based upon the comparison, a second plurality of delta representations of the plurality of data sequences. The second reference sequence may be generated by modifying, based upon an evaluation of the first plurality of delta representations, the first reference sequence. The method may further include generating, using the second reference sequence, a second plurality of delta representations of the plurality of data sequences. The generating a second reference sequence may include identifying at least one sequence segment common to plural ones of the first plurality of delta representations and modifying the first reference sequence in accordance with the at least one sequence segment. The plurality of data sequences may comprise DNA data sequences, and at least one sequence segment may comprise a mutation. The plurality of data sequences may be stored in a database, and the database may further include embedded data associated with segments of the sequences.

The method may further include replacing at least a portion of the embedded data with address information associated with a dictionary structure, where the at least a portion of the embedded data is stored within the dictionary structure. The plurality of data sequences may comprise DNA data sequences and the embedded data may comprise correlative information relating to mutations within the DNA data sequences. The correlative information may include pharmacological information, clinical result information and/or other information. The one or more operations of the method may comprise substitution, insertion and/or deletion.

The evaluating may include determining an average distance of each of the plurality of data sequences from other of the plurality of data sequences. The reference sequence may comprise one of the plurality of data sequences that has a minimum average distance from others of the plurality of data sequences. The evaluating may include determining, for sequence positions of the plurality of reference sequences, entries for the sequence positions common to plural of the plurality of data sequences. The defining may include assigning the entries to respective sequence positions of the first reference sequence.
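By way of non-limiting illustration, the two approaches described above (selecting the data sequence having the minimum average distance, and assigning to each position of the reference the entry common to plural data sequences) may be sketched as follows. The sketch assumes equal-length sequences, uses Hamming distance as the distance measure, and uses a per-position majority vote; the function names and the example sequences are hypothetical and are not taken from the figures.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def select_reference(sequences):
    """Index of the sequence with the minimum average distance to the others.

    Requires at least two sequences; ties resolve to the earliest index.
    """
    def average_distance(i):
        others = [s for j, s in enumerate(sequences) if j != i]
        return sum(hamming(sequences[i], s) for s in others) / len(others)
    return min(range(len(sequences)), key=average_distance)

def consensus_reference(sequences):
    """Build a reference by assigning the most common entry to each position."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )

# Hypothetical example data (not the sequences of FIG. 2).
sequences = ["ACGT", "ACGA", "ACGT", "TCGT"]
print(select_reference(sequences))
print(consensus_reference(sequences))
```

Other distance measures, such as an edit distance that accounts for insertions and deletions, could be substituted for the Hamming distance assumed here.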

In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of data sequences, the method comprising creating, using a reference sequence, a plurality of first delta representations of the plurality of data sequences, defining a modified reference sequence by modifying the reference sequence based upon one or more characteristics of the plurality of first delta representations and generating, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences. The method may further comprise replacing a portion of one of the plurality of first delta representations with address information associated with a dictionary structure, with the portion of the first delta representation being stored within the dictionary structure. The method may further comprise replacing a portion of one of the plurality of second delta representations with address information associated with a dictionary structure, with the portion of the second delta representation being stored within the dictionary structure. The one or more characteristics may relate to a set of delta values associated with one or more predefined regions of the plurality of data sequences. The plurality of sequences may comprise DNA sequences, and the one or more predefined regions may be known to be associated with a disease condition.

The defining may include identifying at least one sequence segment common to plural ones of the plurality of first delta representations and modifying the reference sequence in accordance with the at least one sequence segment. The plurality of data sequences may comprise DNA data sequences and the at least one sequence segment may comprise a mutation. The plurality of data sequences may be stored within a database, and the database may further include embedded data associated with segments of the plurality of data sequences.
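By way of non-limiting illustration, modifying a reference sequence in accordance with a segment common to plural first delta representations may be sketched as follows. The sketch assumes substitution-only deltas over equal-length sequences and assumes that a difference is folded into the reference when it appears in more than half of the deltas; that threshold, the function names, and the example data are hypothetical.

```python
from collections import Counter

def substitutions(reference, sequence):
    """Delta as a list of (position, base) substitutions; assumes equal lengths."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sequence)) if a != b]

def modify_reference(reference, deltas):
    """Fold substitutions shared by more than half of the deltas into the reference."""
    counts = Counter(entry for delta in deltas for entry in delta)
    modified = list(reference)
    for (position, base), count in counts.items():
        if count > len(deltas) / 2:
            modified[position] = base
    return "".join(modified)

# Hypothetical example data.
reference = "AAAA"
sequences = ["AACA", "AACA", "AACT"]

first_deltas = [substitutions(reference, s) for s in sequences]
modified = modify_reference(reference, first_deltas)
second_deltas = [substitutions(modified, s) for s in sequences]

# Total number of delta entries before and after modifying the reference.
print(modified, sum(map(len, first_deltas)), sum(map(len, second_deltas)))
```

In this example the shared substitution moves into the modified reference, so the second delta representations carry fewer entries than the first.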

The method may further include replacing at least a portion of the embedded data with address information associated with a dictionary structure, and the at least a portion of the embedded data may be stored within the dictionary structure. The plurality of data sequences may comprise DNA data sequences and the embedded data may comprise correlative information relating to mutations within the DNA data sequences. The correlative information may include pharmacological information and/or clinical result information and/or other information.

The generating may include determining differences between the plurality of data sequences and the modified reference sequence. The generating may include specifying, for each of the differences, a corresponding position in the modified reference sequence, an operation, and a value associated with the operation. The operation corresponding to one of the differences may be substitution, insertion or deletion.

In another aspect, the disclosure is directed to a computer-implemented method for compressing sequence data, the method comprising processing the sequence data relative to a reference sequence and generating, based upon the processing, a delta representation of the sequence data wherein the delta representation includes embedded data associated with the sequence data. The method may further comprise replacing a first portion of the delta representation with address information associated with a dictionary structure, with the first portion of the delta representation being stored within the dictionary structure. The embedded data may be included within a second portion of the delta representation, and the method may further comprise replacing the second portion of the delta representation with address information associated with a dictionary structure, the second portion of the delta representation being stored within the dictionary structure.

The processing may include determining differences between the sequence data and the reference sequence. The generating may include specifying, for each of the differences, a corresponding position in the reference sequence, an operation, and a value associated with the operation. The operation corresponding to one of the differences may be substitution, insertion or deletion. The sequence data may comprise a sequence of monomers associated with a polymer. The polymer may be a biopolymer, which may be DNA or RNA.
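By way of non-limiting illustration, a delta representation specifying a position, an operation and a value for each difference may be sketched as follows. The sketch uses the Python standard library class difflib.SequenceMatcher to find the differences, which is a substitute technique chosen for brevity and not necessarily the comparison contemplated herein; the operation codes "S", "I" and "D" and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def make_delta(reference, sequence):
    """List of (position, operation, value) entries relative to the reference.

    "S": value is the substituted bases; "I": value is the bases inserted
    before the position; "D": value is the number of bases deleted.
    """
    delta = []
    matcher = SequenceMatcher(None, reference, sequence, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            delta.append((i1, "S", sequence[j1:j2]))
            continue
        if j2 > j1:  # bases present only in the sequence
            delta.append((i1, "I", sequence[j1:j2]))
        if i2 > i1:  # bases present only in the reference
            delta.append((i1, "D", i2 - i1))
    return delta

def apply_delta(reference, delta):
    """Regenerate the original sequence from the reference and its delta."""
    out, cursor = [], 0
    for position, operation, value in delta:
        out.append(reference[cursor:position])  # copy unchanged bases
        cursor = position
        if operation == "S":
            out.append(value)
            cursor += len(value)
        elif operation == "I":
            out.append(value)
        elif operation == "D":
            cursor += value
    out.append(reference[cursor:])
    return "".join(out)

reference = "ACGCCGTAACGGGTAATTCA"   # SEQ ID NO.: 1
variant = "ACGCCATAACGGTAATTCAT"     # hypothetical variant
delta = make_delta(reference, variant)
print(delta)
print(apply_delta(reference, delta) == variant)
```

Because the reference is shared, only the delta needs to be stored for each sequence, and the round trip through apply_delta is lossless.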

The method may further include replacing compressed information within the delta representation with address information associated with a dictionary structure, with the compressed information being stored within the dictionary structure. The method may further include replacing at least a portion of the embedded data with address information associated with a dictionary structure, with the at least a portion of the embedded data being stored within the dictionary structure. The method may further include compressing embedded data associated with the sequence data by replacing the embedded data with address information associated with a dictionary structure, with the embedded data being stored within the dictionary structure.
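By way of non-limiting illustration, replacing embedded data with address information associated with a dictionary structure may be sketched as follows. The sketch assumes that each delta entry carries an embedded annotation string as its last field and that an address is simply an integer index into a list; the entry layout, the function names, and the annotation strings are hypothetical placeholders.

```python
def build_dictionary(deltas):
    """Store each distinct embedded-data string once and replace it with an address."""
    dictionary = []   # address -> embedded data
    address_of = {}   # embedded data -> address
    encoded = []
    for delta in deltas:
        row = []
        for position, operation, value, embedded in delta:
            if embedded not in address_of:
                address_of[embedded] = len(dictionary)
                dictionary.append(embedded)
            row.append((position, operation, value, address_of[embedded]))
        encoded.append(row)
    return dictionary, encoded

def expand(dictionary, encoded):
    """Restore the embedded data from the dictionary addresses."""
    return [
        [(p, op, v, dictionary[address]) for p, op, v, address in row]
        for row in encoded
    ]

# Hypothetical deltas: (position, operation, value, embedded data).
deltas = [
    [(5, "S", "A", "placeholder annotation 1")],
    [(5, "S", "A", "placeholder annotation 1"),
     (9, "S", "T", "placeholder annotation 2")],
]
dictionary, encoded = build_dictionary(deltas)
print(dictionary)
print(encoded)
```

An annotation shared by many delta representations is stored once in the dictionary, and each occurrence is reduced to a short address.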

The sequence data may comprise DNA sequence data with the embedded data comprising correlative information relating to mutations within the DNA sequence data. The correlative information may include pharmacological information and/or clinical result information and/or other information. The delta representation may include information relating to base modifications within the DNA sequence data. The base modifications may include one or more of methylation, carboxylation, and formylation. The base modifications may include one or more of deamination and/or any other base modification or analogs.

In another aspect, the disclosure is directed to a computer-implemented method of compressing a plurality of data sequences contained within a database, the method comprising selecting a reference sequence based upon the plurality of data sequences, compressing the plurality of data sequences using the reference sequence to yield a plurality of first delta representations of the plurality of data sequences, generating a plurality of second delta representations by transforming the plurality of first delta representations using a dictionary procedure and creating a modified reference sequence by modifying the reference sequence based upon the plurality of second delta representations.

The method may further include compressing the plurality of data sequences using the modified reference sequence. At least one of the plurality of first delta representations may include embedded data associated with a corresponding one of the plurality of delta representations. The dictionary procedure may include replacing at least a portion of the embedded data with address information associated with a dictionary structure, with the at least a portion of the embedded data being stored within the dictionary structure. The creating may include identifying at least one sequence segment common to plural ones of the plurality of first delta representations and modifying the reference sequence in accordance with the at least one sequence segment.

The plurality of data sequences may comprise DNA data sequences, and at least one sequence segment may include a mutation. The plurality of data sequences may be stored within a database, and the database may further include embedded data associated with segments of the plurality of data sequences. The plurality of data sequences may comprise DNA data sequences and the embedded data comprises correlative information relating to mutations within the DNA data sequences. The correlative information may include pharmacological information, clinical result information, and/or other data or information.

The compressing may include determining differences between the plurality of data sequences and the reference sequence. The compressing may include specifying, for each of the differences, a corresponding position in the reference sequence, an operation and a value associated with the operation. The operation corresponding to one of the differences may be substitution, insertion or deletion.

In another aspect, the disclosure is directed to a computer-implemented method for compressing a plurality of biological sequences corresponding to a particular species, the method comprising selecting a first reference sequence associated with the particular species, comparing the plurality of biological sequences to the reference sequence, and generating, based upon the comparing, a first plurality of delta representations of the plurality of biological sequences. The particular species may be an animal species, plant species, or other species or organism such as bacteria, viruses, fungi, or other organisms or agents that use DNA, RNA or biopolymer as genomic material, including, for example, prions. The first reference sequence may be selected from the plurality of biological sequences or may be selected independently of the plurality of biological sequences.

In another aspect, the disclosure is directed to a semiconductor device or devices for performing the computer-implemented methods described above.

In another aspect, the disclosure is directed to a computer system for performing the computer-implemented methods described above.

In another aspect, the disclosure is directed to means for performing the computer-implemented methods described above.

In another aspect, the disclosure is directed to a computer program product including a computer-readable medium containing instructions for execution by a processor to perform the methods described above.

In another aspect, the disclosure is directed to a computer data storage product including biological sequence data encoded using the computer-implemented methods described above.

Various aspects of the present invention are described below. It should be apparent that the teachings herein may be embodied in a wide variety of forms and that any specific structure, function, or both being disclosed herein is merely representative and not intended to be limiting. Based on the teachings herein one skilled in the art should appreciate that an aspect disclosed herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus or system may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, such an apparatus or system may be implemented or such a method may be practiced using other structure, functionality, or structure and functionality in addition to or other than one or more of the aspects set forth herein. Furthermore, an aspect may comprise at least one element of a claim.

While the embodiments described below are generally based on sequences of DNA, it will be apparent that embodiments of the present invention may equally be implemented for processing other biological sequences, such as RNA sequences, protein sequences, or other types of biological sequences. Accordingly, the present invention is not in any way limited in application to DNA sequences, but may rather be applied to a wide range of other biological sequences, as well as, in some applications, other types of non-biological data sequences or computer generated or simulated sequences.

Overview of Genomic Sequencing

Genomic sequences are sequences of data describing genomic characteristics of a particular organism. The term “genomic” generally refers both to data that codes (also referred to as “genetic” data) and to data that is non-coding. The term “genome” refers to an organism's entire hereditary information. Genomic sequencing is the process of determining a particular organism's genomic sequence.

The human genome, as well as that of other organisms, is made of four chemical units called nucleotide bases (also referred to herein as “bases” for brevity). These bases are adenine (A), thymine (T), guanine (G) and cytosine (C). Double stranded sequences are made of paired nucleotide bases, where each base in one strand pairs with a base in the other strand according to the Watson-Crick pairing rule: i.e., A pairs with T and C pairs with G (in RNA, thymine is replaced with uracil (U), which pairs with A).

A sequence is a series of bases, ordered as they are arranged in molecular DNA or RNA. For example, a sequence may include a series of bases arranged in a particular order, such as the following example sequence fragment:

(SEQ ID NO.: 1) ACGCCGTAACGGGTAATTCA.

The human haploid genome contains approximately 3 billion base pairs, which are organized into 23 chromosomes (present as 23 pairs in a diploid cell). The 23 chromosomes include about 30,000 genes. While each individual's genome sequence is different, there is much redundancy between individuals of a particular species, and in many cases there is also much redundancy across similar species. For example, in the human genome the sequences of two individuals are about 99.9% equivalent, and are therefore highly redundant. Viewed in another way, the number of differences in bases in sequences of different individuals is correspondingly small. These differences may include differences in the particular nucleotide at a position in the sequence, also known as a single nucleotide polymorphism or SNP, as well as addition or subtraction of nucleotides between individuals' sequences at corresponding positions in the sequences.

Because of the enormous size of the human genome, as well as the genomes of many other organisms, storing and processing genomic sequences (which are typically separate sequences generated from a particular individual or organism, but may also be a sequence fragment, sub-sequence, sequence of a particular gene, etc.) creates problems with processing, memory storage, and data transmission. Consequently, it is typically important to store the sequences in as little space as possible. Moreover, it is typically important that no information is lost in storage and transmission. Accordingly, processing for storage or transmission of whole or partial sequences should include removing redundant information in a sequence in a lossless fashion.

Existing sequence storage techniques use coding for the four nucleotides (A, C, G and T) which may map them to characters in a text format. This sequence information may be mapped to binary data. For example, A may be mapped to binary 00, C may be mapped to 01, G to 10 and T to 11 as shown in FIG. 1. Other encodings may also be used. These binary encodings may be stored in a computer memory as arranged in the mapped sequence (as shown), or in other arrangements.
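By way of non-limiting illustration, the two-bit mapping described above (A to 00, C to 01, G to 10 and T to 11) may be sketched as follows. Packing the codes into a single integer with the first base in the most significant bits is an assumption made for this sketch, and the function names are hypothetical.

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {code: base for base, code in ENCODE.items()}

def pack(sequence):
    """Pack a DNA string into an integer, two bits per base, first base highest."""
    value = 0
    for base in sequence:
        value = (value << 2) | ENCODE[base]
    return value

def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    return "".join(
        DECODE[(value >> shift) & 0b11]
        for shift in range(2 * (length - 1), -1, -2)
    )

sequence = "ACGCCGTAACGGGTAATTCA"   # SEQ ID NO.: 1
packed = pack(sequence)
print(format(packed, "040b"))       # 20 bases occupy 40 bits
print(unpack(packed, len(sequence)) == sequence)
```

The sequence length is kept alongside the packed value because leading A bases encode as zero bits and would otherwise be lost.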

FIG. 2 illustrates an example of this mapping and memory storage, where the illustrated memory is configured with 16 bit memory locations. However, in other implementations, other memory sizes and configurations could alternately be used. For example, 16, 32 or 64 bit coding could be used where modified bases are involved.

Five sequences, sequences 210-250, are shown, along with associated memory mappings of the sequences in memory locations 210M-250M, which may be in a memory device such as DRAM, SRAM, Flash, CAM, PCM, and the like, or may be in a database such as on a disk drive or other storage media such as DVD ROM, Blu-Ray, etc. In a memory or database, the information shown would require 5 times 40 bits or 200 bits. In this example the sequence size is very small; however, for typical sequences, such as a human sequence, each individual's sequence data would be approximately six billion bits long (i.e., about 6 Gb, or about 0.75 Gigabytes (GB)) if coded as shown.

Consequently, for a database having a relatively small number of sequence entries (for example, 1024 entries or 1K), the database size would approach one terabyte, which is impractical for storage, movement, processing or analyzing for widespread use with current computing technologies. However, in genomic sequences within species (and in many cases across species) the nucleotide bases are typically very similar between individuals, normally having very small deviations. This characteristic of DNA may be used, as further described subsequently herein, to effect coding for compression of sequence data as well as perform other processing and output data generation and distribution.
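By way of illustration, the storage figures given above follow from two bits per base, and the arithmetic may be checked as follows. The sketch assumes decimal units (one GB being 10^9 bytes) and a genome of 3 billion bases; the variable names are hypothetical.

```python
BITS_PER_BASE = 2

# Five example sequences of 20 bases each.
example_bits = 5 * 20 * BITS_PER_BASE

# One human sequence of about 3 billion bases.
genome_bits = 3_000_000_000 * BITS_PER_BASE
genome_gigabytes = genome_bits / 8 / 1e9

# A database of 1024 such sequences, stored without compression.
database_gigabytes = 1024 * genome_gigabytes

print(example_bits)          # bits for the five example sequences
print(genome_bits)           # bits for one human sequence
print(genome_gigabytes)      # gigabytes for one human sequence
print(database_gigabytes)    # gigabytes for 1024 sequences
```

The last figure, 768 GB, is the basis for the statement that such a database would approach one terabyte.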

Variations in the DNA sequences of different individuals are a result of deviations (also known as mutations). For example, one type of mutation relates to substitutions of nucleotide bases at common or reference positions in the sequence. A base substitution (also known as a point mutation) is the result of one base in a sequence at a particular position or reference location being replaced with a different one (relative to another sequence, which may be a reference sequence to which other sequences are compared). A base substitution can be either a transition (a change between two purines or between two pyrimidines, e.g., between G and A, or C and T) or a transversion (a change between a purine and a pyrimidine, e.g., between G and its paired base C, or A and its paired base T). For example, sequence 1 of FIG. 2 has a transition, with reference to sequence 2, at position 20 (i.e., the G of sequence 2 is replaced with an A in sequence 1).
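By way of non-limiting illustration, classifying a base substitution as a transition or a transversion may be sketched as follows. The sketch assumes DNA bases only and equal-length sequences, reports positions counted from 1, and uses hypothetical function names and example sequences.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(old, new):
    """Return "transition" or "transversion" for a single base substitution."""
    if old == new:
        raise ValueError("no substitution: bases are identical")
    if not {old, new} <= PURINES | PYRIMIDINES:
        raise ValueError("this sketch handles only A, C, G and T")
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def classify_differences(reference, sequence):
    """List (position, reference base, sequence base, type), positions from 1."""
    return [
        (i + 1, r, s, substitution_type(r, s))
        for i, (r, s) in enumerate(zip(reference, sequence))
        if r != s
    ]

# Hypothetical example: one transition and one transversion.
print(classify_differences("ACGTACGT", "ACATACCT"))
```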

These seemingly simple and minor mutations are not biologically equivalent and can have significant biological implications and consequences. Transition mutations are more commonly observed and generally result in less deleterious effects on cells, while transversions are generally less common and may lead to more severe phenotypic effects.

In order to express the message encoded in DNA, an RNA copy of the genetic information corresponding to a single gene is translated into the amino acid sequence of the encoded protein. The RNA copy, called a messenger RNA (mRNA) is read by the ribosome in packets of three nucleotide bases called codons. There are 64 codons, of which 61 can be translated. The remaining 3 codons are not translatable and cause the ribosome to stop and disassemble and reinitiate translation of a new message. The 61 codons code for the 20 different amino acids found in proteins. Of the 61 codons, there are 19 codons that encode 10 different amino acids that can be mutated at the first, second or third position to render that specific codon a non-translatable stop codon with a single base substitution. Of these 19 mutant codons, only 5 (coding for 3 different amino acids) result from transitions while the other 14 are the result of transversions. Table 1 lists the set of codons for which single base substitutions can cause conversion to stop codons.

TABLE 1

Stop Codon | Transversions | Transitions
UAA | AAA (Lys), GAA (Glu), UUA (Leu), UCA (Ser), UAU (Tyr), UAC (Tyr) | CAA (Gln), UGA, UAG
UAG | UCG (Ser), AAG (Lys), GAG (Glu), UAU (Tyr), UAC (Tyr), UUG (Leu) | CAG (Gln), UGG (Trp), UAA
UGA | AGA (Arg), UUA (Leu), UGC (Cys), GGA (Gly), UCA (Ser), UGU (Cys) | CGA (Arg), UAA, UGG (Trp)

From Table 1, it may be observed that single base substitutions resulting in termination of translation are caused primarily by transversions. Thus transition mutations leading to a truncated protein product with negative effects are far less likely. An alternative way to consider this is that translation stop codons are important in defining the correct mature C-terminal end of proteins.
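As a non-limiting sketch of the observation above, the codons that lie one base substitution away from each stop codon may be enumerated and split by substitution type. Each stop codon has nine such neighbors: three reached by transitions and six by transversions. The function name single_substitution_neighbors is hypothetical and is used only for illustration.

```python
STOP_CODONS = ("UAA", "UAG", "UGA")
# Transition partner of each RNA base (purine<->purine, pyrimidine<->pyrimidine).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "U", "U": "C"}


def single_substitution_neighbors(codon):
    """Split the nine one-base neighbors of a codon into (transitions, transversions)."""
    transitions, transversions = [], []
    for i, base in enumerate(codon):
        for alt in "ACGU":
            if alt == base:
                continue
            neighbor = codon[:i] + alt + codon[i + 1:]
            if TRANSITION_PARTNER[base] == alt:
                transitions.append(neighbor)
            else:
                transversions.append(neighbor)
    return transitions, transversions


for stop in STOP_CODONS:
    ts, tv = single_substitution_neighbors(stop)
    print(stop, "transitions:", ts, "transversions:", tv)
```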

However, stop codons can also be mutated to a codon that codes for an amino acid, giving rise to a longer-than-intended polypeptide that may have reduced or null function or may be a toxic product. Any transversion-type base change at an existing stop codon will result in a codon that encodes an amino acid; this will allow read-through, since the codon becomes translatable (see Table 1). The only base changes to an existing stop codon that result in preserving a stop codon at that position are transition mutations.

There are various types of substitutions. For example, one base at a particular position may be replaced by one of the other bases, e.g., Transition (G<->A or C<->T) and/or Transversion (G/A<->C/T). In a reversion, the mutation reverts to the original base (at the same or a second site, and the function may be regained). In a silent mutation, a single base substitution results in no change in the corresponding amino acid sequence in the protein being expressed. In a mis-sense mutation, a base substitution causes a change at a single amino acid in a protein sequence. In a non-sense mutation, a base substitution changes a codon specifying an amino acid to one of the three stop codons (UAA, UGA or UAG), thus producing a truncated protein.
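The substitution categories above may be illustrated, under stated assumptions, with a small classifier over RNA codons. The sketch builds the standard genetic code from a compact 64-letter string (base order U, C, A, G; '*' denotes a stop codon); the names CODON_TABLE and classify_codon_change are hypothetical, and the label 'read-through' for a stop codon mutated to a coding codon is an illustrative choice.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify the effect of replacing ref_codon with alt_codon."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_codon == alt_codon:
        return "none"
    if ref_aa == alt_aa:
        return "silent"        # same amino acid (or stop codon preserved)
    if alt_aa == "*":
        return "nonsense"      # coding codon becomes a stop codon
    if ref_aa == "*":
        return "read-through"  # stop codon becomes a coding codon
    return "missense"          # a different amino acid is encoded


print(classify_codon_change("GAA", "UAA"))
```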

In addition to substitutions, mutations may include insertions and deletions. It is noted, however, that other conditions, in addition to substitutions, insertions and deletions, can generate disease conditions. For example, re-arrangement of base sequences, addition of foreign sequences, triplet expansions as well as sequence variations and ordering manipulations may also occur and may result in expressed or unexpressed biological variations, disease conditions, and/or other problems. Each of these types of DNA mutations can be acquired and manifested in different ways and may exert their effects in different or similar fashions.

As with substitutions, there are different types of insertions and deletions. Deletions may include single or multiple base deletions, which are generally randomly distributed in a DNA sequence and are a common replication error that may result in a frame-shift mutation if the number of deleted bases is not a multiple of three. Excision deletions are larger deletions involving removal of a transposable element. The removed elements may be integrated viral sequences or other repeat sequences. Excision deletions are generally precise events that are site directed and can lead to fusion proteins.

Insertions may be simple insertions, where single or multiple bases are inserted, usually during DNA replication. These are typically random events. Transformation insertions are insertions of any foreign DNA sequence into a cell. In particular, conjugation is an integral part of insertions of bacterial DNA sequences into a host genome, and transduction insertions are insertions of viral sequences. Transposition insertions are insertions of a transposable element into a genome; such elements are capable of amplifying to many copies throughout the genome. These are typically not random. Transposition may also include retrotransposons.

Alu family insertions are a 300 base repeat sequence found in various numbers of copies in the human genome and account for about 10 percent of the genome. Insertions in Alu can result in colorectal and breast cancer, hemophilia and other disease conditions. Cross Over insertions are rearrangements at the chromosomal level. These recombinant events can occur between different chromosomes or within pairs. Inversions are recombination events resulting in reversed polarity in a section of the inverted sequence. Splice site mutations result in an alternative splicing event during mRNA processing. Repeat sequences are base sequences repeated throughout the genome. For example, the CA sequence repeats in humans. These may be used in genotyping. SINEs are short interspersed repetitive elements that do not encode reverse transcriptase and that may amplify bases of mobile elements. Both SINEs and LINEs are non-LTR (long terminal repeat) transposable elements. While both types of transposon are duplicated via an RNA intermediate, only LINEs encode an enzyme that reverse transcribes the RNA transcript to give a DNA copy that is integrated in the host genome. SINEs consist of fewer than 500 bases, typically of AluI restriction endonuclease recognition sequences. LINEs are long interspersed repetitive elements that encode reverse transcriptase (i.e., an enzyme that reverse transcribes RNA to DNA). Copy number variations are deletions or duplications of genes that may be associated with particular diseases. Aneuploidy is the condition of having an abnormal number of chromosomes. This may be associated with diseases such as Down's Syndrome.

These define mutation events based on DNA (genomic or mitochondrial) or RNA or proteins.

Compression of Biological Sequence Data

According to aspects of the present invention, compression and associated processing may be used to manage biological sequence data and generate compressed sequence information as is further described herein.

For example, in one aspect, since any two sequences in the human genome are highly similar (i.e., approximately 99.9% the same), a reference sequence may be determined and then used to generate differences or deltas between a particular sequence and the reference sequence by taking advantage of base substitutions. The reference sequence may be determined by, for example, examining a set of available sequences and determining the reference sequence based on a characteristic of the set of sequences. This characteristic may be, for example, a minimal average distance between the set of available sequences and a particular sequence in the available set. The reference sequence may also be based on other sequences, such as a previous set of reference sequences or other forms of reference sequences as are described herein. In some implementations, multiple reference sequences may be used to encode one or more sequences such as may be stored in a database.

In another aspect, alternately or in addition to compression using reference sequences, actions may be used to define processing in conjunction with various mutations, such as insertions or deletions, to facilitate compression and/or other processing of biological data sequences.

Attention is now directed to FIG. 3, which illustrates an embodiment of a process 300 for compressing biological data sequences using a reference sequence, which may be selected or generated as described subsequently herein. Process 300 may begin at stage 310 with selection or generation of one or more reference sequences. This may be based on evaluation of a set of data sequences stored in an uncompressed sequence database. For example, the reference sequence may be selected or generated from a set of sequences in a source database 315, which may contain a set of sequences in an uncompressed format. Alternately, the reference sequence may be predefined or otherwise selected independent of the particular sequences stored in a database such as database 315.

The uncompressed sequences stored in the database may be in a format such as FASTA, which is a text-based format for representing nucleotide or peptide sequences in which base pairs or amino acids are represented using single-letter codes, with optional sequence names and comments preceding the sequences.
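For purposes of explanation, a minimal reader for FASTA-formatted text may be sketched as follows. The function name parse_fasta is hypothetical; the sketch assumes that each record begins with a '>' header line, that lines beginning with ';' are comments, and that sequence data may span multiple lines.

```python
def parse_fasta(text: str) -> dict:
    """Return {record name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


example = ">seq1 example entry\nACGT\nACGT\n>seq2\nTTGA\n"
print(parse_fasta(example))
```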

At stage 320, compression may be applied to the set of uncompressed sequences retrieved from database 315. This stage may include converting the uncompressed data from a text format or other format to a binary format, such as is described subsequently herein. The resulting compressed data may be in the form of delta representations, as further described subsequently herein. In addition, embedded data (e.g., metadata) may also be provided by a user and/or retrieved from a metadata database. The embedded data may be of several types including, but not limited to, deep sequencing data, single nucleotide polymorphisms (SNPs), gene expression data including alternative splicing, microRNA and CpG methylation data, copy number variations from comparative genome hybridization (CGH), transcriptomic, metabolomic and proteomic data, as well as pharmacological data, clinical data, or other data associated with the particular sequence or a part of the sequence. The metadata may optionally be added to the uncompressed sequence data during or after the compression processing.
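One possible form of the text-to-binary conversion mentioned above is a fixed 2-bit code per nucleotide, sketched below. The particular bit assignment (A=00, C=01, G=10, T=11) and the function names pack_bases and unpack_bases are assumptions made for illustration; the sketch handles only the four unambiguous DNA bases.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(seq: str):
    """Pack a DNA string into bytes at 2 bits per base; returns (data, base count)."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    n_bits = 2 * len(seq)
    return value.to_bytes((n_bits + 7) // 8, "big"), len(seq)


def unpack_bases(data: bytes, n_bases: int) -> str:
    """Inverse of pack_bases; n_bases is needed because of byte padding."""
    value = int.from_bytes(data, "big")
    bases = []
    for shift in range(2 * (n_bases - 1), -1, -2):
        bases.append(BITS_TO_BASE[(value >> shift) & 0b11])
    return "".join(bases)


packed, count = pack_bases("ACGTACGTACGTACGTACGT")
print(len(packed) * 8, "bits for", count, "bases")
```

Under this assumed coding, a 20-base sequence occupies 40 bits, as compared to 160 bits when stored as 8-bit text characters.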

At stage 330, the compressed sequence data representing the set of uncompressed sequences stored in database 315, which may also include associated metadata, may be stored in a compressed sequence database 335. This may be in the same physical database as database 315 or may be in an alternate physical database or other form of memory. The compressed sequence data may then be transmitted to other systems or databases via a communications connection, such as via wired or wireless networks, the Internet, cellular networks and the like.

In order to perform compression such as is shown in FIG. 3, a reference sequence or sequences must be selected or generated. Returning to FIG. 2, there are differences in nucleotide bases between the various example sequences as shown in 210-250. For example, sequence 220 has an A in the second position, whereas the other sequences have a C in this position. Various other differences between the sequences shown can also be seen. However, the relative number of differences between the example sequences shown is small, in this case about 10 percent (whereas, as noted previously, in the human genome these differences are considerably smaller, approximately 0.1 percent). These differences can be used in various ways to generate or select one or more reference sequences.

For example, in one embodiment, as shown in FIG. 4A, a group of sequences may be selected from a source database, such as database 315 as shown in FIG. 3, and loaded into a table or other memory structure. In some implementations, the memory structure may be a conventional memory structure such as a dynamic RAM (DRAM) or other conventional memory structure. Alternately or in addition, in some implementations the memory structure may be a Content Addressable Memory (CAM), which is a type of computer memory configured for high speed searching applications. Unlike a standard computer memory where a user supplies a memory address and the memory (e.g., RAM) returns a data word stored at the address, a CAM is designed such that a user supplies a data word and the CAM searches its entire memory to see if the data word is stored anywhere in the memory space. If the data word is found, the CAM returns a list of one or more storage addresses where the word was found (and may also return the data word or other associated pieces of data). A difference vector may then be generated for each entry. The difference vector may be in the form of a number of nucleotide differences between the sequence at a particular location and each other sequence in the set of sequences. An example of this method is shown in FIG. 4B, which illustrates a difference vector table 480 for the 5 sequences shown in FIG. 2. For example, sequences 1 and 2 have two nucleotide differences (i.e., at positions 2 and 20), whereas sequences 1 and 5 have three differences (at positions 4, 9 and 18).
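By way of example only, a difference table of the kind described above may be computed in software as follows. The three 20-base sequences are hypothetical and are not the sequences of FIG. 2; the function names count_differences and difference_table are likewise illustrative. The sketch assumes equal-length, pre-aligned sequences and does not model CAM hardware.

```python
def count_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def difference_table(seqs):
    """Square table whose entry [i][j] is the difference count of seqs[i] vs seqs[j]."""
    return [[count_differences(a, b) for b in seqs] for a in seqs]


# Hypothetical example data.
sequences = [
    "ACGACGTAACGGTAATTCAG",
    "AAGACGTAACGGTAATTCAA",  # differs from the first at positions 2 and 20
    "ACGCCGTATCGGTAATTGAG",  # differs from the first at positions 4, 9 and 18
]
for row in difference_table(sequences):
    print(row)
```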

Once the difference vector (or other difference measurement metric) is determined at stage 420, a selection of one or more reference sequences may be made at stage 430. This selection may be made by, for example, selecting a reference sequence or sequences based on a difference magnitude or average. As shown in FIG. 4B, the average difference magnitudes for sequences 1, 3 and 4 are all 1.8 (shown in boxes 485, 487 and 489, respectively). Any of these sequences could be selected as the reference sequence based on having the minimum average difference with respect to the other sequences in the group. However, additional criteria may be applied to further optimize compression. For example, the sequence having the smallest variance in differences, or the sequence which minimizes the maximum difference, may be chosen. In this example, sequence 4, while having the same average value as sequences 1 and 3, has a difference value of four with respect to sequence 2, and therefore may be excluded. Since sequences 1 and 3 both have maximum differences of three, either may be chosen.
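The selection criteria described above (minimum average difference, with ties broken by the smallest maximum difference) may be sketched as follows. The table values are hypothetical: they were chosen only to be consistent with the averages and maxima stated above and are not the contents of table 480. The function name select_reference_candidates is illustrative.

```python
def select_reference_candidates(diff_table):
    """Return 0-based indices of the sequences with the lowest average difference,
    keeping only those whose maximum difference is smallest among the tied group."""
    n = len(diff_table)
    averages = [sum(row) / n for row in diff_table]
    best_average = min(averages)
    tied = [i for i in range(n) if abs(averages[i] - best_average) < 1e-9]
    best_maximum = min(max(diff_table[i]) for i in tied)
    return [i for i in tied if max(diff_table[i]) == best_maximum]


# Hypothetical symmetric difference table for five sequences.
HYPOTHETICAL_TABLE = [
    [0, 2, 2, 2, 3],  # sequence 1: average 1.8, maximum 3
    [2, 0, 3, 4, 4],  # sequence 2: average 2.6
    [2, 3, 0, 2, 2],  # sequence 3: average 1.8, maximum 3
    [2, 4, 2, 0, 1],  # sequence 4: average 1.8, maximum 4
    [3, 4, 2, 1, 0],  # sequence 5: average 2.0
]
candidates = select_reference_candidates(HYPOTHETICAL_TABLE)
print([i + 1 for i in candidates])  # 1-based sequence numbers
```

With these assumed values, sequences 1 and 3 remain as candidates and sequence 4 is excluded by the maximum-difference criterion.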

For purposes of explanation, sequence 3 may be chosen as the reference sequence. Sequence 3 may then be used to encode the other four sequences by determining deltas or differences with respect to sequences 1, 2, 4 and 5. This may be done by, for example, generating a set of data representing the position of differences between a particular sequence and the reference sequence, along with the corresponding base difference. An example is shown in FIG. 4C, which illustrates difference data for the example sequences of FIG. 2 in table 490. For example, sequence 1 varies from the reference sequence (i.e., sequence 3) at position 4, where the A in position 4 of the reference sequence is replaced by a C. Corresponding differences (deltas) are shown in table 490 for sequences 2, 4 and 5. Sequence 3 will obviously have no deltas since it is the reference sequence in this example.
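A minimal sketch of the position: base delta encoding described above, together with the corresponding decoding, is given below. The function names encode_deltas and decode_deltas and the example sequences are hypothetical; positions are 1-based, and only substitutions between equal-length sequences are handled.

```python
def encode_deltas(reference: str, sequence: str):
    """Return a list of (position, base) pairs where sequence differs from reference."""
    if len(reference) != len(sequence):
        raise ValueError("substitution-only encoding requires equal lengths")
    return [
        (position, base)
        for position, (ref_base, base) in enumerate(zip(reference, sequence), start=1)
        if ref_base != base
    ]


def decode_deltas(reference: str, deltas) -> str:
    """Rebuild the original sequence from the reference and its deltas."""
    bases = list(reference)
    for position, base in deltas:
        bases[position - 1] = base
    return "".join(bases)


# Hypothetical example: a single substitution of C at position 4.
reference = "ACGACGTAACGGTAATTCAG"
sample = "ACGCCGTAACGGTAATTCAG"
deltas = encode_deltas(reference, sample)
print(deltas)
```

Decoding the deltas against the same reference restores the sample exactly, and the reference itself encodes to an empty delta list.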

Using this approach, compressing the sequences shown in FIG. 2 would result in a compressed data size of 103 bits (40 bits for the reference sequence plus 9 times 7 bits (assuming 7 bits are used to encode each position: base pair)). Although the data size reduction in this example is only approximately 50 percent, with actual genomic data, the compression would typically be much greater for at least the following reasons. First, the variation shown in the example is 10 percent (i.e., 2 base positions out of 20 differ), whereas, in actual genomic data the variation is usually much smaller (as noted previously, the sequence data is approximately 99.9 percent similar for humans). Second, in the example shown, the reference sequence accounts for a large percentage of the total number of bits (i.e., 40) because the number of sequences in the encoded set is small (i.e., 5). For typical sets of sequences with large numbers of entries, the number of bits in the reference sequence relative to the total size of the compressed data will become small as the number of sequences increases. In addition, while the example shown uses only one reference sequence for compression, multiple reference sequences may be used to further reduce the compressed size. The added cost of multiple reference sequences will be relatively small with respect to the total size of data transmitted (i.e., 2*N additional bits for each additional reference sequence of length N plus M additional bits per coded sequence to identify the reference sequence used to encode the particular sequence).
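The bit counts above may be reproduced with the following sketch. It assumes, for illustration, 2 bits per base and that each 7-bit position: base pair consists of a 5-bit position (sufficient for 20 positions) plus a 2-bit base; the function names are hypothetical.

```python
from math import ceil, log2


def uncompressed_bits(n_sequences: int, seq_len: int, bits_per_base: int = 2) -> int:
    """Size of the whole set when every base of every sequence is stored."""
    return n_sequences * seq_len * bits_per_base


def delta_coded_bits(seq_len: int, n_deltas: int, bits_per_base: int = 2) -> int:
    """One reference sequence plus one (position, base) pair per delta."""
    position_bits = ceil(log2(seq_len))  # 5 bits for 20 positions
    reference_bits = seq_len * bits_per_base
    return reference_bits + n_deltas * (position_bits + bits_per_base)


coded = delta_coded_bits(seq_len=20, n_deltas=9)
raw = uncompressed_bits(n_sequences=5, seq_len=20)
print(coded, raw, round(coded / raw, 3))
```

Under these assumptions the coded size is 40 + 9 x 7 = 103 bits against 200 bits for the five sequences stored in full, i.e., a reduction of roughly 50 percent.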

Other processes may be used in alternate embodiments to generate reference sequences. It is noted that the reference sequence need not be any particular sequence from a database, such as was shown in the example of FIG. 4B. For example, FIG. 5A illustrates another embodiment of a method for generating a reference sequence that does not select a particular reference sequence from a set as shown in FIG. 4A. In the embodiment of FIG. 5A, a reference sequence is generated from other data or information using process 500. For example, process 500 may be used to generate a reference sequence using data from the set of sequences shown in FIG. 2, or from other data.

Using this approach, bases from each position are evaluated across multiple sequences to determine the most common base associated with a particular position. More specifically, at stage 510, a set of sequences are selected from a database, such as database 315. The base nucleotide values at corresponding positions in each sequence are then compared across the set of sequences at stage 520. The most common or most frequently occurring base value is then selected as the reference value for that position in the sequence at stage 530.

FIG. 5B illustrates example results of application of process 500 to the sequences shown in FIG. 2 in Table 580. The most commonly occurring base at position one is A (i.e., position 1 values for sequences 1-5 of FIG. 2 all have values of A in the first position). For position 2, sequences 1, 3, 4 and 5 have values of C (sequence 2 has a value of A), so C is chosen for this position of the reference sequence. Accordingly, all of the position values are evaluated to determine the most commonly occurring base nucleotide, which is then selected for the corresponding position in the reference sequence. For the set of sequences shown in FIG. 2, the resulting reference sequence is then ACGACGTAACGGTAATTCA (SEQ ID NO.: 2).
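A sketch of the per-position majority selection of process 500 follows. The input sequences are hypothetical short examples rather than those of FIG. 2, and the function name consensus_reference is illustrative. When two bases are equally common at a position, this sketch keeps the one encountered first, which is an arbitrary choice not specified above.

```python
from collections import Counter


def consensus_reference(seqs) -> str:
    """Build a reference from the most common base at each aligned position."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


# Hypothetical example: position 2 is C in four of five sequences.
example_set = ["ACGA", "AAGA", "ACGC", "ACTA", "ACGA"]
print(consensus_reference(example_set))
```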

As with the embodiment shown in FIG. 4A, the reference sequence generated using process 500 may then be used to encode the set of sequences based on a position: base value as was shown in FIG. 4C. For the embodiment of FIG. 5A, the resulting data is shown in FIG. 5C in table 590. In this case, no particular sequence was chosen as the reference, so there will typically be encoded data associated with each of the sequences in the encoded set (as opposed to the example of FIG. 4C, where sequence 3 required no position: base data). In this example, the encoded data would require 96 bits (i.e., 40 for the reference sequence plus 8 times 7 bits), as compared to 103 bits in the example of FIG. 4C.

Although the examples of FIGS. 4 and 5 illustrate two embodiments of methods for reference sequence generation, it is apparent that other methods of reference sequence generation may also be used in alternate implementations within the spirit and scope of the invention.

In some embodiments, in addition to reference sequence generation and use in encoding sequences, a dictionary may be used to further compress the sequence data. For example, in the reference sequence encoded data shown in table 590, several of the position: base pairs occur repeatedly. The pair 4:C occurs twice, as do the pairs 20:G and 9:T. In actual genomics data based on a larger number of sequences in a set, it is likely that many pairs will repeat across the sequences. This characteristic can be used to further compress the data by selecting or generating a dictionary to further encode the position: base pairs using a dictionary process. This approach may include encoding the most frequently occurring pairs with the fewest number of bits.

One example of an embodiment of a dictionary is shown in table 600 of FIG. 6A. In this example, the position: base pairs shown in FIG. 5C are assigned a coding or dictionary location value from 0 through 4. This value could be represented in various ways using binary data. For example, each of the coding values could be represented using 3 bits (for a total of 8 position: base pairs, which leaves 3 additional pairs available in this example). Alternately, the coding values may be assigned based on the frequency of occurrence of the position: base pair, with the most frequently occurring pairs being assigned fewer bits than the least frequently occurring pairs. Other coding schemes may also be used in various implementations.

FIG. 6B illustrates the resulting dictionary coded data for the compressed sequences of FIG. 5C in table 650 using dictionary processing. In this encoding, sequence 0 is coded merely as the encoded value “1,” which represents that the only variation from the reference sequence corresponds to the position: base value stored in dictionary position 1 (i.e., 4:C). Using this encoding, the coded sequence data size is 64 bits (i.e., 40 bits for the reference sequence plus 3 bits/value times 8 dictionary values or 24 bits). If the size of the reference sequence is excluded (which approximates its proportionately small contribution to the total bit size for large sets of sequences), only 24 bits are needed to encode the sequence data, which corresponds to approximately 80 percent compression.
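The dictionary processing described above may be sketched as follows. The delta lists are hypothetical (eight position: base values, five of them distinct) and are not the contents of table 590; the code values assigned here therefore need not match table 600. The function names and the ordering of codes by descending frequency are illustrative assumptions.

```python
from collections import Counter
from math import ceil, log2


def build_dictionary(encoded):
    """Map each distinct (position, base) pair to a code, most frequent first."""
    counts = Counter(pair for deltas in encoded.values() for pair in deltas)
    ordered = sorted(counts, key=lambda pair: (-counts[pair], pair))
    return {pair: code for code, pair in enumerate(ordered)}


def dictionary_encode(encoded, dictionary):
    """Replace every (position, base) pair with its dictionary code."""
    return {name: [dictionary[p] for p in deltas] for name, deltas in encoded.items()}


# Hypothetical delta-encoded sequences.
example_deltas = {
    "seq1": [(4, "C")],
    "seq2": [(2, "A"), (20, "G")],
    "seq3": [(9, "T")],
    "seq4": [(4, "C"), (18, "G")],
    "seq5": [(9, "T"), (20, "G")],
}
dictionary = build_dictionary(example_deltas)
coded = dictionary_encode(example_deltas, dictionary)
bits_per_code = max(1, ceil(log2(len(dictionary))))  # fixed-width codes
total_bits = bits_per_code * sum(len(codes) for codes in coded.values())
print(coded, total_bits)
```

With fixed 3-bit codes, the eight coded values in this hypothetical set occupy 24 bits, excluding the reference sequence and the dictionary itself.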

In some implementations, the above-described processes may be implemented in an iterative fashion. For example, a count of reference entries to dictionary entries may be made to determine if the reference sequence should be updated. This may be repeated to maximize the degree of compression. Determination to update a reference sequence may in part be dependent on the size and sequence similarity or level of homology of the source database entries and the additional cost of storing a new reference sequence.

In many actual sets of genomic sequences, the total sequence size varies from individual sequence to sequence. Consequently, in addition to differences between position values from sequence to sequence (i.e., substitutions), nucleotide bases are sometimes added or removed from one sequence relative to another sequence, depending on which sequence is considered the reference. This may be due to insertions or deletions as described previously. An example of this type of mutation is shown in FIG. 7, which illustrates two additional sequences (sequences 6 and 7), which may be thought of as additions to those of FIG. 2 in the database. In these two database entries, in addition to differences in bases at position 3 (i.e., G in sequence 6, C in sequence 7), an additional G is inserted at position 8 in sequence 6 (or is correspondingly deleted from position 8 in sequence 7 if sequence 6 is considered the reference sequence). These insertions/deletions will generally need to be accounted for in implementations of sequence compression.

Consequently, the compression embodiments described previously with respect to reference sequences may still be applied to biological data having insertions/deletions, but with additional information added to identify the insertions/deletions (as well as, in some cases, other operations). For example, instead of using the position: base value as a stored delta value, an additional parameter, position:action:value may alternately or additionally be used to describe processing actions. The action may be used to identify processing for decoding the encoded sequence information, with the value being either a base or, in some actions, another parameter such as a size or distance parameter.

One example embodiment of a set of action values and corresponding coding is shown in FIG. 8 in table 800. In this implementation, binary values are associated with particular actions that may be performed on the data during the decoding process (with the values correspondingly added to the compressed sequence data during the encoding process). For example, a 00 value may be used to define no operation (NOP) or may be unused. Substitutions, as were described previously herein, may be identified by a substitution operation (with respect to the reference sequence), and insertions or deletions may likewise be defined based on a position address or location (as shown in table 800 as codings 10 and 11).
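A decoder for position:action:value entries of the kind described above may be sketched as follows. The specific code assignments (01 for substitution, 10 for insertion, 11 for deletion) are assumptions made for illustration, since only the use of codings 10 and 11 for insertions or deletions is stated above; the function name apply_actions is hypothetical. Positions are 1-based in reference coordinates, an insertion places its value before the reference base at that position, and the value of a deletion is the number of bases removed.

```python
NOP, SUB, INS, DEL = 0b00, 0b01, 0b10, 0b11  # assumed code assignments


def apply_actions(reference: str, actions) -> str:
    """Decode a sequence from a reference and (position, action, value) entries."""
    out = []
    cursor = 0  # 0-based index of the next unread reference base
    for position, action, value in sorted(actions, key=lambda a: a[0]):
        index = position - 1
        if index < cursor:
            raise ValueError("overlapping actions are not supported in this sketch")
        out.append(reference[cursor:index])  # copy unchanged bases
        cursor = index
        if action == SUB:
            out.append(value)
            cursor += 1                      # skip the replaced base
        elif action == INS:
            out.append(value)                # reference base is kept afterwards
        elif action == DEL:
            cursor += int(value)             # skip the deleted bases
        elif action != NOP:
            raise ValueError(f"unknown action code: {action}")
    out.append(reference[cursor:])
    return "".join(out)


# Substitute C at position 3 and insert G before position 8.
print(apply_actions("ACGTACGT", [(3, SUB, "C"), (8, INS, "G")]))
```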

In addition, other operations may be included in the actions coding. For example, in typical sequences, particular bases are known to repeat, sometimes for many positions. Moreover, sub-sequences may also repeat. Actions to encode this repetition may be used as shown in table 800. In addition, inversions in sequences may occur (i.e., base pairs may be inverted with respect to their corresponding elements, A with T and C with G). Inversion of a segment of a sequence refers, in addition, to that segment being in a reverse orientation or opposing polarity relative to the original state, i.e., a segment of base sequence in a 5′ to 3′ orientation being flipped to a 3′ to 5′ polarity. Example codings for repetitions and inversions are shown in FIG. 8 as codes 100 through 110.
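One way to realize the inversion action described above is to replace a segment with its reverse complement, as sketched below. The function names reverse_complement and apply_inversion are hypothetical; the start position is 1-based, and only the bases A, C, G and T are handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(segment: str) -> str:
    """Complement each base (A<->T, C<->G) and reverse the orientation."""
    return segment.translate(COMPLEMENT)[::-1]


def apply_inversion(sequence: str, start: int, length: int) -> str:
    """Invert `length` bases of `sequence` beginning at 1-based position `start`."""
    i = start - 1
    return sequence[:i] + reverse_complement(sequence[i:i + length]) + sequence[i + length:]


inverted = apply_inversion("GGAACGTT", start=3, length=4)
print(inverted)
```

Applying the same inversion a second time restores the original segment, so the action is its own inverse during decoding.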

It will be apparent that, while the codings shown in FIG. 8 are representative of typical codings that may be used, they are not intended to be in any way limiting. Other codings, as well as other actions, may be used in various implementations.

An example of this is shown in FIG. 2, where a reference database of sequences is used to determine the reference sequence as shown in process 200. While process 200 is shown for purposes of explanation, it is apparent that other methods may be used to determine a reference sequence such that the reference sequence represents a sequence with close similarity with respect to a database of other sequences taken as a whole.

Attention is now directed to FIG. 9, which illustrates an embodiment of a process 900 for encoding genomic sequences using the above-described reference sequence and dictionary processing embodiments. In this implementation, uncompressed source sequences (or in some implementations sub-sequences or sequence fragments), which may be stored in a database such as database 315, in an uncompressed format such as FASTA/GATC (i.e. stored as nucleotides in a text format such as with the characters A, C, G and T), are converted to highly compressed binary representations which may be stored in the form of a compressed binary file in the same database, another database or in other storage devices or systems. This file may include additional information associated with the sequences, which may be in the form of metadata or other data formats.

At stage 905 the source sequences are accessed for further processing. The database information may then be filtered at stage 910. This stage may include an initial processing step to determine characteristics such as, for example, whether the dataset fits the user's criteria for compression based on threshold of similarity in the dataset, as well as what suitable reference sequences may be used for compression based on delta values.

In one exemplary implementation, a sliding nucleotide window may be used to determine similarity. This could be done using a BLAST (Basic Local Alignment Search Tool) or BLAST-like algorithm, which is an algorithm for comparing primary biological sequence information. A BLAST search enables comparing a query sequence with a library of database sequences so as to identify library sequences that resemble the query sequence above a certain threshold. This can be done in an iterative fashion.
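A simple sliding-window similarity check may be sketched as follows. This is not an implementation of BLAST: it assumes two pre-aligned sequences of equal length and merely reports the fraction of matching bases in each window. The function names, window size and threshold are hypothetical.

```python
def window_identities(query: str, target: str, window: int = 10, step: int = 1):
    """Return (1-based start, fraction of matching bases) for each window."""
    if len(query) != len(target):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    if window <= 0 or window > len(query):
        raise ValueError("window must be between 1 and the sequence length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        t = target[start:start + window]
        matches = sum(a == b for a, b in zip(q, t))
        results.append((start + 1, matches / window))
    return results


def passes_similarity(query: str, target: str, window: int, threshold: float) -> bool:
    """True if every window meets the similarity threshold."""
    return all(frac >= threshold for _, frac in window_identities(query, target, window))


print(window_identities("ACGTACGTAC", "ACGTACGTAA", window=5))
```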

A user criterion for DNA sequence compression may be, for example, a set maximum value for the highest delta value for any one entry in the database against a selected reference sequence. Another example of a user criterion may be size, e.g., if the user is interested in compression of bacterial genomic DNA sequences only, then it would be expected that a sequence would be relatively short, for example, on the order of 10⁶ bases as compared to the human genome having about 3 billion bases.

For reference sequence selection, a user or processing system may have some knowledge of the type of data contained within databases being used. As discussed previously, the degree of compression will generally be related to the similarity of the sequences. Therefore, in a database containing a million entries of influenza virus, or alternately a particular human gene (such as, for example, the BRCA1 gene), a known sequence could be preselected as a reference. The delta values determined with this reference may suggest that the reference sequence is sub-optimal for the dataset being compressed. Alternatively, a more suitable reference sequence may be generated or assigned as the source database is pre-processed at stage 910. CAM may be used to facilitate rapid processing by allowing fast and efficient parsing of databases with million deep entries.

As noted previously, one or more reference sequences may be used for encoding. By using more reference sequences, the size of the compressed data may incrementally increase. However, if fewer reference sequences and/or sub-optimal reference sequences are used, the degree of compression and/or the size of the delta database could be affected. If more than one reference sequence is used, each entry in the resultant compressed database would typically need to account for the specific reference sequence used for encoding (which would impact the degree of compression and total size).
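The per-entry bookkeeping for multiple reference sequences may be sketched as follows: each sequence is assigned to the reference from which it differs least, and the number of bits needed to identify that reference is computed. The function names and example data are hypothetical, and the sketch assumes equal-length, pre-aligned sequences.

```python
from math import ceil, log2


def assign_references(seqs, references):
    """For each sequence, return (index of closest reference, number of differences)."""
    assignments = []
    for seq in seqs:
        distances = [sum(a != b for a, b in zip(seq, ref)) for ref in references]
        best = min(range(len(references)), key=distances.__getitem__)
        assignments.append((best, distances[best]))
    return assignments


def reference_id_bits(n_references: int) -> int:
    """Bits needed per coded entry to identify which reference was used."""
    return ceil(log2(n_references)) if n_references > 1 else 0


refs = ["AAAA", "CCCC"]
print(assign_references(["AAAC", "CCCA", "AAAA"], refs), reference_id_bits(len(refs)))
```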

As sequences from the database are initially imported, they are typically aligned. This may be done using CAM in a high speed data plane. Using CAM, delta value calculations may be achieved in as little as one clock cycle.

At stage 920, one or more reference sequences are generated or selected, which may be based on minimal delta value calculations, such as was described previously herein. Additional reference sequences may be generated or selected, which may be done using calculated delta values and/or biological relevance of the dataset for more suitable compression. For example, the uncompressed data may first be pre-processed to determine if a particular SNP or change in RFLP or a set profile (variation) might be present in a large portion of entries from the database. In this case, the original reference sequence may be changed: if, while preprocessing the source database with the initial reference, it becomes apparent that a large number of entries (for example, greater than 50%) contain a SNP or other element at a certain position, then that reference sequence may be updated to include these properties so as to reduce the differential values.
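The reference update described above may be sketched as follows: any position: base delta present in more than a threshold fraction of entries is folded into the reference. The function name update_reference and the example data are hypothetical, and only substitutions are considered. After such an update, the entries would need to be re-encoded against the new reference.

```python
from collections import Counter


def update_reference(reference: str, encoded, threshold: float = 0.5) -> str:
    """Fold deltas shared by more than `threshold` of the entries into the reference.

    `encoded` is a list of delta lists, each holding (1-based position, base) pairs.
    """
    if not encoded:
        return reference
    # Count each delta at most once per entry.
    counts = Counter(pair for deltas in encoded for pair in set(deltas))
    bases = list(reference)
    for (position, base), n_entries in counts.items():
        if n_entries / len(encoded) > threshold:
            bases[position - 1] = base
    return "".join(bases)


# Hypothetical example: 3 of 4 entries carry an A at position 2.
entries = [[(2, "A")], [(2, "A")], [(2, "A"), (4, "C")], []]
print(update_reference("ACGT", entries))
```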

Additional reference sequences may be generated or selected in an application specific manner. For example, if the source database contained tens of thousands or millions of complete human genomes, a reference may be selected based on delta values within a certain region of the sequence. This may be based on, for example, disease association or other areas of interest.

At stage 940, delta value determination, along with the type of database(s), is used to profile the references. For example, if the database contains biomarker data from breast cancer patients only, other genes that are known to be associated with all of the different forms of breast cancers, in addition to BRCA1, would be present. For example, in this case the delta database would include large deletions and truncations in BRCA1 that are known to be associated with early disease onset, such as massive tumors before age 30, or, alternatively, these disease symptoms may be known to be associated with hormonal changes that occur after a first child as well. In this case, the deletion or truncation (as described previously herein) can be applied to the selected reference sequence for further compression.

At stage 950, a specific reference sequence is selected, which may be based on a minimum delta value. The reference sequence may be used at stage 960 to generate a dictionary from the dataset. For example, this may be based on all the known mutation events in BRCA1 (not limited to any one gene) correlated with all known clinical and pharmacological effects. Each mutation event within each entry that results in a phenotypic effect, as well as silent mutations that are common in several entries, can be placed in a dictionary using this approach for further compression of the data. As a result, the processing may take advantage of specific deltas from the reference that are common to multiple entries. A simplified example would be if it was observed in the dataset that a SNP in a certain splice site causes an alternative splicing event resulting in the inclusion of an exon with a premature termination codon upstream of position 1250; in that case, a favorable outcome could be predicted for drug B, as shown in Table 2 below.

TABLE 2. Hypothetical Example of BRCA Mutations With Clinical and Pharmacological Associations

BRCA1 Mutations | Clinical Results | Pharmacological Effects
G to A at Position 1286 | Multiple Small Tumors | Chemical X Inhibits Tumor Growth
Single Base Deletion at Position 932 | Positive Mammogram Result Before Age 25 | Chemical X not Effective, Highly Toxic; Chemical A Low Toxicity, Low Efficacy
Alternative Splice Junction in the 3rd Intron | Highly Aggressive | Chemical A Combined with Chemical Z Is Very Effective
Any Frame Shift Mutation Resulting in a Stop Codon Upstream of Position 1250 | Delayed Disease Onset | Chemical B is Most Effective Treatment
A to C at Position 547 | Most Common in Male Patients; Mild, Slow | Chemical M Effective and Nontoxic
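The stage 950 selection of a reference by minimum delta value can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and uses a simple mismatch count as the delta value; the function names `delta_count` and `select_reference` and the sample sequences are hypothetical and are not taken from the disclosure.

```python
def delta_count(reference, sequence):
    """Count positions at which a sequence differs from the reference."""
    if len(reference) != len(sequence):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    return sum(1 for r, s in zip(reference, sequence) if r != s)


def select_reference(candidates, sequences):
    """Return the candidate with the smallest summed delta over all sequences."""
    return min(candidates, key=lambda c: sum(delta_count(c, s) for s in sequences))


# Hypothetical data set; every source sequence is also a candidate reference.
sequences = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
best = select_reference(sequences, sequences)
```

Other delta measures, such as counts restricted to a region of interest, could be substituted for the mismatch count in the same structure.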

At stage 970, a correlation table may be generated. This may include embedding data that provides for application-specific compression. For example, mutation events with a specific disease association or other phenotype can be coded, embedded and compressed along with the delta values in the database. As one example, methylation levels of CpG regions in or near regulatory and coding regions of certain genes are associated with many types of cancer. Consequently, the hypermethylation of BRCA1 and any other associated breast cancer genes can be bit encoded at the specific methylated cytosine, and information on the methylation of the regions known as CpG islands can be embedded; all of this can be compressed along with the source data.
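One possible way to bit encode methylation at specific cytosines is sketched below. It is an assumption-laden illustration rather than the disclosed encoding: one bit is assigned per CpG site, in order of occurrence, and the bit is set when the cytosine at that site is methylated. The names `cpg_sites`, `encode_methylation` and `decode_methylation` and the sample sequence are hypothetical.

```python
def cpg_sites(sequence):
    """Return the index of the cytosine in every CG dinucleotide."""
    return [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "CG"]


def encode_methylation(sequence, methylated_positions):
    """Pack methylation status into an integer, one bit per CpG site."""
    bits = 0
    for n, position in enumerate(cpg_sites(sequence)):
        if position in methylated_positions:
            bits |= 1 << n
    return bits


def decode_methylation(sequence, bits):
    """Recover the set of methylated cytosine positions from the bit field."""
    return {p for n, p in enumerate(cpg_sites(sequence)) if (bits >> n) & 1}


# Hypothetical sequence with CpG sites at positions 1, 5 and 8.
sequence = "ACGTTCGACGA"
encoded = encode_methylation(sequence, {1, 8})
```

Because the CpG positions are recoverable from the sequence itself, only the bit field needs to be stored alongside the delta values in this sketch.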

At stage 980, compressed DNA sequence data may be stored, which may be based on selected reference sequence(s), delta values and dictionary code, as well as other embedded data such as clinical or pharmacological data, or other data such as from images, screens, scans or other related data.

In various embodiments, aspects of the present invention may be implemented on a computer system or systems, or may be implemented in specific semiconductor devices such as chips or chipsets, ASICs, customized processors or on programmable devices such as FPGAs or other programmable devices.

Attention is now directed to FIG. 10 which illustrates one example embodiment of a computer system 1000 configured to perform biological sequence processing as described herein. System 1000 includes one or more processors 1010, along with a memory space 1070, which may comprise one or more physical memory devices such as DRAM, SRAM, Phase Change Memory (PCM), Flash or other memory elements known or developed in the art. System 1000 may also include peripherals such as a display 1020, user input/output devices such as mice, keyboards, etc. (not shown for clarity), one or more media drives 1030, as well as other devices used in conjunction with computer systems (not shown).

System 1000 may further include a CAM device 1050, which is configured for very high speed data location by accessing locations in the memory based on content rather than addresses as is done in traditional memories. In addition, one or more databases 1060 may be included to store data such as compressed or uncompressed biological sequences, dictionary information, metadata or other data or information, such as computer files. Database 1060 may be implemented in whole or in part in CAM 1050 or may be in one or more separate physical memory devices.
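The content-based access performed by a CAM can be approximated in software for illustration. The sketch below is only an analogy and does not model the hardware: a hash index maps each stored word to the addresses holding it, so a lookup is made by content rather than by address. The name `build_content_index` and the stored words are hypothetical.

```python
from collections import defaultdict


def build_content_index(memory_words):
    """Map each stored word to the list of addresses that contain it."""
    index = defaultdict(list)
    for address, word in enumerate(memory_words):
        index[word].append(address)
    return index


# Hypothetical memory contents: short sequence fragments stored by address.
memory_words = ["ACGT", "TTGA", "ACGT", "GGCC"]
index = build_content_index(memory_words)

# Query by content: which addresses hold the fragment "ACGT"?
matches = index.get("ACGT", [])
```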

System 1000 may also include one or more network connections 1040 configured to send or receive biological data from other databases or computer systems. The network connection may allow users to receive uncompressed or compressed biological sequences from others as well as send uncompressed or compressed sequences. Network connections may include wired or wireless networks, such as Ethernet networks, T1 networks, 802.11 or 802.15 networks, cellular, LTE or other wireless networks, or other networking technologies known or developed in the art.

Memory space 1070 may be configured to store data as well as instructions for execution on processor(s) 1010 to implement the methods and processes described herein, as well as others within the scope of the invention. In particular, memory space 1070 may include a set of biological sequence processing modules including modules for performing processing functions including reference sequence generation, in module 1080, sequence compression, in module 1082, dictionary processing, in module 1084, metadata receipt, processing, and transmission, in module 1086, data integration, in module 1088, as well as other functions in associated modules (not shown). The various modules shown in system 1000 may include hardware, software, firmware or combinations of these to perform the associated functions. Further, the various modules may be combined or integrated, in whole or in part, in various implementations. In some embodiments, one or more elements shown in FIG. 10 may be implemented in a semiconductor device such as an ASIC, programmable device, special purpose processor, chipset or other digital processing device. For example, in one embodiment, processor 1010, memory 1070 and/or other elements of system 1000 such as the database 1060 and/or CAM 1050 may be implemented in one or more ASICs or other semiconductor devices.

Attention is now directed to FIG. 11, which illustrates details of an embodiment of a database 1060 such as is shown in FIG. 10. The database 1060 may be a single database, computer system memory or data storage configuration, or other storage apparatus, or may comprise a plurality of databases or systems containing the various components shown in FIG. 11. In particular, database 1060 may include a set of source data sequences 1062. These may be any of the various types of biological sequences described herein; however, in an exemplary embodiment source sequences 1062 comprise a set of DNA sequences. The DNA sequences may be of a particular organism, such as a human, and may contain a corresponding set of sequences of various individuals of the species. In some implementations the sequences may be of other types of animals, plants, microbes, or other organisms, and in some implementations the sequences may include synthesized, experimental or computer-generated biological sequences, such as simulated biological sequence data. For example, sequences may be of bacteria, viruses, fungi, other organisms or agents that use DNA, RNA or biopolymer as genomic material, including, for example, prions. In some implementations the source sequences may contain sequences of different species, which may, in some cases, be similar or related species.

In addition, database 1060 may include a set of reference sequences 1064, which, in some embodiments, may be used to perform compression on the source sequences as described herein. In some implementations, the reference sequences may be selected from or generated based on one or more of the source sequences 1062. Alternately or in addition, the reference sequences may be selected for compression based on particular characteristics, such as representation of optimal sequences for compression for a particular species or other characteristics. In some implementations, no reference sequences may be present in the database 1060, with reference sequences selected or determined by a processor, such as processor 1010, based on methods such as those described herein.

In addition, database 1060 may include a set of metadata 1066 as described elsewhere herein. The metadata may include information associated with or otherwise related to sequence information or sequence locations. For example, the metadata may include experimental, clinical and/or pharmacological data associated with a particular segment of the sequences such as with a gene, a part of a gene or a plurality of genes. Other types of metadata may also be included in metadata 1066. In some implementations, metadata 1066 may not be present in database 1060 but may reside in other systems, such as external systems, and may be retrieved or accessed at the time of sequence processing, such as during compression, to be incorporated with the compressed sequence data.

Database 1060 may also include compressed data 1068. Compressed data 1068 may include compressed versions of source sequences 1062, and may also include associated metadata, reference sequence(s) used in compression, dictionary data, and/or other data or information associated with source sequence compression. The compressed data may include delta representations of source sequence information as described elsewhere herein.

Attention is now directed to FIG. 12, which illustrates an embodiment of a process 1200 for performing data compression of a set of sequences, which may be stored in a database such as database 315, using one or more reference sequences. At stage 1210, one or more reference sequences may be selected, such as from the sequences in database 315 or from other sequence data sets. Alternately, one or more reference sequences may be generated, such as based on a plurality of sequences in a database such as database 315. At stage 1220, the reference sequence(s) may be used to compress the set of sequences, such as described previously herein. In addition, dictionary processing, such as described previously, may be used at stage 1230 to generate dictionary data for further compression. The dictionary data may be stored in a memory or database or other storage apparatus for use in further compression. At stage 1240, the sequences may be further compressed using the dictionary data, and/or metadata may be added to the compressed sequences. The metadata may be, for example, clinical or pharmacological metadata or other metadata which may be associated with the sequences. The metadata may be provided from a metadata database 1245. At stage 1250, the compressed data, which may include a set of delta representations, dictionary data or information, metadata and/or other data, may be stored in a memory or compressed data database, such as database 1255. Database 1255 may be the same database as source database 315 or may be a separate database.
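The stages of process 1200 can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: sequences are pre-aligned and of equal length, only substitutions are handled, each delta entry is a (position, operation, value) tuple, and recurring entries are replaced by a ("dict", index) address. All function names, sample sequences and the metadata note are hypothetical.

```python
from collections import Counter


def make_delta(reference, sequence):
    """Substitution-only delta: one (position, operation, value) per mismatch."""
    if len(reference) != len(sequence):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    return [(i, "sub", s) for i, (r, s) in enumerate(zip(reference, sequence)) if r != s]


def build_dictionary(deltas, min_count=2):
    """Collect delta entries that recur in at least min_count sequences."""
    counts = Counter(entry for delta in deltas for entry in delta)
    return [entry for entry, n in counts.items() if n >= min_count]


def compress(reference, sources, metadata):
    # Stage 1220: delta representations relative to the selected reference.
    deltas = {name: make_delta(reference, seq) for name, seq in sources.items()}
    # Stage 1230: dictionary of entries shared by several sequences.
    dictionary = build_dictionary(deltas.values())
    address = {entry: idx for idx, entry in enumerate(dictionary)}
    # Stage 1240: replace shared entries with addresses and attach metadata.
    records = {}
    for name, delta in deltas.items():
        encoded = [("dict", address[e]) if e in address else e for e in delta]
        records[name] = {"delta": encoded, "metadata": metadata.get(name)}
    # Stage 1250: the returned structure stands in for the compressed database.
    return {"reference": reference, "dictionary": dictionary, "records": records}


def decompress(store, name):
    """Rebuild a source sequence from the reference, dictionary and delta."""
    bases = list(store["reference"])
    for entry in store["records"][name]["delta"]:
        if entry[0] == "dict":
            entry = store["dictionary"][entry[1]]
        position, operation, value = entry
        if operation == "sub":
            bases[position] = value
    return "".join(bases)


reference = "ACGTACGT"  # Stage 1210: reference assumed to be already selected.
sources = {"s1": "ACGTACGA", "s2": "ACCTACGA", "s3": "ACGTACGT"}
metadata = {"s2": {"note": "hypothetical clinical annotation"}}
store = compress(reference, sources, metadata)
```

Insertions and deletions would need an alignment step and additional operation types, which this sketch omits.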

Attention is now directed to FIG. 13, which illustrates an embodiment of a process 1300 for performing data compression of a set of sequences, which may be stored in a database such as database 315, using one or more reference sequences. At stage 1310 a reference sequence may be selected (or updated on a subsequent iteration). At stage 1320, a set of sequences from source database 315 may be compressed, which may result in generation of a set of delta representations of the source sequences. At stage 1330 dictionary processing, such as described previously herein, may be performed on the compressed data, such as on the delta representations. At stage 1340 a decision may be made as to whether to further update the reference sequence(s), which may be done in an iterative fashion. For example, a determination may be made at stage 1340 as to whether further compression may be achieved using a different reference sequence. If additional reference sequence selection is chosen, process execution may return to a previous stage, such as stage 1310 for updating of the reference sequence.

Alternately, if no further reference sequence updating is chosen at stage 1340, processing may continue to stage 1350 where the compressed data may be stored in a memory or database, such as database 315 or another database.
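The iterative loop of process 1300 can be illustrated with a short sketch. The update rule used here is an assumption chosen for simplicity: any substitution that appears in the deltas of more than half of the source sequences is folded into the reference, and iteration stops when no such shared substitution remains. Sequences are assumed to be pre-aligned and of equal length, and the names `make_delta` and `refine_reference` are hypothetical.

```python
from collections import Counter


def make_delta(reference, sequence):
    """Return (position, value) pairs where the sequence differs from the reference."""
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sequence)) if r != s]


def refine_reference(reference, sequences, max_rounds=10):
    """Repeat stages 1310 to 1340 until no majority-shared delta remains."""
    for _ in range(max_rounds):
        deltas = [make_delta(reference, seq) for seq in sequences]       # stage 1320
        counts = Counter(entry for delta in deltas for entry in delta)   # stage 1330
        shared = [e for e, n in counts.items() if 2 * n > len(sequences)]
        if not shared:                                                   # stage 1340
            break
        bases = list(reference)
        for position, value in shared:                                   # stage 1310 update
            bases[position] = value
        reference = "".join(bases)
    return reference, [make_delta(reference, seq) for seq in sequences]


sequences = ["ACGTACGA", "ACCTACGA", "ACGTACGA"]
final_reference, final_deltas = refine_reference("ACGTACGT", sequences)
```

Each accepted update strictly reduces the total number of delta entries in this sketch, because the substitution removed from a majority of the deltas is added to only a minority.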

Attention is now directed to FIG. 14, which illustrates one embodiment of a system 1400 including modules configured for processing sequence data in accordance with aspects of the present invention. Module 1405 is configured for evaluating a plurality of source sequences, which may be biological sequences, stored in a source sequence database 1480, which may correspond with database 315 of FIG. 3. The biological sequences may be DNA sequences, RNA sequences or other biological data sequences. The evaluating may include determining, for sequence positions of the plurality of reference sequences, entries for the sequence positions common to plural of the plurality of data sequences. Module 1410 is configured to define one or more reference sequences based on the source sequence database. The defining may include assigning the entries to respective sequence positions of the first reference sequence. Module 1415 is configured for comparing the defined reference sequence(s) to the source sequences in database 1480. Based on the comparing, a set of delta representations may be generated and stored in a memory or database, such as database 1435, which may correspond with database 335 of FIG. 3.
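The evaluating, defining and comparing performed by modules 1405, 1410 and 1415 can be sketched as a position-wise consensus followed by delta generation. This is a minimal illustration assuming pre-aligned, equal-length sequences, with the most frequent entry at each position taken as the common entry; the function names and sample sequences are hypothetical.

```python
from collections import Counter


def consensus_reference(sequences):
    """Assign to each position the entry most common among the source sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))


def delta_representations(reference, sequences):
    """List the (position, value) differences of each sequence from the reference."""
    return [
        [(i, s) for i, (r, s) in enumerate(zip(reference, seq)) if r != s]
        for seq in sequences
    ]


sources = ["ACGT", "ACGA", "TCGT"]
reference = consensus_reference(sources)
deltas = delta_representations(reference, sources)
```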

An optional module 1420 may be included to generate additional reference sequences and/or for providing dictionary processing as described previously herein. The dictionary processing may include replacing data with address information or other indexing information. The additional reference sequences may be used to generate additional delta representations, which may be used for further compression. An optional module 1430 may be included, with module 1430 configured for storing the delta representations and/or dictionary processed data and/or added metadata in a database, such as database 1435. A processing decision stage 1440 may be included to allow for iterative processing in one or more of the modules, such as defining additional reference sequences at module 1410.

Attention is now directed to FIG. 15, which illustrates one embodiment of a system 1500 including modules configured for processing sequence data in accordance with aspects of the present invention. Module 1505 is configured for creating a first set of delta representations based on one or more reference sequences, which may have been previously generated or may be stored in a database, such as reference sequence database 1570. The delta representations are created in module 1505 based on source sequences in database 1580, which may correspond to database 315 of FIG. 3. Module 1510 is configured to modify the selected reference sequence(s), which may be based on one or more characteristics of the delta representations. An optional module 1515 may be included to generate a second set of delta representations based on the modified reference sequence. An optional dictionary module 1520 may be included for performing dictionary processing on the first set of delta representations or a second or subsequent set of delta representations. An optional metadata module 1530 may be included to add metadata, such as clinical, pharmaceutical or other metadata to the compressed sequence data.

A storage module 1540 may be included for storing the compressed data in a memory or database, such as database 1535, which may correspond with database 335 of FIG. 3. A processing decision stage 1560 may be included to allow for iterative processing in one or more of the modules, such as further modifying reference sequences at stage 1510.

Attention is now directed to FIG. 16, which illustrates one embodiment of a system 1600 including modules configured for processing sequence data in accordance with aspects of the present invention. Module 1605 may be configured to process sequence data relative to a reference sequence, which may be selected from a reference sequence database 1670. The processing may include generating compressed data from sequence data that may be stored in a source sequence database 1680, which may correspond with database 315 of FIG. 3. Module 1610 is configured to receive metadata, such as from a metadata database 1690 or other source of metadata and combine the metadata with the delta representation. The metadata may be associated with particular segments of the source sequences, a certain sequence, or multiple source sequences. The metadata may be clinical data, pharmaceutical data and/or other types of metadata. Module 1620 is configured to store the compressed data in a memory or database, such as database 1635, which may correspond with database 335 of FIG. 3. A processing decision stage 1630 may be included to allow for iterative processing in one or more of the modules, such as further adding metadata to delta representations at stage 1610, for further iterating reference sequences, for further performing dictionary processing or for other iterative processing.
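The combining of metadata with a delta representation in module 1610 can be illustrated as follows. The record layout is an assumption made for this sketch: each metadata record covers a half-open range of sequence positions and carries a free-text note, and a delta entry receives every note whose range contains its position. The name `annotate_delta`, the field names and the sample values are hypothetical.

```python
def annotate_delta(delta, metadata_records):
    """Attach to each delta entry the notes of all segments containing its position."""
    annotated = []
    for position, operation, value in delta:
        notes = [
            record["note"]
            for record in metadata_records
            if record["start"] <= position < record["end"]
        ]
        annotated.append(
            {"position": position, "operation": operation, "value": value, "metadata": notes}
        )
    return annotated


# Hypothetical delta and metadata; the segment covers positions 0 through 9.
delta = [(5, "sub", "A"), (40, "sub", "G")]
metadata_records = [{"start": 0, "end": 10, "note": "hypothetical clinical note"}]
annotated = annotate_delta(delta, metadata_records)
```

Metadata that applies to a whole sequence or to several sequences could be stored once at the record level instead of per entry.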

Attention is now directed to FIG. 17, which illustrates one embodiment of a system 1700 including modules configured for processing sequence data in accordance with aspects of the present invention. Module 1705 may be configured for selecting one or more reference sequences, which may be selected from a sequence in a source sequence database 1780, which may correspond with database 315 of FIG. 3. The sequence may be selected such as described previously herein. Module 1710 is configured to compress a set of sequences, which may be stored in database 1780, using the selected reference sequence(s). Module 1715 may be further configured to generate a second set of delta representations using a dictionary process, such as described previously herein. Module 1720 is configured for creating a modified reference sequence(s) by modifying the reference sequence(s) based on the plurality of second delta representations. Module 1725 is configured to compress the source sequences using the modified reference sequence, typically to achieve higher compression. Module 1730 is configured to store the compressed data in a memory or database, such as database 1735, which may correspond with database 335 of FIG. 3. A processing decision stage 1740 may be included to allow for iterative processing in one or more of the modules, such as further compressing data sequences at stage 1710, further iterating reference sequence modification and/or dictionary processing, and/or for other iterative processing.

Attention is now directed to FIG. 18, which illustrates one embodiment of a system 1800 including modules configured for processing sequence data in accordance with aspects of the present invention. Module 1805 may be configured for selecting one or more species specific reference sequences from a reference sequence database 1870. The particular species may be associated with an animal species sequence, plant species sequence or other sequence. The selected species may be selected from a sequence in the sequence database 1880, which may correspond to database 315 of FIG. 3, or may be selected independent of the sequences in the database 1880. Module 1810 is configured to compare the selected reference sequence(s) with the source sequences, which may be stored in database 1880. Module 1815 is configured to generate a set of delta representations based on the selected reference sequence(s) and the source sequences. The delta representations may represent a compressed version of the source sequences. An optional module 1820 may be included, with module 1820 configured to perform dictionary processing, such as is described previously herein. An optional module 1825 may be included to incorporate metadata into the delta representations, so as to create compressed data including metadata. The metadata may be associated with various clinical, pharmaceutical and/or other characteristics of the sequences, such as described previously herein.

Module 1830 is configured to store the compressed data (which may include delta representations, dictionary processed data such as address information, and/or metadata) in a compressed sequence database 1835, which may correspond to database 335 of FIG. 3, for further analysis, processing, storage and/or transmission. A processing decision stage 1840 may be included to allow for iterative processing in one or more of the modules, such as generating further delta representations at module 1815.

The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.

In some configurations, systems and apparatus for sequence compression include means for performing various functions as described herein. In one aspect, the aforementioned means may be a processor or processors and associated memory in which embodiments reside, such as is shown in FIG. 10, and which are configured to perform the functions recited by the aforementioned means. These may be, for example, modules or apparatus residing in system 1000 and stored in memory space 1070. In another aspect, the aforementioned means may be a module or any apparatus configured to perform the functions recited by the aforementioned means. In another aspect, the aforementioned means may be one or more integrated circuits configured to perform the functions recited by the aforementioned means. In various implementations, the means may be configured to implement the various functions, methods, processes and/or aspects as described herein, as well as others within the scope of the invention.

In one or more exemplary embodiments, the functions, methods, processes and/or aspects described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, they may be stored on or encoded as one or more instructions or code on a computer-readable medium. The computer-readable medium may be part of a computer program product and may contain instructions for causing a computer to perform the methods, processes, functions and/or aspects described herein. Computer-readable media include computer storage media. Storage media may be any available media that can be accessed by a computer or processor system. By way of example, and not limitation, such computer-readable media can comprise RAM, DRAM, SRAM, PCM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, processor or other programmable instruction-based apparatus. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

It is understood that the specific order or hierarchy of steps or stages in the processes and methods disclosed are examples of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged while remaining within the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.

Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps or stages of a method, process or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC or other semiconductor device. The ASIC may reside in a user terminal or other user system. In the alternative, the processor and the storage medium may reside as discrete components.

The claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. A phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover: a; b; c; a and b; a and c; b and c; and a, b and c.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. It is intended that the following claims and their equivalents define the scope of the invention.

Claims

1. A computer-implemented method of compressing a plurality of data sequences for use in a system including at least a processor and a memory storing a database of biological sequence data, the method comprising:

determining, based upon comparison of the plurality of data sequences prior to compression of the plurality of data sequences, a reference sequence wherein the reference sequence includes data from the plurality of data sequences;

processing the plurality of data sequences relative to the reference sequence in order to determine differences between the plurality of data sequences and the reference sequence;

creating, using the reference sequence and based upon the differences, a plurality of first delta representations of the plurality of data sequences wherein the creating includes specifying, for each of the differences in ones of the plurality of first delta representations, a corresponding position in the modified reference sequence, an operation, and a value associated with the operation;

defining a modified reference sequence;

generating, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences;

storing the plurality of first delta representations and the plurality of second delta representations in the database, thereby representing the plurality of first delta representations in a compressed format and conserving resources of the memory; and

transmitting at least one of the plurality of first delta representations and one of the second plurality of representations.

2. The computer-implemented method of claim 1 further including evaluating the plurality of first delta representations wherein the defining includes generating the modified reference sequence by modifying the reference sequence based upon the evaluating.

3. The computer-implemented method of claim 1, further comprising replacing a portion of one of the plurality of first delta representations with address information associated with a dictionary structure, the portion of the first delta representation being stored within the dictionary structure.

4. The computer-implemented method of claim 2, further comprising replacing a portion of one of the plurality of second delta representations with address information associated with a dictionary structure, the portion of the second delta representation being stored within the dictionary structure.

5. The computer-implemented method of claim 2, wherein the defining is further based upon one or more characteristics of the plurality of first delta representations, wherein the one or more characteristics relate to a set of delta values associated with one or more predefined regions of the plurality of data sequences.

6. The computer-implemented method of claim 5, wherein the plurality of sequences comprise DNA sequences and wherein the one or more predefined regions are known to be associated with one or more disease conditions.

7. The computer-implemented method of claim 5, wherein the plurality of sequences comprise DNA sequences and wherein the one or more predefined regions are predicted to be associated with one or more disease conditions.

8. The computer-implemented method of claim 5, wherein the plurality of sequences comprise DNA sequences and wherein the one or more predefined regions are not known to be associated with a disease condition.

9. The computer-implemented method of claim 1, wherein the defining includes:

identifying at least one sequence segment common to plural ones of the plurality of first delta representations; and

modifying the reference sequence in accordance with the at least one sequence segment.

10. The computer-implemented method of claim 9, wherein the plurality of data sequences comprise DNA data sequences and wherein the at least one sequence segment comprises a mutation.

11. The computer-implemented method of claim 1, wherein the plurality of data sequences are stored within a database, the database further including embedded data associated with segments of the plurality of data sequences.

12. The computer-implemented method of claim 11, wherein the plurality of data sequences comprise DNA data sequences and wherein the segments comprise mutations.

13. The computer-implemented method of claim 11, further including replacing at least a portion of the embedded data with address information associated with a dictionary structure, the at least a portion of the embedded data being stored within the dictionary structure.

14. The computer-implemented method of claim 13, wherein the plurality of data sequences comprise DNA data sequences and wherein the embedded data comprises correlative information relating to mutations within the DNA data sequences.

15. The computer-implemented method of claim 14, wherein the correlative information includes pharmacological information.

16. The computer-implemented method of claim 14, wherein the correlative information includes clinical result information.

17. The computer-implemented method of claim 1, wherein the operation corresponding to one of the differences comprises substitution.

18. The computer-implemented method of claim 1, wherein the operation corresponding to one of the differences comprises insertion.

19. (canceled)

20. A computer program product comprising a non-transitory computer readable medium including codes for causing a computer to:

determine, based upon comparison of a plurality of data sequences to be compressed, a reference sequence wherein the reference sequence includes data from the plurality of data sequences;

process the plurality of data sequences relative to the reference sequence in order to determine differences between the plurality of data sequences and the reference sequence;

create, using the reference sequence and based upon the differences, a plurality of first delta representations of the plurality of data sequences wherein, for each of the differences, ones of the plurality of first delta representations specify a corresponding position in the reference sequence, an operation, and a value associated with the operation;

define a modified reference sequence by modifying the reference sequence;

generate, using the modified reference sequence, a plurality of second delta representations of the plurality of data sequences;

store at least one of the plurality of first delta representations and the plurality of second delta representations in a database contained within a memory, thereby representing the plurality of first delta representations and the plurality of second delta representations in a compressed format and conserving resources of the memory; and

transmit at least one of the plurality of first delta representations and the plurality of second delta representations from the memory.
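
The determining, processing, and creating steps of claim 20 might be sketched as below, purely as an illustration. The sketch picks as the reference the member of the set with the fewest total differences to the others, which is only one assumed way of determining a reference from a comparison of the sequences. Differences are found with Python's general-purpose `difflib.SequenceMatcher`, which is a text-diff tool rather than a biological aligner. The names `make_delta`, `apply_delta`, and `choose_reference` are hypothetical.

```python
from difflib import SequenceMatcher

def make_delta(reference, sequence):
    """Return (position, operation, value) tuples; positions index the reference."""
    matcher = SequenceMatcher(None, reference, sequence, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            delta.append((i1, "sub", sequence[j1:j2]))
            continue
        # Insertions and unequal-length replacements: emit the inserted text
        # first, then the deletion, so the decoder's cursor only moves forward.
        if j2 > j1:
            delta.append((i1, "ins", sequence[j1:j2]))
        if i2 > i1:
            delta.append((i1, "del", i2 - i1))
    return delta

def apply_delta(reference, delta):
    """Rebuild the original sequence from the reference and its delta."""
    out, cursor = [], 0
    for pos, op, value in delta:
        out.append(reference[cursor:pos])  # copy unchanged reference text
        cursor = pos
        if op == "sub":
            out.append(value)
            cursor += len(value)           # skip the substituted characters
        elif op == "del":
            cursor += value                # skip the deleted characters
        elif op == "ins":
            out.append(value)              # add text absent from the reference
    out.append(reference[cursor:])
    return "".join(out)

def choose_reference(sequences):
    """Pick the sequence whose deltas to all sequences need the fewest operations."""
    return min(sequences,
               key=lambda cand: sum(len(make_delta(cand, s)) for s in sequences))

sequences = ["ACGTACGT", "ACGTTACGT", "ACGACGT", "ACGTACGA"]
reference = choose_reference(sequences)
deltas = [make_delta(reference, s) for s in sequences]
restored = [apply_delta(reference, d) for d in deltas]
print(reference, restored == sequences)
```

Because every sequence can be rebuilt from the reference and its delta, only the reference and the deltas need to be stored or transmitted.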

21. A computer-implemented method for compressing sequence data in a system including at least a processor and a memory containing a database, the method comprising:

determining, based upon comparison of a plurality of data sequences prior to compression of the plurality of data sequences, a reference sequence wherein the reference sequence includes data from the plurality of data sequences;

processing the sequence data relative to the reference sequence wherein the processing includes determining differences between the sequence data and the reference sequence;

receiving metadata relating to the sequence data;

generating, based upon the processing, a first delta representation of the sequence data, wherein the first delta representation includes compressed sequence data representative of the sequence data and embedded data representative of one or more characteristics of the sequence data, wherein the embedded data includes the metadata, and wherein the generating includes specifying, for each of the differences, a corresponding position in the reference sequence, an operation, and a value associated with the operation;

defining a modified reference sequence by modifying the reference sequence;

generating, using the modified reference sequence, a second delta representation of the sequence data;

storing the first delta representation and the second delta representation in a database, thereby representing the sequence data in a compressed format and conserving resources of the processor and the memory; and

generating a display on a display device using at least one of the first delta representation and the second delta representation.
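
One possible way to store a delta representation together with its embedded metadata, as recited in claim 21, is sketched below using an in-memory SQLite database and JSON serialization. The table layout, the `store_delta` and `load_delta` helpers, the sample identifier, and the metadata fields are all assumptions made for illustration and are not taken from the disclosure.

```python
import json
import sqlite3

def store_delta(conn, sample_id, ops, metadata):
    """Serialize the delta operations and embedded metadata into one record."""
    record = json.dumps({"ops": ops, "metadata": metadata})
    conn.execute(
        "INSERT INTO deltas (sample_id, record) VALUES (?, ?)",
        (sample_id, record),
    )

def load_delta(conn, sample_id):
    """Fetch and decode the stored record for one sample."""
    row = conn.execute(
        "SELECT record FROM deltas WHERE sample_id = ?", (sample_id,)
    ).fetchone()
    return json.loads(row[0])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deltas (sample_id TEXT PRIMARY KEY, record TEXT)")

# Each operation is [position, operation, value]; lists are used because
# JSON has no tuple type. The metadata values are placeholders.
store_delta(conn, "sample-1", [[3, "sub", "A"]], {"source": "hypothetical lab"})
loaded = load_delta(conn, "sample-1")
print(loaded)
```

Keeping the metadata in the same record as the operations means a single lookup returns both the compressed sequence data and its associated characteristics.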

22-55. (canceled)