DNA Sequence Assembly Methods of Short Reads
Certain embodiments of the invention provide systems and methods for the automated assembly of DNA sequence data into contiguous DNA segments using a computer system. DNA sequence data is entered into the system. The system indexes and groups a plurality of DNA fragment reads utilizing an anchor sequence and consolidates the fragments into larger sequences by merging the fragment reads within a group.
This application is based on U.S. Provisional Patent Application No. 61/046,632, filed Apr. 21, 2008, on which priority of this patent application is based and which is hereby incorporated by reference in its entirety.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of bioinformatics, specifically to the automated alignment and merging of short DNA fragments, and the assembly of these fragments into larger DNA molecules.
2. Description of Related Art
The field of bioinformatics involves the practice of sequence assembly, which refers to the aligning and merging of smaller fragments of a much larger DNA sequence in order to reconstruct that original large sequence. Current sequencing technology does not allow the sequencing of very large DNA fragments. Instead, smaller pieces, generally between 20 and 1000 bases, are sequenced and then merged.
The problem of sequence assembly can be compared to passing multiple copies of a book through a shredder and then attempting to piece a single copy of the book back together from only shredded pieces. The resulting book may have many repeated paragraphs while some shreds may be modified to have typos. Excerpts from another book may be added in and some shreds may be completely unrecognizable.
Current sequencing techniques rely on breaking large DNA fragments into small fragments which are then individually sequenced. This procedure is performed in a redundant or overlapping manner that maximizes the likelihood that all portions of the larger DNA fragments are sequenced one or more times by the sequencing of the overlapping small fragments. This process results in a logical, or computational, problem in that the sequences of the small fragments must be assembled or aligned into larger pieces, which are then assembled into still larger pieces in order to create the entire DNA sequence of the large fragment sought to be sequenced.
DNA is a biochemical polymer made up of monomers referred to as “bases” which are conventionally represented by one of four letters, A, T, C, or G. As used herein, the small piece of DNA which is subjected to actual biochemical analysis to determine its base sequence is referred to as a “fragment,” and the data representing the DNA sequence generated from each fragment is referred to as a “fragment read”. Again, in the overall sequencing process, fragment reads are created which are redundant or overlapping to cover most or all sections of the larger DNA piece from which the overlapping was created. The fragment reads must be aligned into one or more contiguous larger segments, such a larger segment being referred to here as a “contig”. The overall layout of fragments into contigs is used to determine the sequence of large fragments of DNA. This process is referred to here as “fragment assembly”.
Because DNA is a polymer, it is common to refer to DNA pieces using the nomenclature of polymers. Hence, the terminology “mer” is used to refer to a sequence of bases in a fragment read. In the conventional terminology used, “mer” refers to a sequence of any length and, when prefixed with the number, is used to refer to a sequence of defined length. Thus a 20-mer is a portion of DNA 20 bases in length.
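The "n-mer" terminology above can be illustrated with a minimal sketch that lists every mer of a given length in a fragment read. The function name `mers` and the sample read are hypothetical and are not taken from the described system.

```python
def mers(read: str, n: int) -> list[str]:
    """Return every n-mer (substring of n bases) in a read, left to right."""
    # A read shorter than n contains no n-mers, so the range is empty.
    return [read[i:i + n] for i in range(len(read) - n + 1)]


# Hypothetical 6-base read; it contains three overlapping 4-mers.
print(mers("ATCGGA", 4))  # ['ATCG', 'TCGG', 'CGGA']
```

A read of length L therefore contains L - n + 1 overlapping n-mers, any of which could serve as a candidate index key.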
Technological development of sequencing continues to improve. The Solexa™ technology is available and heavily used to generate roughly 100 million reads per day on a single sequencing machine. Compare this to the 35 million reads of the human genome project, which needed several years to be produced on hundreds of sequencing machines. The downside is that these reads have a length of only 36 bases. This makes sequence alignment an even more daunting task.
SUMMARY OF THE INVENTION
In one aspect, the current invention relates to a method for automated assembly of DNA sequence data that includes DNA fragment reads into contiguous DNA segments using a computer system with processing and information storage capabilities, the method including the steps of: entering into the computer storage information representing the DNA sequence data from a plurality of DNA fragment reads; indexing the fragment reads using an anchor sequence, the anchor sequence an occurrence of a mer of length n, whereby a fragment read is indexed by at least one anchor sequence; grouping fragment reads according to said anchor sequence; and consolidating the grouped fragment reads into larger sequences by merging fragment reads within a group of fragment reads. In another aspect, this method further includes the steps of: grouping fragment reads grouped according to an anchor sequence into further subgroups according to a similar shoulder sequence; and matching sequence reads within each subgroup thereby creating assemblies of said sequence reads within each respective subgroup. In an additional aspect, the method also includes the step of elongating at least one fragment read by pooling consolidated regions of indexed areas of said fragment read to assemble the fragment reads into contiguous segments of DNA sequence.
In one embodiment of the inventive method, the average read length is increased in the range of 1.4-1.6, and the Indels and/or single nucleotide polymorphisms (SNPs) are preserved. In a further embodiment, the method includes the step of aligning an elongated fragment read to a user defined sequence read to determine SNP and Indels. In another embodiment, the step of grouping fragment reads includes scanning the fragment reads to pick from the mers occurring in each fragment read at least one n-mer, and storing in said computer storage file a fragment read having said n-mer occurrence therein. In a further embodiment, low frequency errors are eliminated, total read counts are reduced and consensus sequence errors are reduced, below 0.5%. In a further embodiment, the anchor sequence is 12 bases.
In one aspect, the present invention pertains to a sequence assembly system for transforming DNA sequence information from DNA fragment reads into contigs of contiguous DNA sequences, including a computer processor, memory, and data storage devices, the memory having programming instructions to operate the computer processor to consolidate a set of fragment reads. In a further aspect, the computer processor outputs to a display a user interface window and the window further displays one or more of a whole genome pane, an aligned sequence pane, and a consensus sequence pane.
In a further embodiment of the inventive system, the programming instructions are operable to: store information representing the DNA sequence data from a plurality of DNA fragment reads; index the fragment reads using an anchor sequence, the anchor sequence an occurrence of a mer of length n, whereby a fragment read is indexed by at least one anchor sequence; group fragment reads according to said anchor sequence; group fragment reads grouped according to an anchor sequence into further groups according to a similar shoulder sequence; and consolidate the grouped fragment reads into larger sequences. In a further aspect, the system further includes the computer processor to output to the display a user preferences window. The preferences include choices to programmatically control the processor of said assembly system with rules, the rules comprising: Counts Selection Rules; Directional Limitations; Shoulder Selection Rules; and 454 Jumping Rules. In one aspect, the rules include a dynamically adjustable anchor sequence, and the 5′ ends of the fragment reads are given more statistical weight than the 3′ ends.
In an additional embodiment of the invention, the 454 jumping rules further includes slicing a fragment read into multiple sections, where a section includes at least 12-mer fragments and the fragment reads are sliced at the mer positions having greater than 2 homopolymers where the portions of the sequence without large homopolymers are conserved. In a further embodiment, the system also includes programming instructions operable to calculate a known Indel by aligning a consolidated and elongated fragment read with a known reference sequence to determine Indel location. The system may further include programming instructions operable to calculate a known Indel by aligning a consolidated and elongated fragment read with a known reference sequence to determine SNP location.
An additional aspect of the invention comprises a set of computer programming instructions embodied on a computer readable medium for execution on a computer processor having programming instructions thereon for sequence assembly transforming DNA sequence information from DNA fragment reads into contigs of contiguous DNA sequences, comprising instructions operable to consolidate and elongate a set of fragment reads.
In a further embodiment, the present invention pertains to a sequence assembly system for transforming DNA sequence information from DNA fragment reads into contigs of contiguous DNA sequences, that includes: an arrangement for entering into the computer storage information representing the DNA sequence data from a plurality of DNA fragment reads; an arrangement for indexing the fragment reads using an anchor sequence, the anchor sequence an occurrence of a mer of length n, whereby a fragment read is indexed by at least one anchor sequence; an arrangement for grouping fragment reads according to said anchor sequence; and an arrangement for consolidating the grouped fragment reads into larger sequences by merging fragment reads within a group of fragment reads.
With reference to
Computing device 12 also can include an input/output portion 16 containing communications connection(s) that allow the computing device 12 to communicate with other devices and/or networks via an interface 24. Interface 24 can include a wireless interface, a hard-wired interface, or a combination thereof. Input/output portion 16 also can include and/or utilize communication media. Communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limiting, communication media includes wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. The term computer readable media as used herein includes both storage media and communication media. Input/output portion 16 also can comprise and/or utilize an input device(s) such as a keyboard, a mouse, a pen, a voice input device, a touch input device, or the like, for example. An output device(s) such as a display, speakers, printer, or the like, for example, also can be included.
Display portion 20 comprises a portion 26 for rendering a DNA sequence assembly window or a portion thereof.
The flow chart of
At step 14, the first indexed read from the index file of step 12 is selected. The fragment reads are then checked to determine if the anchor is present. The anchor sequence can be present in the forward direction as well as the complementary reverse direction. If the anchor is present, the fragment read is grouped. In the preferred embodiment, a group is formed by clustering all fragment reads having the anchor into a file in memory or storage. However, other types of indexing using techniques such as flagging the fragment read in a database can be used. In step 16, the number of reads, both forward and reverse, must meet the requirements stored in the system to trigger its usage as a viable cluster. At step 18, all of the fragment reads that contain the anchor sequence are now clustered together and the clusters can be further limited by subgrouping based on homologous shoulder sequences. Often, many of the reads within the cluster contain homologous shoulders, namely, common mers both upstream and downstream of the anchor sequence. The fragment read clusters can be shouldered by these linking shoulder regions into groups of similarity. At step 20, consolidation takes place. Consolidation includes both condensation and elongation. During the grouping, the original fragment reads were partitioned based on the anchor sequence and then were subdivided again based on the shoulder regions. For example, two different groups can be formed based on the shoulder regions further subdividing the first group. At this point, a consensus sequence can be generated for each group, almost doubling the read lengths. Using the condensed sequences, or consensus sequences, an elongation can lengthen the individual fragment reads. At step 20, the 5′ end is given more weight than the 3′ end statistically because of the higher reliability of the 5′ end. At step 20, an output of a consensus sequence, an elongated fragment read, and/or an error corrected fragment read occurs.
At step 22, the processor gathers the next anchor sequence and repeats the process back to step 14 as long as the end of the index table has not been reached.
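The grouping and consolidation steps described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the patented implementation: the anchor is only 4 bases (the text mentions 12), only a downstream shoulder is used, the consensus is a plain per-column majority vote without 5′/3′ weighting, and all function names and reads are hypothetical.

```python
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def group_by_anchor(reads, anchor):
    """Collect (anchor_offset, oriented_read) for reads containing the anchor.

    Each read is tried in the forward direction first, then as its
    reverse complement, mirroring the two directions named in the text.
    """
    group = []
    for read in reads:
        for oriented in (read, reverse_complement(read)):
            pos = oriented.find(anchor)
            if pos >= 0:
                group.append((pos, oriented))
                break
    return group


def subgroup_by_shoulder(group, anchor, shoulder_len):
    """Split an anchor group by the bases immediately downstream of the anchor."""
    subgroups = defaultdict(list)
    for pos, read in group:
        end = pos + len(anchor)
        subgroups[read[end:end + shoulder_len]].append((pos, read))
    return dict(subgroups)


def consensus(group):
    """Line reads up on the anchor start and take a majority vote per column."""
    left = max(pos for pos, _ in group)
    columns = defaultdict(Counter)
    for pos, read in group:
        for i, base in enumerate(read):
            columns[left - pos + i][base] += 1
    return "".join(columns[i].most_common(1)[0][0] for i in sorted(columns))


# Hypothetical 10-base reads; the third matches only as a reverse complement,
# and the fourth shares the anchor but has a different shoulder.
reads = ["ACCTGGCATT", "CTGGCATTAG", "GCTAATGCCA", "AAGGCACGTA"]
group = group_by_anchor(reads, "GGCA")
subgroups = subgroup_by_shoulder(group, "GGCA", shoulder_len=2)
print(sorted(subgroups))           # ['CG', 'TT']
print(consensus(subgroups["TT"]))  # ACCTGGCATTAGC
```

In this toy case three overlapping 10-base reads condense into a single 13-base consensus, which is the elongation effect the text describes; the read with the "CG" shoulder is kept apart rather than merged.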
Sequence: AAAAAAAAAACG (SEQ ID NO: 5)
Base:        A  A  A  A  A  A  A  A  A  A  C  G
Score (×⅛):  1  0  0  0  0  0  0  0  0  1  8  8
The total score is: (1+1+8+8)/8=2.25.
This means that the two nucleotides are meaningful based on the periodical index score being less than or equal to one for rejections. Setting 540 indicates condensation for forward reads in the left direction and reverse reads in the right direction; the consensus sequences are therefore taken only from the 5′ and anchor sequences, and the 3′ sequences are not used to construct a consensus sequence. This is useful because 3′ sequences can be error prone. Setting 544 indicates that low quality ends should be cleaned, trimming the ends if the quality is poor. For example, suppose a nucleotide is covered by two reads, one forward and one reverse, where the quality score of the 5′ end nucleotide is 7 and the quality score of the 3′ end nucleotide is 2. If both reads make the same call, A, the score is 9, which is calculated by adding the two quality scores of 7 and 2. However, if the forward call is A and the reverse call is G, then the consensus will be A with a score of 5, which is calculated by subtracting 2 from 7. Setting 546 has a check box to indicate repeat index with forward and reverse only, to pick up reads using only the repeat index and shoulder sequence, forward only or reverse only, to condense the reads. A repeat sequence of (ACAC)N can cover more than 40 BPS, which is longer than the read length of 36 BPS. It is therefore otherwise impossible to condense reads for an AC repeat longer than the read length. However, with setting 546, the sequence assembly system can pick up the reads using only the repeat index and the shoulder sequence, forward only or reverse only, to condense the reads.
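The quality score arithmetic in the forward/reverse example can be written out as a small sketch. The function name `merge_calls` and the handling of equal scores are assumptions for illustration; the text specifies only the two cases computed below.

```python
def merge_calls(forward, reverse):
    """Combine a forward and a reverse base call, each given as (base, quality).

    Agreement adds the quality scores; disagreement keeps the higher-quality
    call and subtracts the lower score from the higher one.
    """
    (f_base, f_q), (r_base, r_q) = forward, reverse
    if f_base == r_base:
        return f_base, f_q + r_q
    # Assumption: on a tie the forward call is kept (score becomes 0).
    if f_q >= r_q:
        return f_base, f_q - r_q
    return r_base, r_q - f_q


print(merge_calls(("A", 7), ("A", 2)))  # ('A', 9)
print(merge_calls(("A", 7), ("G", 2)))  # ('A', 5)
```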
The 454 Jumping Rule setting 532 is used to separate reads which are known to have many errors in homopolymers. Programming instructions on the processor break the fragment reads at positions where a homopolymer run is longer than 2 bases, such as AAA, CCCCC, GGG, or TTTT. The portions of the sequence without large homopolymers will be conserved. In the 454 Jumping Rule method, the anchor is greater than 12 BPS. The anchor sequences are underlined in the following example:
The example will show the key words, CCTCCGGCAGGA (SEQ ID NO: 8) with shoulders NNNNNNNNNNNN and ATGAGGCAAGTC (SEQ ID NO: 9); keyword ATGAGGCAAGTC (SEQ ID NO: 9) with shoulders CCTCCGGCAGGA (SEQ ID NO: 8) and CTGGCCGAGGAG (SEQ ID NO: 10); keyword CTGGCCGAGGAG (SEQ ID NO: 10) with shoulders CTGGCCGAGGAG (SEQ ID NO: 10), CATGGCCATCAT (SEQ ID NO: 11); keyword AGGAGGAGGAGA (SEQ ID NO: 12) with shoulders CTGGCCGAGGAG (SEQ ID NO: 10) and CATGGCCATCAT (SEQ ID NO: 11); the pattern repeats. One read will be used multiple times in condensation. The step length is dependent on the local sequence and it is flexible. All of the reads in a group having one keyword and either or both of the shoulders will be aligned to keywords. Shoulders are found in the neighboring keywords or in the same section when the length of the keyword is long. The Indel errors are ignored. Condensation will remove any Indel and sequence errors. If there is any error in the index of keywords, this read will not be used in the condensation. The next keyword or previous keyword will most likely include the sample.
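The slicing step of the 454 Jumping Rule can be sketched as below, assuming that a "large homopolymer" means a run of three or more identical bases and that sections shorter than 12 bases are discarded. The function name and the concatenated test read are hypothetical; only the two 12-base keywords are taken from the example above.

```python
import re

# Runs of three or more identical bases, e.g. AAA, CCCCC, GGG, TTTT.
HOMOPOLYMER = re.compile(r"A{3,}|C{3,}|G{3,}|T{3,}")


def split_at_homopolymers(read: str, min_len: int = 12) -> list[str]:
    """Cut a read at long homopolymer runs and keep sections of at least min_len bases."""
    return [section for section in HOMOPOLYMER.split(read) if len(section) >= min_len]


# Hypothetical read: two keywords separated by TTTT, then GGG and a short tail.
read = "CCTCCGGCAGGA" + "TTTT" + "ATGAGGCAAGTC" + "GGG" + "ACGT"
print(split_at_homopolymers(read))  # ['CCTCCGGCAGGA', 'ATGAGGCAAGTC']
```

The homopolymer runs themselves are dropped, which reflects the idea that their lengths are unreliable, while the flanking sections remain usable as keywords and shoulders.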
With reference to
After the numbers of homopolymers are corrected, a count of the number of the keywords is conducted. The keywords are then sorted into a region containing similar sequences. If two of the sequences differ by 1 BPS and the frequency of one is less than ⅛th of the other, or is only 1 or 2 counts, the low frequency keyword will be corrected to the high frequency keyword. The original sequences are modified to the new sequences. After errors are corrected, the software is able to assemble sequences three times longer than the original reads.
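The keyword correction rule can be sketched as follows. The ⅛ ratio and the 1-or-2-count cutoff come from the text; the function names, the requirement that the target keyword be strictly more frequent, the choice of the most frequent neighbour when several qualify, and the sample counts are assumptions for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def correct_keywords(counts, ratio=1 / 8, max_low=2):
    """Map each low-frequency keyword to a high-frequency keyword one base away."""
    corrections = {}
    for low, low_n in counts.items():
        for high, high_n in counts.items():
            if len(low) != len(high) or hamming(low, high) != 1:
                continue
            rare = low_n < ratio * high_n or low_n <= max_low
            if high_n > low_n and rare:
                best = corrections.get(low)
                # Prefer the most frequent qualifying neighbour.
                if best is None or high_n > counts[best]:
                    corrections[low] = high
    return corrections


# Hypothetical keyword counts.
counts = {
    "ACGTACGTACGT": 80,
    "ACGTACGTACGA": 5,   # one base away from the first, 5 < 80/8
    "ACGTACGTACCT": 30,  # one base away, but too frequent to be an error
    "TTGTACGTACGT": 40,  # two bases away, left untouched
}
print(correct_keywords(counts))  # {'ACGTACGTACGA': 'ACGTACGTACGT'}
```

Keeping the 30-count variant uncorrected is what allows a real SNP to survive while a 5-count variant is treated as a sequencing error.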
The sequence assembly system can be used on the following three applications: Deep Sequence to quantify the rare allele frequency in human population; cDNA assembly using 6 cycles to assemble mRNA transcripts; and De novo assembly.
For deep sequence, the sequence error rate of Illumina™ reads is measured at about 2%. After condensation the sequence error rate reduces to 0.1%. The sequence error reduction allows the frequency of rare alleles to be measured. In one example, data sets were pooled from 364 patients and 360 normal people from the ABCA1 gene spanning 150 kb with 50 exons. Each of the two samples is measured in two lanes as replicates. About 7.2 M reads occur in each lane. The coverage is about 8000×. The replicate measurements allow determination of the system linearity, using one normal as control and patient sample, when the frequency of all of the mutations is measured. The allele frequency is determined from the condensed reads. The numbers of raw reads associated with the condensed reads are used to determine the allele frequency. Four thousand one-hundred and eighty-three (4,183) mutation calls were made after aligning the raw reads, which is reduced to just 77 mutation calls after aligning condensed reads.
cDNA sequences are measured with the Illumina™ Genome Analyzer. There are about 14 million reads of 36 bps. The first cycle of condensation results in 6.9 M reads of 53 bps. The second cycle gives reads of about 100 bps. The software is able to generate sequences of 1500 bps after 6 condensation cycles. We are able to assemble these into 27000 mRNA transcripts.
De novo sequences with the short reads from genome analyzers produce an additional layer of complexity for assembly. The sequence assembly system was developed to assist with these complex assembly issues. After using the sequence assembly system tools to remove low quality reads and trim low quality bases from reads in order to improve assembly accuracy and using the software condensation tool to statistically polish the short reads, correcting additional errors and simultaneously lengthening the reads, the sequence assembly system can more reliably assemble the short reads into contigs of 500 bps to upwards of 50 kbps. Additionally, the original short reads used to generate each assembled contig are recorded to show the copy number and Indel positions. The sequence assembly system is capable of detecting Indels of 1-30 bps. The sequence assembly system statistically polishes datasets of adequate coverage to remove random sequencing errors and increase read lengths. Repeating the condensation removes systematic errors and further lengthens the sequence reads with each additional cycle. The polished and elongated reads can be assembled into large contigs while removing redundant reads. Once the dataset has been cleaned to remove low quality reads and ends, the remainder of the process is fully automated through the use of software that guides the user through the project configuration.
As an example of de novo assembly, a sample of the K-12 DH10B strain of E. coli that was sequenced with the SOLiD System™ was assembled using the sequence assembly system. This dataset has high quality reads suggestive of about 30× coverage. These single reads have low quality reads removed, low quality ends trimmed, and the first color-call removed from reads using the sequence assembly system. The remainder of reads, those of high quality and reliability, were processed through the sequence assembly condensation and assembly system modules to generate larger color-space contigs. Since the DH10B genome is available, the assembled results were then aligned to this genome to validate the assembly results. Four cycles of the sequence assembly system condensation were performed followed by assembly of the condensed results to generate 3220 contigs. Over 85% of these contigs were larger than 500 bps, and the largest is over 16 Kbps. The assembled contigs were aligned to the DH10B reference genome to validate the assembly results. Only three of the assembled reads did not match to the DH10B genome, indicating that the contigs produced by the sequence assembly system are very reliable. In one example, 4,024,290 bases of the 4,686,137 base genome have coverage using the sequence assembly system as described above, approximately 86% coverage. When excluding the duplicated region of this reference genome that is larger than 100 kbps, coverage is 88%.
It will be readily appreciated by those skilled in the art that modifications may be made to the invention without departing from the concepts disclosed in the foregoing description. Accordingly, the particular embodiments described in detail herein are illustrative only and are not limiting to the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalents thereof.
Claims
1. A method for automated assembly of DNA sequence data comprising of DNA fragment reads into contiguous DNA segments using a computer system with processing and information storage capabilities, the method comprising the steps of:
- entering into the computer storage information representing the DNA sequence data from a plurality of DNA fragment reads;
- indexing the fragment reads using an anchor sequence, said anchor sequence an occurrence of a mer of length n, whereby a fragment read is indexed by at least two anchor sequences;
- grouping fragment reads according to said anchor sequence; and
- consolidating the grouped fragment reads into larger sequences by merging fragment reads within a group of fragment reads; and
- generating at least two consensus sequences using the at least two anchors from a read.
2. A method as claimed in claim 1, further comprising the steps of:
- grouping fragment reads grouped according to an anchor sequence into further subgroups according to a similar shoulder sequence; and
- matching sequence reads within each subgroup thereby creating assemblies of said sequence reads within each respective subgroup.
3. A method as claimed in claim 2, further including the step of elongating at least one fragment read by pooling consolidated regions of indexed areas of said fragment read to assemble the fragment reads into contiguous segments of DNA sequence.
4. A method as claimed in claim 3, wherein an average read length is increased in the range of 1.4-1.6.
5. A method as claimed in claim 3, further including preservation of Indels and/or SNPs.
6. A method as claimed in claim 5, further including the step of aligning an elongated fragment read to a user defined sequence read to determine SNP and Indels.
7. A method as claimed in claim 1, wherein the step of grouping fragment reads comprises scanning the fragment reads to pick from the mers occurring in each fragment read at least one n-mer, and storing in said computer storage file a fragment read having said n-mer occurrence therein.
8. A method as claimed in claim 6, wherein low frequency errors are eliminated.
9. A method as claimed in claim 6, wherein total read count is reduced.
10. A method as claimed in claim 1, wherein consensus sequence errors are reduced below 0.5%.
11. A method as claimed in claim 2, wherein the anchor sequence is 12 bases.
12. A sequence assembly system for transforming DNA sequence information from DNA fragment reads into contigs of contiguous DNA sequence, the system comprising a computer processor, memory, and data storage devices, the memory having programming instructions to operate the computer processor to consolidate a set of fragment reads.
13. A system as claimed in claim 12, further comprising said computer processor to output to the display a user interface window, said window further displaying one or more of a whole genome pane, an aligned sequence pane, and a consensus sequence pane.
14. A system as claimed in claim 12, wherein the programming instructions are operable to:
- store information representing the DNA sequence data from a plurality of DNA fragment reads;
- index the fragment reads using an anchor sequence, said anchor sequence an occurrence of a mer of length n, whereby a fragment read is indexed by at least two anchor sequences;
- group fragment reads according to said anchor sequence;
- group fragment reads grouped according to an anchor sequence into further groups according to a similar shoulder sequence;
- consolidate the grouped fragment reads into larger sequences and
- generate at least two consensus sequences using the at least two anchors from a read.
15. A system as claimed in claim 14, further comprising said computer processor to output to the display a user preferences window, said preferences including choices to programmatically control said processor of said assembly system with rules, said rules comprising:
- Counts Selection Rules;
- Directional Limitations;
- Shoulder Selection Rules; and
- 454 jumping rules.
16. A system as claimed in claim 15, wherein the rules include an anchor sequence dynamically adjustable.
17. A system as claimed in claim 16 wherein the 5′ end is given more statistical weight than the 3′ ends of the fragment reads.
18. A system as claimed in claim 12, wherein 454 jumping rules further comprises slicing a fragment read into multiple sections, wherein a section includes at least 12-mer fragments and the fragment reads are sliced at the mer positions having greater than 2 homopolymers wherein the portions of the sequence without large homopolymers are conserved.
19. A system as claimed in claim 12, further including programming instructions operable to calculate a known Indel by aligning a consolidated and elongated fragment read with a known reference sequence to determine Indel location.
20. A system as claimed in claim 12, further including programming instructions operable to calculate a known Indel by aligning a consolidated and elongated fragment read with a known reference sequence to determine SNP location.
21. A set of computer programming instructions embodied on a computer readable medium for execution on a computer processor having programming instructions thereon for sequence assembly transforming DNA sequence information from DNA fragment reads into contigs of contiguous DNA sequence, the computer program instructions comprising instructions operable to consolidate and elongate a set of fragment reads.
22. A sequence assembly system for transforming DNA sequence information from DNA fragment reads into contigs of contiguous DNA sequence, the system comprising:
- a means for entering into the computer storage information representing the DNA sequence data from a plurality of DNA fragment reads;
- a means for indexing the fragment reads using an anchor sequence, said anchor sequence an occurrence of a mer of length n, whereby a fragment read is indexed by at least two anchor sequences;
- a means for grouping fragment reads according to said anchor sequence; and
- a means for consolidating the grouped fragment reads into larger sequences by merging fragment reads within a group of fragment reads;
- a means for generating at least two consensus sequences using two anchors from a read.
Type: Application
Filed: Apr 21, 2009
Publication Date: Dec 24, 2009
Patent Grant number: 8271206
Applicant: SOFTGENETICS LLC (State College, PA)
Inventors: Changsheng Jonathan Liu (State College, PA), Yiqiong Wu (Fayetteville, AR), Kevin Jay LeVan (State College, PA)
Application Number: 12/427,409
International Classification: C40B 50/02 (20060101); G06F 3/048 (20060101); G06F 12/00 (20060101);