System and Method for Consensus-Calling with Per-Base Quality Values for Sample Assemblies
The present teachings disclose a method for evaluation of a polynucleotide sequence using a consensus-based analysis approach. The sequence analysis method utilizes quality values for a plurality of aligned sequence fragments to identify consensus basecalls and calculate associated consensus quality values. The disclosed method is applicable to resolution of single nucleotide polymorphisms, mixed-base sequences, heterozygous allelic variants, and heterogeneous polynucleotide samples.
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a divisional of application Ser. No. 10/279,746, filed Oct. 23, 2002, which claims a priority benefit to Provisional Application No. 60/336,278, filed Oct. 25, 2001 and Provisional Application No. 60/396,240, filed Jul. 15, 2002, all of which are incorporated herein by reference.
The present teachings generally relate to nucleic acid analysis, and in various embodiments, to a system and methods for sequence data processing and consensus sequence analysis.
2. Description of the Related Art
Advances in automated nucleic acid sequence analysis have revolutionized the fields of cellular and molecular biology. As a result, it is now feasible to sequence whole genomes as is evidenced by the completion of sequencing the 3-billion-base human genome. When using automated systems, it is important to maintain a high degree of accuracy with respect to the identification of individual nucleotide bases. Oftentimes, base identification is predicated upon raw data obtained from electrophoretic and/or chromatographic information which is resolved to identify each base within a sequence undergoing analysis. Numerous factors may affect this analysis including, for example, the base composition of the sequence, experimental and systematic noise, migration anomalies (compressions and stretches), variations in observed signal strength for the detected bases, and variations in reaction efficiencies. The presence of mixed-bases in a sample may present further difficulties for conventional systems to properly resolve and identify. Mixed-bases may be representative of sequence variants contained within a sample and may arise from allelic variation or genetic heterozygosity. Mixed-bases may also represent regions within a sample sequence where more than one putative base can be identified. Conventional systems may overlook or erroneously identify these regions thereby degrading the accuracy of the sequence analysis. As a result, there is a need for an improved methodology by which mixed-bases can be identified and evaluated.
In various embodiments, the methods described herein desirably make use of per-base quality values for input sequence fragments and generate a single output quality value for each consensus basecall. In generating this basecall, the methods take into account factors which may undesirably skew many conventional quality value assessment routines. As will be described in greater detail hereinbelow, one method by which such problems are addressed by the present teachings incorporates the use of differential quality value assessment and weighted basecall voting.
An additional feature of the present teachings involves the ability to resolve mixed-basecalls using information from the aligned sequence fragments. Rigorous mixed-basecall resolution may be desirable as it may improve the quality of the output consensus sequence and remove uncertainty related to the presence of mixed-basecalls in the consensus sequence. Furthermore, unlike many conventional methods which may lack the ability to perform detailed analysis on mixed-basecalls, the present teachings may be used to improve mixed-basecall analysis and identify features such as sequence heterozygotes, single nucleotide polymorphisms, and heterogeneous sequence mixtures.
In one aspect, the invention comprises an analysis method for evaluating the composition of a sample sequence comprising a plurality of nucleotide bases, the method further comprising: Receiving assembly information for at least one sequence fragment wherein the assembly information comprises a plurality of putative basecalls spanning at least a portion of the sample sequence; Resolving the assembly information to thereby align the putative basecalls; Identifying quality values associated with at least one of the putative basecalls; and Generating a consensus basecall and a consensus quality value for each of the aligned putative basecalls wherein the consensus basecall corresponds to a predicted base within the sample sequence and the consensus quality value corresponds to a calculated degree of confidence in the consensus basecall obtained, in part, from the quality values for the at least one putative basecalls.
In another aspect, the invention comprises a basecalling method for predicting the composition of a polynucleotide sequence, the method further comprising: Receiving information for a plurality of sequence fragments comprising basecalls spanning at least a portion of the sample sequence and corresponding quality values indicative of a calculated degree of confidence in the basecalls; Aligning the plurality of sequence fragments to identify regions of basecall overlap between the sequence fragments; and Calculating a consensus basecall and a consensus quality value for the regions of basecalling overlap in the aligned sequence fragments wherein the consensus quality values are determined using the quality values for the basecalls of the sequence fragments.
In still another aspect, the invention comprises a method for basecall resolution during sequence analysis, the method further comprising: Comparing a plurality of initial basecalls corresponding to one or more overlapping sequence fragments; Identifying agreement between the initial basecalls as a consensus basecall; Performing a re-calling operation when there is a lack of agreement between the initial basecalls using a more rigorous basecalling routine to generate one or more stringent basecalls and thereafter identifying agreement between the initial basecalls and stringent basecalls as the consensus basecall; Performing a weighted vote for those basecalls that lack agreement between the initial basecall and the stringent basecall to identify the consensus basecall as the basecall with the greatest weighted vote; and Determining a quality value for the consensus basecall.
In a still further aspect, the invention comprises a system for predicting the composition of a polynucleotide sequence, the system further comprising: A sample processing module that receives information for a plurality of sequence fragments, the sample processing module providing functionality for identifying basecalls spanning at least a portion of the sample sequence and corresponding quality values indicative of a calculated degree of confidence in the basecalls; A specimen processing module that assembles the plurality of sequence fragments to identify regions of basecall overlap between the sequence fragments; and A project processing module that calculates a consensus basecall and a consensus quality value for the regions of basecalling overlap in the assembled sequence fragments wherein the consensus quality values are determined using the quality values for the basecalls of the sequence fragments.
In yet another aspect, the invention comprises a system for basecall resolution comprising: At least one module which provides functionality for comparing a plurality of initial basecalls corresponding to one or more overlapping sequence fragments so as to identify agreement between the initial basecalls to generate a consensus basecall; the at least one module further used to perform a re-calling operation when there is a lack of agreement between the initial basecalls using a more rigorous basecalling routine to generate one or more stringent basecalls and thereafter identifying agreement between the initial basecalls and stringent basecalls as the consensus basecall; the at least one module further performing a weighted vote for those basecalls that lack agreement between the initial basecall and the stringent basecall to identify the consensus basecall as the basecall with the greatest weighted vote and thereafter determining a quality value for the consensus basecall.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects, advantages, and novel features of the present teachings will become apparent upon reading the following detailed description and upon reference to the accompanying drawings. In the drawings, similar elements have similar reference numerals.
FIGS. 1A,B illustrate exemplary chromatograms for sample polynucleotides.
FIGS. 2A,B illustrate exemplary chromatograms having mixed-base features.
DESCRIPTION OF CERTAIN EMBODIMENTS
Reference will now be made to the drawings wherein like numerals refer to like elements throughout. As used herein, “target”, “target polynucleotide”, “target sequence” and “target base sequence” and the like refer to a specific polynucleotide sequence that is subjected to any of a number of sequencing methods used to determine its base composition (e.g. base sequence). The target sequence may be composed of DNA, RNA, analogs thereof, or combinations thereof. The target may further be single-stranded or double-stranded. In sequencing processes, the target polynucleotide that forms a hybridization duplex with a sequencing primer may also be referred to as a “template”. A template serves as a pattern for the synthesis of a complementary polynucleotide (Concise Dictionary of Biomedicine and Molecular Biology, (1996) CPL Scientific Publishing Services, CRC Press, Newbury, UK). The target sequence may be derived from any living or once living organism, including but not limited to prokaryote, eukaryote, plant, animal, and virus, as well as synthetic and/or recombinant target sequences.
Furthermore, as used herein, “sample assembly” and “assembly” refer to the reassembly or consensus analysis of smaller nucleotide sequences or fragments, arising from individually sequenced samples that may comprise at least a portion of a target sequence. By combining the information obtained from these fragments a “consensus sequence” may be identified that reflects the experimentally determined base composition for the target sequence.
Nucleic acid sequencing, according to the present teachings, may be performed using enzymatic dideoxy chain-termination methods. Briefly described, these methods utilize oligonucleotide primers complementary to sites on a target sequence of interest. For each of the four possible bases (adenine, guanine, cytosine, thymine), a mixed population of labeled fragments complementary to at least a portion of the target sequence may be generated by enzymatic extension of the primer. The fragments contained in each population may then be separated by relative size using electrophoretic methods, such as gel or capillary electrophoresis, to generate a characteristic pattern or trace. Using knowledge of the terminal base composition of the oligonucleotide primers along with the trace information generated for each reaction allows for the base sequence of the target to be deduced. For a more detailed description of sequencing methodologies the reader is referred to DNA sequencing with chain-terminating inhibitors, Sanger et al. (1977) and A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides, Prober et al. (1987).
The aforementioned sequencing methodology may be adapted to automated routines so as to permit rapid identification of sample sequence compositions. In one exemplary automated application, polynucleotide fragments corresponding to the sample sequence are labeled with fluorescent dyes to distinguish and independently resolve each of the four bases in a combined reaction. In one aspect, a laser tuned to the excitation wavelength of each dye may be used in combination with a selected electrophoretic resolving/separation method to generate a distinguishable signal for each base. A detector may then transform the emission or intensity signal information into a chromatographic trace representative of the composition of the sample sequence. The resulting data may then be processed by computerized methods to determine the base sequence for the sample. For a more detailed description of a conventional automated sequencing system the reader is referred to DNA Sequencing Analysis: Chemistry and Safety Guide ABI PRISM 377 (Applied Biosystems, CA) and ABI PRISM SeqScape™ Software Version 1.1 User Guide (Applied Biosystems, CA).
One confounding characteristic of many sequencing traces that may affect the basecalling accuracy of conventional systems is that for any given peak position, signals may be present which correspond to one or more of the bases. Thus, for a selected peak position 130, a plurality of signal components 140-143 may be observed which may correspond to a G-signal component 140, an A-signal component 141, a T-signal component 142, and/or a C-signal component 143. The intensity of each detected base component is related to many factors and may include noise and reaction efficiency variations.
In one aspect, when performing base identification or “basecalling” it is necessary to distinguish spurious or noise-related signals 145 from the base peaks 140 corresponding to the actual base present within the selected peak position 130. To this end, basecalling operations may incorporate functions that manipulate and/or normalize the chromatographic data to account for systematic and experimental variations to thereby aid in the resolution and identification of bases corresponding to each of the peaks at the selected peak positions.
When analyzing chromatographic traces, for each selected peak position 130 there may be two or more identifiable peaks indicating a single basecall may not be the appropriate basecall.
An important consideration when identifying mixed-bases within a sample sequence is that the actual base composition in the region where the mixed-base is observed may reflect a single base or more than one base. As shown in
Consensus sequence analysis is one method for improving the accuracy of basecalling, including discrimination in mixed-base regions, and may be used to assess multiple sample sequence fragments. In one aspect, consensus sequencing comprises an evaluation of redundant or overlapping sequence fragments that correspond to at least a portion of the sample sequence of interest. During this analysis, the results from the sequence fragments are included in a combined analysis wherein some overlapping or redundant sequence information may be obtained for the sample sequence. Using redundant sequence information in this manner provides a means to improve the validation of basecalls as compared to single sequence fragment analysis alone.
The various embodiments of the present teachings provide means for identification of the composition of mixed-base regions and may aid in distinguishing mixed-bases called in these regions from noise or other confounding factors. Additionally, the basecalling methods described herein may be used to resolve mixed-base sequences to identify one or more associated pure-bases. One factor which presents a problem for many conventional systems is the presence of sequencing chemistry noise. Noise of this type may appear in the electropherogram and may possess a similar signature to that of a mixed-base. Improper calling of such noise may result in the generation of undesirable false positive basecalls. In one aspect, false positive basecalls may comprise mixed-basecalls that are in actuality pure (single) bases. Various embodiments of the present teachings desirably avoid or reduce the number of false positive basecalls by comparing information from sequence fragments and associated quality values. Another issue that may complicate mixed-base identification is differential incorporation of dideoxy-nucleotides using enzymes such as sequencing polymerases. Differential incorporation may result in peaks which are not of equal height in the electropherogram. This problem may be further compounded by alleles that may be present in a sample in non-equivalent proportions (e.g. ratios greater than or less than 50:50). For example, two different sequence variants of an allele may be present in a ratio of 10:90. In such instances, the expected peaks for the corresponding base of the sequence variants may not be resolved or identified with a peak ratio of 10:90 as a result of differential incorporation. Using quality value analysis in the manner described herein, the present teachings may desirably improve the ability to distinguish bases and overcome these potential problems, which often confound conventional systems, resulting in improved basecalling accuracy.
In various embodiments, the system and methods described herein present a novel approach for determining the base sequence for a target polynucleotide based, in part, upon a consensus calling approach that utilizes per-base quality values along with a dedicated base discrimination method. As will be described in greater detail hereinbelow, this approach may improve basecalling accuracy as compared to that of conventional algorithms and may be readily adapted for use with software analysis packages including SeqScape™ sequence analysis software (Applied Biosystems, CA) and hardware sequencing instrumentation including ABI Prism® DNA Analyzers (Applied Biosystems, CA).
One method for analyzing electropherograms identifies bases of the sample fragments and the resultant assembled sample sequence while assigning quality values (QV) to each of the called bases. (See B. Ewing, et al., “Base Calling Of Automated Sequencer Traces Using Phred. I. Accuracy Assessment”, Genome Research, vol. 8(3), pp. 175-185 (1998); B. Ewing and P. Green, “Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities”, Genome Research, vol. 8(3), pp. 186-194 (1998)). In these basecalling procedures the quality value represents the measure of reliability for a given basecall and estimates the basecalling error. Generally represented, a probability value (P) may be defined as the probability that a particular basecall is incorrect with a quality value (QV) defined by the expression:
$$QV = -10 \log_{10} P \qquad \text{(Equation 1)}$$
According to this expression, lower quality values generally indicate a higher probability of basecall error and higher quality values indicate a greater degree of certainty for an accurate basecall. In various embodiments, the consensus-calling methods described herein improve quality value assessment by evaluating per-base quality values for one or more sequence fragments with respect to one another and generating a consensus quality value which, in one aspect, may be indicative of an overall or combined approximation of certainty for the basecalls associated with the sample sequence. It will be appreciated that these methods may be adapted for use with existing consensus-calling applications as well as raw and processed sequencing information obtained from numerous sources. Additionally, details of these quality value assessment and error probability estimation routines may be found in commonly assigned U.S. patent application Ser. No. 09/658,161 filed Sep. 8, 2000 and entitled: A system and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms, which is hereby incorporated by reference in its entirety.
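The relationship in Equation 1 can be illustrated with a short Python sketch. The function names are hypothetical and chosen only for illustration; the arithmetic follows Equation 1 and its algebraic inverse.

```python
import math


def error_probability_to_qv(p: float) -> float:
    """Equation 1: QV = -10 * log10(P), with P the probability that a basecall is wrong."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    return -10.0 * math.log10(p)


def qv_to_error_probability(qv: float) -> float:
    """Inverse of Equation 1: P = 10 ** (-QV / 10)."""
    return 10.0 ** (-qv / 10.0)
```

Under this relationship an error probability of 1 in 100 corresponds to a quality value of 20, and each additional 10 quality-value units corresponds to a tenfold reduction in the estimated error probability.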
It will be appreciated that although the aforementioned exemplary input 405 is illustrated as being received by the consensus-calling methods, at least a portion of this information may be optional or readily calculated by various integrated functions prior to consensus analysis and therefore need not be available prior to consensus analysis. Additionally, various operations may be utilized to convert the input data types and information (such as those produced by TraceTuner™ and Phred) into a form compatible with the consensus-calling methods.
As will be described in greater detail hereinbelow, once the desired input data 405 has been acquired and/or calculated, a series of consensus-calling operations 470 may be performed to produce output 475. This output may comprise information such as a consensus sequence 480 and a quality value assessment 490 for each base. In one aspect, the consensus sequence 480 represents the calculated sample sequence assembly which may include a predicted base sequence as well as gap information.
The mixed-base noise level 510 may be obtained using a sliding window averaging approach. In one aspect, the window size may be flexibly defined to accommodate various experimental data types and compositions. In various embodiments the window size ranges between approximately 5-20 base pairs in length. The local mixed-base noise level 510 generally reflects noise that may arise, for example, as a result of chemistry and/or instrumental noise which may be present in the electropherogram at each position. Furthermore, the mixed-base noise level 510 may be used to ascertain the quality of input mixed-basecalls derived from the sample assembly information 430. Additionally, mixed-base noise level determination may be used to supplement input per-base quality values 440 when evaluating the quality or confidence of the sequence data. An exemplary embodiment of the mixed-base noise assessment operation is provided in
One desirable feature of the averaging method for local quality value assessment 520 is that the resultant quality values calculated by this method may provide a smoother estimate of local electropherogram quality for a given position in a sample sequence. Rather than relying solely on quality value information directed towards individual bases, the averaging approach desirably incorporates additional information that helps take into account the local quality of bases in the sequencing trace relative to one another. In one aspect, use of the averaged quality values in the aforementioned manner provides improved input for the consensus calling operations. Furthermore, quality value assessment based on averaged values may provide greater confidence and accuracy when consensus basecalls are subsequently identified. An exemplary embodiment of the local quality value assessment operation is provided in
In various embodiments following the preprocessing operations 505 (if performed), a consensus call assignment routine 530 is performed. The consensus call assignment routine 530 incorporates a heuristic component which provides a basecalling functionality that simulates an expert approach to assessment and assignment of proper consensus calls and sequence assembly. Unlike other machine learning or classification procedures for consensus calling, the consensus call assignment routine 530 does not necessarily require large annotated data sets for training. Furthermore, while conventional machine learning or classification procedures may not be transparent to the user, resulting in potential complications in data review or program development, the consensus call assignment routine 530 is substantially user-transparent.
A further benefit of the consensus call assignment routine 530 results from the openness of the rule-based approach to consensus calling. As a result the methodology described herein is amenable to selective tuning using particular data sets and applications. Furthermore, development may be facilitated as the assignment routine 530 can be readily modified to accommodate other types of data sets and sample input. For example, it is conceived that the methods described herein may be adapted for use in sequence analysis involving protein or peptide samples in addition to nucleic acid samples.
As will be described in greater detail hereinbelow, the consensus call assignment routine 530 comprises a plurality of steps wherein the input sample assembly information 430 is evaluated to assign a consensus call and quality value for the information. In one aspect, the assignment routine 530 commences with formatting the sample assembly information 430 in a columnar manner 540 wherein sequence fragments that may correspond to overlapping regions of the sample sequence are aligned with respect to one another. Each column of sequence information and corresponding quality values are then evaluated and a consensus basecall and quality value are calculated for each base in the sample sequence in state 545. This procedure may be repeated as necessary for each base in the sample sequence. Thereafter, the results of the consensus-calling assignment routine are returned in state 548 where they may be stored or output to the user. The methods of consensus calling and quality value assignment will be discussed in greater detail hereinbelow with reference to subsequent illustrations and figures.
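The columnar formatting step may be sketched as follows. This is a minimal illustration and not the routine 530 itself: it assumes the fragments have already been aligned, that each fragment is described by a hypothetical tuple of (start offset in the assembly, basecalls, per-base quality values), and that gaps, if any, are already present as symbols in the basecall string.

```python
from typing import List, Tuple

# Hypothetical fragment layout: (start offset, basecalls, per-base quality values).
Fragment = Tuple[int, str, List[int]]


def build_columns(fragments: List[Fragment]) -> List[List[Tuple[str, int]]]:
    """Return, for each assembly position, the (basecall, QV) pairs of the fragments covering it."""
    if not fragments:
        return []
    length = max(start + len(bases) for start, bases, _ in fragments)
    columns: List[List[Tuple[str, int]]] = [[] for _ in range(length)]
    for start, bases, qvs in fragments:
        if len(bases) != len(qvs):
            raise ValueError("each basecall needs exactly one quality value")
        for i, (base, qv) in enumerate(zip(bases, qvs)):
            columns[start + i].append((base, qv))
    return columns
```

Each inner list then holds the evidence for one position of the sample sequence, which is the unit on which a consensus basecall and consensus quality value would be computed.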
Taken together, the information obtained from the sequence fragments 550 may be used in generating the sample assembly for which consensus basecalls may be made using the aligned data. In one aspect, the information corresponding to a selected base of the sample sequence is processed by identifying a column 575 of sequence fragment information. The information contained in each column is then evaluated using the consensus call assignment routine 530 and a consensus basecall 580 is made. In addition to the consensus basecall 580, a consensus quality value 585 may be calculated and associated with the consensus basecall 580 (shown as a bar in the illustration).
The consensus based approach described above is further illustrated for the columnar sequence 575 where the basecalls 590 for each aligned sequence fragment 550 correspond to ‘A’ with varying degrees of certainty or confidence represented by the associated quality value 595. Taken together these quality values 595 may be used to generate the consensus quality value 585 which in some instances may be greater than the individual quality values 595 due to the increased number of comparisons being made. It will be appreciated by one of skill in the art that using information including quality values from more than one sequence fragment 550 may aid in increasing the overall confidence in the resulting consensus.
Following vector construction in state 610, the method 600 proceeds to assign a base weighting value to each basecall in state 620 which may be based upon the putative base composition. In one aspect, the weighting value is used to transform each basecall in the clear range into a quantifiable value that may be averaged with other basecalls. Averaging in this manner permits identified basecalls to be differentially weighted according to their composition. For example, it may be desirable to differentially weight pure-bases as compared to mixed-bases and gaps so as to influence the calculated quality value determined during sequence assembly. In various embodiments, the base weighting value may be assigned according to a user-defined rule set where identified mixed-bases (including, for example, R, Y, K, M, S, W) may be assigned a value of ‘1’. In a similar manner other mixed-bases (including, for example, H, B, V, D, and N) may be assigned a value of ‘2’. The remaining bases G, A, T, C may be assigned a value of ‘0’.
Following base value assignment in state 620 the method 600 proceeds to state 630 where an averaging of the basecall values in the clear range may be performed. In one aspect, a sliding window averaging approach is utilized to determine the mixed-base noise level 510 comprising an average of the weighted values for basecalls within the sliding window. This averaging operation may be performed for each basecall within the clear range to generate per-base noise values that may be used in later consensus call assignment routines. Application of mixed-base noise levels in consensus call assignment will subsequently be described in greater detail.
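The base weighting and sliding window averaging described above can be sketched in Python. The weights come from the example rule set in the text (0 for pure bases, 1 for two-base codes, 2 for the remaining mixed-base codes). The centred window, its truncation at the ends of the clear range, the default window size of 11, and the function name are assumptions made for illustration. No weight is given in the text for gaps, so this sketch rejects them rather than guessing one.

```python
from typing import List

# Example rule set from the text: pure bases 0; R, Y, K, M, S, W 1; H, B, V, D, N 2.
BASE_WEIGHTS = {
    **{b: 0 for b in "GATC"},
    **{b: 1 for b in "RYKMSW"},
    **{b: 2 for b in "HBVDN"},
}


def mixed_base_noise(basecalls: str, window: int = 11) -> List[float]:
    """Per-base noise level: mean weight of the basecalls in a window centred on each base."""
    if window < 1:
        raise ValueError("window must be at least 1")
    try:
        weights = [BASE_WEIGHTS[b] for b in basecalls.upper()]
    except KeyError as exc:
        raise ValueError(f"no weight defined for symbol {exc.args[0]!r}") from None
    half = window // 2
    noise = []
    for i in range(len(weights)):
        lo = max(0, i - half)                 # window is truncated at the ends
        hi = min(len(weights), i + half + 1)
        noise.append(sum(weights[lo:hi]) / (hi - lo))
    return noise
```

A stretch of pure bases yields a noise level of zero, while clusters of mixed-base calls raise the local value, which is the property the later consensus rules rely on.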
Commencing in state 805, sample input is collected which may include the aforementioned sample assembly input 405, as well as any data and information generated during the preprocessing operations 505. The sequence information is thereafter aligned so as to permit columnar analysis for each basecall position in the sample sequence. In state 807, the basecall information contained in an identified column of sequence information is evaluated for gaps. Sequence gaps occur where a particular base (or mixed-base) cannot be identified with a desired or selected degree of confidence or certainty. Gaps may arise from various sources of sequencing error including mobility shifts and basecalling in areas with poorly resolved peaks. In one aspect, a gap may be representative of a position within a particular sequence fragment 305 where the base for a particular position is not readily identifiable.
In one aspect, the consensus-calling method 800 processes gaps by evaluating the basecall information for other sequence fragments 305 within the identified column. For each gap, a quality value may be assigned to the gap region in state 812. In one aspect, assigning quality values to identified gaps aids in evaluating the gap call confidence using other sequence fragments in consensus comparison of the columnar information. The quality value for the gap may further be assigned by determining the average of the quality values of adjacent bases that substantially flank the gap, thereby yielding a numeric value representative of the confidence level for the gap.
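A minimal sketch of gap quality assignment by flank averaging follows. It assumes that a fragment's quality values are held in a list with `None` at gap positions, and that "adjacent bases" means the nearest non-gap call on each side; both choices, and the function name, are illustrative assumptions rather than details given in the text.

```python
from typing import List, Optional


def gap_quality(qvs: List[Optional[float]], gap_index: int) -> Optional[float]:
    """Average the quality values of the nearest non-gap neighbours flanking a gap."""
    left = next((q for q in reversed(qvs[:gap_index]) if q is not None), None)
    right = next((q for q in qvs[gap_index + 1:] if q is not None), None)
    flanks = [q for q in (left, right) if q is not None]
    # With no flanking base on either side there is nothing to average.
    return sum(flanks) / len(flanks) if flanks else None
```

For a fragment whose neighbouring calls have quality values 30 and 20, the gap between them would receive a value of 25 under these assumptions.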
Evaluating the identified basecalls in the corresponding consensus column, the number of gaps in the column is then identified in state 809. If the gap number exceeds a pre-selected gap threshold (determined in state 810) then a gap may be called in state 815. The gap threshold may be defined as the ratio of gaps to non-gap (identifiable) bases within the column and may be represented by a numeric, fractional, or percentile value. In one aspect, the gap threshold may be selected to be between approximately 50%-75% wherein if the ratio of gaps to non-gap calls in the column exceeds the selected gap threshold, a gap is called in the assembled sample sequence.
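The gap threshold test can be written as a short Python sketch. It takes the definition in the text literally, comparing the ratio of gaps to non-gap calls against the threshold. The gap symbol `-`, the default threshold of 0.5, the handling of an all-gap column, and the function name are assumptions for illustration.

```python
def call_gap(column: str, gap_threshold: float = 0.5, gap_symbol: str = "-") -> bool:
    """Return True when the ratio of gaps to non-gap calls in a column exceeds the threshold."""
    gaps = column.count(gap_symbol)
    non_gaps = len(column) - gaps
    if non_gaps == 0:
        # Assumption: a column containing only gaps is called as a gap.
        return gaps > 0
    return gaps / non_gaps > gap_threshold
```

With the default threshold, a column such as `A-A` (ratio 0.5) does not trigger a gap call, whereas `A--` (ratio 2.0) does.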
If the gap threshold is not exceeded in state 810 the method 800 proceeds to state 820 where a base agreement determination is made. In one aspect, the base agreement determination comprises evaluating each of the basecalls for a selected column. In one aspect, if all of the basecalls in the column agree with one another then the basecall may be identified as the consensus call for the column in state 825. For example, if each of the basecalls for the column corresponds to the base ‘G’ then the consensus call for the sample sequence will be assigned as the base ‘G’.
As previously described, a quality value may desirably be associated with each basecall. The quality value may further reflect a calculated degree of confidence with which the basecall is made and, therefore, may serve as an indicator of sample sequence accuracy.
In various embodiments, the quality value for a consensus call wherein all of the bases are in agreement as described above may further be identified by the following equations:
In these equations N represents the number of overlapping sequence fragments for a particular column in the consensus, where both sums are calculated over the sample number. Qcons further represents the quality value assigned to the consensus call and Qi represents the quality value for sample i. In these equations, εij represents a weight derived from the degree of independence of each of the quality-value estimates. In one aspect, the value εij may be used to avoid overweighting of redundant calls. Redundant calls may arise, for example, in instances where a plurality of sequencing traces undergoing analysis contain regions of substantially elevated degrees of similarity.
Furthermore, γij is a derived parameter defined according to Equation 4 below which generally ranges from approximately 0 to 1. This value may further be used to express the degree to which information from samples i and j are independent. In one aspect, for two sequence fragments aligned in opposite orientations (e.g. 5′-3′: 3′-5′), this parameter may be set to a value of ‘1’ whereas for sequence fragments oriented in substantially the same direction, this parameter may be set to a value of ‘0’. In one aspect, use of this parameter aids in representing the notion that the quality of calls obtained from samples sequenced in the forward and reverse orientations may be uncorrelated.
For two overlapping samples sequenced in the same orientation the following relationship may further be identified:
Here, Δij represents the absolute difference, measured in basecalls, between the position of the basecall in sample i and the position of the basecall in sample j. Additionally, Δmax is a parameter that may be used to select how far apart the positions in two samples should be before the quality of the two calls may be considered to be statistically independent. In one aspect, the value for Δmax may range from approximately 50 bp to 200 bp. The parameter γij may further be set with a value of ‘1’ if Δij>Δmax.
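By way of a non-limiting illustration, the independence parameter γij may be sketched as follows. The relationship for same-orientation samples is not reproduced in this text, so the linear ramp from 0 to 1 used below is an assumption chosen only to agree with the stated end points (γij of approximately 0 for closely positioned calls in the same orientation and γij of 1 when Δij exceeds Δmax). The function name `independence` is likewise hypothetical.

```python
def independence(delta_ij, opposite_orientation, delta_max=100):
    """Sketch of the independence parameter gamma_ij (range 0 to 1).

    delta_ij             -- difference in basecall position between samples i and j
    opposite_orientation -- True when the two fragments are aligned in opposite directions
    delta_max            -- separation beyond which calls are treated as independent
                            (source range: ~50-200 bp)
    """
    if opposite_orientation:
        return 1.0  # forward/reverse calls are treated as uncorrelated
    distance = abs(delta_ij)
    if distance > delta_max:
        return 1.0  # far enough apart to be treated as independent
    # Assumption: linear interpolation between 0 and 1 inside the window.
    return distance / delta_max


print(independence(0, opposite_orientation=False))
print(independence(50, opposite_orientation=False))
print(independence(250, opposite_orientation=False))
```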
It is observed that quality value assignment according to the relationships described by Equations 2 and 3 provides several desirable properties. In one aspect, when the quality values for each basecall within a particular column are substantially independent, it may be determined that γij=1 and as a result,
In this instance, the following relationship may be used to represent the consensus quality value:
Qcons=−10 log pcons Equation 5
In this equation pcons reflects the probability that the consensus call is an error, and the consensus quality value may be determined from this probability using the logarithmic relationship. In one aspect, the equation has been found to apply when the sequence fragment quality values are statistically uncorrelated and, therefore, may be useful when calculating the consensus quality value in such cases.
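By way of a non-limiting illustration, the logarithmic relationship may be exercised as follows. The sketch assumes a base-10 logarithm and further assumes, for illustration only, that the error probability of an agreeing consensus call is the product of the individual error probabilities when the calls are statistically uncorrelated. Under these assumptions the consensus quality value equals the sum of the individual quality values. The function names are hypothetical.

```python
import math


def error_prob_to_quality(p):
    """Q = -10 * log10(p), assuming a base-10 logarithm."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Inverse of the relationship above: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Two uncorrelated calls with quality values 20 and 30 (error probabilities 0.01 and 0.001).
qualities = [20, 30]
# Assumption: the consensus is in error only if every independent call is in error.
p_cons = math.prod(quality_to_error_prob(q) for q in qualities)
q_cons = error_prob_to_quality(p_cons)
print(round(q_cons, 6))  # equals the sum of the individual quality values
```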
While this equation may be used in certain circumstances, a potential difficulty may be encountered when applying this relationship to evaluate quality values for redundant sequence information present in a selected column of consensus information. In such an instance, the quality of the consensus call for redundant sequence information may be overestimated without further analysis. For example, when substantially similar electropherogram information corresponding to redundant sequence fragments generated by similar PCR and sequencing primers are included in the same column assembly, there may be a tendency to overestimate the quality value when calculating it according to the aforementioned relationships. Typically, basecalls arising from redundant sequence fragments are highly correlated where a mistake or inaccuracy that has been made in the basecalls for one sample may also be made in the basecalls for a second sample. Therefore, the resultant quality values, if summed, may result in an overestimation of the true quality of the call. The present teachings overcome this tendency towards overestimation by applying an alternative relationship when dealing with highly correlated (e.g., redundant) basecalls.
In one aspect, using the relationship described in Equation 3, substitution of γij=0 into this equation generates a reduced expression
Substitution of this expression into Equation 2 therefore reflects an average of the quality for the sequence fragments without undesirable overestimation. Therefore, use of the independence parameter, γij provides a means to generate a superior estimate for redundant samples using the relationships shown in Equations 2 and 3 as compared to that provided by Equation 4.
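By way of a non-limiting illustration, the two limiting behaviors described above may be sketched as follows. Equations 2 and 3 are not reproduced in this text, so the per-sample weight used below is a hypothetical form chosen only because it reduces to a sum of quality values when every γij is 1 and to an average when every γij is 0. It is not asserted to be the relationship of the present teachings.

```python
def consensus_quality(qualities, gamma):
    """Weighted consensus quality for a column in which all basecalls agree.

    qualities -- list of per-sample quality values
    gamma     -- square matrix; gamma[i][j] is the independence parameter (0 to 1)
    """
    n = len(qualities)
    total = 0.0
    for i in range(n):
        # Assumption: each correlated partner (gamma near 0) adds to the redundancy of sample i.
        redundancy = sum(1.0 - gamma[i][j] for j in range(n) if j != i)
        weight = 1.0 / (1.0 + redundancy)
        total += weight * qualities[i]
    return total


qualities = [20, 30, 40]
independent = [[1.0] * 3 for _ in range(3)]
redundant = [[0.0] * 3 for _ in range(3)]
print(consensus_quality(qualities, independent))  # sum of the quality values
print(consensus_quality(qualities, redundant))    # average of the quality values
```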
In another aspect, the consensus sequence method avoids false negative basecalling in state 825 by screening for the presence of potential mixed-bases that may not have been previously detected or identified in the sample sequence input. False negatives correspond to true mixed-basecalls that may have been identified as pure-bases. False negatives may further generate undesirably high quality values if directly calculated using the relationships shown in Equations 2 and 3. The screening process therefore desirably helps to avoid quality value assignment that may inaccurately reflect the basecall confidence. When possible mixed-bases are found by the screening process, the same pure-basecall may still be called; however, the quality value associated with the call is adjusted accordingly in state 875 to reflect a lower quality or confidence level. In one aspect, adjustment of the quality value for a possible mixed-base depends on the result of the relationship shown in Equation 5 which will be described in greater detail hereinbelow.
For basecalls that are determined not to be in agreement in state 820, the method 800 proceeds to state 830 where a more rigorous evaluation of mixed-bases is performed. In one aspect, it is desirable to perform a rigorous evaluation of mixed-bases in this state 830 in order to reduce the number of false positives that might otherwise be called at the single strand level and which might persist during consensus calling. In one aspect, an increased number of false positives persisting in the assembly is undesirable as they may be carried over to the consensus sequence. Therefore, rigorous evaluation of mixed-bases in state 830 is desirable to overcome some of the conventional difficulties encountered when assigning mixed-basecalls.
In many conventional approaches, mixed-basecalls may be resolved by estimating the quality of the mixed-basecall versus a pure-basecall at the same location or column and thereafter selecting the higher quality call. While this approach may produce the correct basecall at the single-strand level (e.g. for single sequence fragments), there is a likelihood of missing difficult or elusive mixed-bases using this approach. In one aspect, conventional analysis methods may improperly call mixed-bases when a single pure-base component predominates over other bases in the electropherogram. In various embodiments, the present teachings overcome this difficulty due, in part, to the availability of the plurality of basecalls for each column. Using the multiplicity of basecalling information, the rigorous evaluation used to identify mixed-bases may produce superior results as compared to conventional methods.
According to state 830, a threshold parameter may be designated between approximately 5-30%. In one aspect, this parameter is selected to aid in the identification of smaller mixed-base peaks while potentially avoiding larger peaks that may be associated with false positive calls. Previously identified mixed-bases may then be recalled and evaluated using the more rigorous approach. In various embodiments, the approach may comprise evaluating the recalled mixed-bases to determine a modified quality value according to the following relationship:
This equation comprises a scaling factor by which previously identified quality values may be multiplied to generate the modified quality value. In various embodiments, this relationship helps to avoid undesirably high quality value assignment for missed mixed-bases (e.g. false negatives). In one aspect, application of the scaling factor desirably reduces the quality values for pure-bases in the column when a mixed-base has been called.
In this equation, δ and λ represent empirically determined parameters, β0 represents the threshold that the mixed-base was identified at, and β reflects a selected minimum threshold for identifying mixed-bases according to the method 800. Exemplary ranges for each of these parameters correspond approximately to δ=0.1-0.5, λ=0.2-0.9, and β=0.01-0.30. In one aspect, the threshold may be expressed as a percentage of a primary peak height found in the electropherogram. In another aspect, scaling by the factor obtained from Equation 5 desirably reduces the quality value of recalled mixed-bases with smaller secondary peaks, while allowing a relatively high quality value for recalled mixed-bases with higher secondary peaks.
After the mixed-bases are recalled using the more rigorous approach, the basecalls are checked for agreement in state 835. If the basecalls are determined to agree then the consensus basecall may be assigned as the agreeing basecall in state 840 and the quality value may be determined according to Equations 2 and 3 described above.
In those instances where the basecalls compared in state 835 are determined not to agree, the method proceeds through a series of additional operations shown in
In this equation, the summation of basecalls is taken over an identified sample set comprising one or more basecalls to generate a value associated with the conflicting call αi. Additionally, the individual sample vote cj may be defined according to the following relationships depending upon the nature of the basecall. For example, when the basecall is associated with a pure-base, the individual sample vote may be determined according to the following equation:
cj=εQV Equation 7
In this equation, ε is a scaling parameter that may be used to increase the vote weight for pure-bases. In one aspect, the value of ε may be between approximately 1.50-2.00. In one aspect, this equation may be applied to offset the statistical limitation that a quality value assigned by conventional basecalling applications may be excessively high for mixed-basecalls compared to similar quality values for pure-basecalls. Additionally, application of this equation may desirably help to reduce the false-positive rate of the consensus-calling methodology, thereby allowing a lower mixed-base threshold to be used for the single-strand basecalling. Application of this relationship may further be used to moderate the number of false negatives by adjusting the quality values for each basecall to better balance mixed-basecall and pure-basecall quality values.
In another aspect, the individual sample vote cj may be defined according to another relationship when the basecall comprises a mixed-base
cj=κQV Equation 8
In this equation, κ is a scaling factor which may be used to decrease the vote weight for mixed-basecalls. In one aspect, this equation may be additionally used to reduce the mixed-base vote in the presence of local mixed-base noise (defined above). For example, by defining χ as the local mixed-base noise at the basecall position of interest, then κ=(1−η)[(1−η)(1−χ^(3/2))+η−ηχ]+η where η is a parameter which approximates how much the vote may be lowered for mixed-basecalls made in the presence of local mixed-base noise. In various embodiments, the value of η may be selected from between approximately 0.1-0.5.
In one aspect, the false positive rate may further be reduced by assigning a weight of (1−κ)QV to the primary peak associated with the mixed-basecall. Furthermore, if the mixed-basecall was identified in state 830 then a weight of approximately 0.5 εQV may additionally be added to the primary peak basecall. This approach may be used to desirably reduce the false positive rate for the rigorous basecalls made in this manner.
The aforementioned summation indicated by Equation 6 may additionally be weighted for the presence of basecalls made in the same orientation. In one aspect, two or more samples in the same orientation may provide redundant estimates of the appropriate basecall. To offset this effect, a differential weighting of the redundant basecalls may be made. For example, the highest quality call in each orientation may be assigned with a weight of approximately ‘1’ and other identical calls in the same orientation may be assigned with a weight of approximately ‘0.5’.
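By way of a non-limiting illustration, the scaling factor κ may be evaluated as follows. The sketch assumes that the noise term in the expression above is χ raised to the power 3/2 and that χ lies between 0 and 1; both are assumptions, as is the function name `mixed_vote_scale`. Under these assumptions κ equals 1 when no local mixed-base noise is present and falls to η when the noise is maximal.

```python
def mixed_vote_scale(chi, eta=0.3):
    """Scaling factor kappa applied to the vote of a mixed-basecall.

    chi -- local mixed-base noise at the position (assumed to lie in [0, 1])
    eta -- how far the vote may be lowered (source range: ~0.1-0.5)
    """
    # Assumption: the exponent on chi is 3/2.
    return (1 - eta) * ((1 - eta) * (1 - chi ** 1.5) + eta - eta * chi) + eta


for chi in (0.0, 0.5, 1.0):
    print(chi, round(mixed_vote_scale(chi), 4))
```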
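By way of a non-limiting illustration, the weighted vote and the same-orientation weighting may be sketched together as follows. The tuple layout, the function name `tally_votes`, and the orientation labels are assumptions made for this sketch. The additional weight assigned to the primary peak of a mixed-basecall is not modelled here.

```python
from collections import defaultdict


def tally_votes(calls, epsilon=1.5):
    """Sum the weighted votes for each candidate basecall in a column.

    calls   -- iterable of (call, qv, orientation, kappa) tuples; kappa is None
               for a pure-base and a value in (0, 1] for a mixed-base
    epsilon -- pure-base vote scale (source range: ~1.50-2.00)
    """
    groups = defaultdict(list)
    for call, qv, orientation, kappa in calls:
        vote = epsilon * qv if kappa is None else kappa * qv
        groups[(call, orientation)].append((qv, vote))

    totals = defaultdict(float)
    for (call, _orientation), entries in groups.items():
        # The highest quality call in each orientation counts fully; identical
        # calls in the same orientation are treated as redundant and count half.
        entries.sort(key=lambda entry: entry[0], reverse=True)
        totals[call] += entries[0][1] + 0.5 * sum(vote for _, vote in entries[1:])
    return dict(totals)


calls = [
    ("A", 30, "fwd", None),
    ("A", 20, "fwd", None),  # redundant with the call above
    ("A", 25, "rev", None),
    ("M", 40, "rev", 0.8),   # mixed-base (A/C) with kappa = 0.8
]
votes = tally_votes(calls)
winner = max(votes, key=votes.get)
print(votes, winner)
```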
Upon completion of the aforementioned analytical steps, the basecall with the greater or substantially maximal vote may be selected by the consensus-calling routine in state 855. Subsequently in state 860, the consensus calling routine proceeds to perform a basecall validation check. The basecall validation check is used to verify that the consensus basecall is substantially supported by each of the samples in the assembly. In one aspect, if the call is a mixed-basecall this check may involve verifying that each of the samples has at least some degree of signal present at this position in the electropherogram. If it is determined that there is support of the mixed-basecall in each of the samples then the basecall is finalized and a quality value generated in state 865. In one aspect quality value determination for the basecall assignment may proceed according to Equation 9 described in greater detail hereinbelow. In various embodiments the mixed-base validation operation performed in state 860 desirably aids in the determination of whether there is support for the mixed-base consensus call. This operation may additionally serve as a screen to reduce the occurrence of false positives.
In one aspect, mixed-base validation may comprise verifying that each of the sample fragments contributing to the basecall has a signal present at or above a pre-selected level, φ. Furthermore, φ may be selected to be between approximately 1%-10% of the primary electropherogram peak amplitude. As previously described, should the consensus call pass the validation check, a quality value may then be assigned to the consensus call in state 865 according to the following relationship:
In this equation, ζ is a preselected weighting factor that may be used to express the degree of confidence in the selected vote with the summation taken over all of the votes not in favor of the consensus basecall. In one aspect, this relationship may produce a negative value which may further result in the consensus quality value being assigned to a selected minimum default value. In various embodiments, the weighting factor may be determined to represent a value between approximately 1.0-1.5.
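By way of a non-limiting illustration, the quality value assignment of state 865 may be sketched as follows. Equation 9 is not reproduced in this text, so the form used below (the winning vote less ζ times the sum of the opposing votes, with a negative result replaced by a minimum default) is an assumption inferred from the description above. The function name `vote_quality` and the default value `q_min` are hypothetical.

```python
def vote_quality(votes, consensus, zeta=1.2, q_min=1.0):
    """Consensus quality value derived from a weighted vote.

    votes     -- mapping of basecall to its summed vote
    consensus -- the basecall selected as the consensus
    zeta      -- weighting factor (source range: ~1.0-1.5)
    q_min     -- hypothetical minimum default quality value
    """
    against = sum(vote for call, vote in votes.items() if call != consensus)
    quality = votes[consensus] - zeta * against  # assumed form of the relationship
    return quality if quality >= 0 else q_min


print(vote_quality({"A": 97.5, "M": 32.0}, "A"))
print(vote_quality({"A": 10.0, "M": 20.0}, "A"))  # negative result, so the default is used
```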
If the mixed-base validation of state 860 fails then the method 800 proceeds to state 870 where a basecall intersection is determined. The mixed-basecall intersection may comprise evaluating each of the sample fragment basecalls that contribute to the consensus basecall. In one aspect, the intersection may be defined in a straightforward manner where the intersection between one or more pure-basecalls and one or more mixed-basecalls is determined. For example, in one exemplary comparison an intersection between an ‘A’ basecall (a pure-base) and an ‘M’ basecall (an A/C mixed-base) results in an intersection base of ‘A’. Likewise, the intersection between an ‘S’ basecall (a G/C mixed-base) and a ‘W’ basecall (an A/T mixed-base) produces the empty set. In one aspect, if the intersection is the empty set then the original consensus basecall may be assigned but with a quality value given by Equation 9 with all of the votes reduced by a predefined factor such as 0.5. If the intersection reflects a nonempty value (e.g. a basecall such as A, G, T, or C) then the quality value assigned to this new consensus basecall may be given by Equation 9 without the aforementioned reduction in quality.
In one aspect, the sample processing module 910 may comprise functionality for receiving the input sequence data and/or trace information and thereafter performing operations 925 which may include: Basecalling, Identification of mixed-bases, Quality value assignment, Identification of heterozygous frameshift mutations, Clear range calculation, and Data filtering/Smoothing operations. Similarly, the specimen processing module 915 may comprise functionality for performing operations 930 associated with sequence assembly to thereafter be used in determination of one or more consensus sequences according to the methods described by the present teachings. The project processing module 920 may then provide functionality for performing post-consensus determination operations 935 including, for example: Aligning the one or more identified consensus sequences, Identifying nucleotide variants, Translating the one or more consensus sequences into amino acid or protein sequences, and Identifying amino acid or protein sequence variants.
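By way of a non-limiting illustration, the basecall intersection of state 870 may be sketched using the standard IUPAC nucleotide ambiguity codes. The function name `intersect_calls` and the use of `None` to denote the empty set are assumptions made for this sketch.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def intersect_calls(calls):
    """Return the IUPAC code common to all calls, or None for the empty set."""
    common = set("ACGT")
    for call in calls:
        common &= set(IUPAC[call])
    return CODE_FOR.get(frozenset(common))


print(intersect_calls(["A", "M"]))  # 'A' is contained in 'M' (A/C)
print(intersect_calls(["S", "W"]))  # G/C and A/T share no base
```

When the function returns `None`, the original consensus basecall would be retained with its votes reduced (for example by a factor of 0.5) before the quality value is determined, as described above.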
Further details of these operations may be found in previous sections of the present teachings and the following commonly assigned U.S. patent applications which are hereby incorporated by reference in their entirety: application Ser. No. 09/658,161 filed Sep. 8, 2000 and entitled: A system and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms; Provisional Application Ser. No. 60/371,641 filed Apr. 10, 2002 and entitled: Method to Detect and Identify Heterozygous Frameshift Mutations Using Direct Sequencing along with its corresponding non-provisional application (Serial Number to be assigned).
It will be appreciated by one of skill in the art that the aforementioned modular arrangement may be executed in numerous different manners such as using more or fewer modules, performing additional operations, performing some, but not all, of the aforementioned operations, and other variations to be used in sequence analysis applications. As such, these variations are considered but other embodiments of the present teachings.
The above-described teachings present novel methods by which sequence analysis basecalling may be performed. In various embodiments, use of these methods may improve the accuracy of automated systems that are designed for high-throughput sequence analysis. It is conceived that these methods may be adapted for use with numerous sequencing applications including, but not limited to, heterozygote detection, single nucleotide polymorphism analysis, and general sequence assembly tasks. Additionally, these methods may be readily integrated into new and existing sequence processing applications, software, and instrumentation.
In various embodiments, sequence analysis software applications that integrate the methodology described herein yield highly accurate basecalls with a low observed error rate. For example, these methods have been integrated into the SeqScape™ software application for variant identification and resequencing (Applied Biosystems, CA) and found to perform with an error rate at or below 1% even with diverse data sets which may include conventionally problematic data such as experimental or systematic noise, dye blobs, and mobility shifts.
Although the above-disclosed embodiments of the present invention have shown, described, and pointed out the fundamental novel features of the invention as applied to the above-disclosed embodiments, it should be understood that various omissions, substitutions, and changes in the form of the detail of the devices, systems, and/or methods illustrated may be made by those skilled in the art without departing from the scope of the present invention. Consequently, the scope of the invention should not be limited to the foregoing description, but should be defined by the appended claims.
All publications and patent applications mentioned in this specification are indicative of the level of skill of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.
1. A basecalling method for predicting the composition of a polynucleotide sequence, the method comprising:
- receiving information for a plurality of sequence fragments comprising basecalls spanning at least a portion of the sample sequence and corresponding quality values indicative of a calculated degree of confidence in the basecalls;
- aligning the plurality of sequence fragments to identify regions of basecall overlap between the sequence fragments; and
- calculating a consensus basecall and a consensus quality value for the regions of basecalling overlap in the aligned sequence fragments wherein the consensus quality values are determined using the quality values for the basecalls of the sequence fragments.
2. The basecalling method of claim 1, wherein mixed-basecalls present in the sequence fragments are evaluated during calculation of the consensus basecall by comparing the basecalls in the regions of basecall overlap to identify one or more constituent pure basecalls.
3. The basecalling method of claim 1, wherein calculation of the consensus basecall further comprises distinguishing a noise-related component for at least one of the mixed-basecalls to thereby resolve the mixed-basecall into one or more constituent pure basecalls.
4. The basecalling method of claim 1, wherein the consensus basecall is calculated by applying a differential weight to each of the quality values for the basecalls of the aligned sequence fragments and thereafter the consensus basecall is assigned as the basecall for the sequence fragments with the greatest overall quality value.
5. The basecalling method of claim 4, wherein calculation of the consensus basecall further comprises identifying the intersection between one or more of the basecalls for the sequence fragments.
6. The basecalling method of claim 5, wherein identification of the intersection between one or more of the basecalls for the sequence fragments is used to resolve mixed-basecalls into one or more constituent pure basecalls.
7. The basecalling method of claim 6, wherein when no intersection between one or more of the basecalls for the sequence fragments is observed then the consensus basecall is assigned as one of the basecalls for the sequence fragments with a reduced quality value.
8. The basecalling method of claim 1, wherein the consensus basecall is determined by identifying an agreeing basecall for the aligned sequence fragments and the consensus basecall is assigned as the agreeing basecall.
9. The basecalling method of claim 8, wherein agreeing basecall identification is performed following re-calling of one or more of the basecalls for the sequence fragments using a more stringent basecalling criteria.
10. The basecalling method of claim 8, wherein when a lack of agreement in the basecalls for the aligned sequence fragments is observed then the consensus basecall is determined using a weighted basecall vote.
11. The basecalling method of claim 1, wherein calculation of the consensus basecall is used to identify heterozygosity between allelic variants of the polynucleotide sequence.
12. The basecalling method of claim 1, wherein calculation of the consensus basecall is used to identify single nucleotide polymorphisms contained within the polynucleotide sequence.
13. The basecalling method of claim 1, wherein calculation of the consensus basecall is used to identify heterogeneous polynucleotide sequence populations.
14. A system for predicting the composition of a polynucleotide sequence, comprising:
- a sample processing module that receives information for a plurality of sequence fragments, the sample processing module providing functionality for identifying basecalls spanning at least a portion of the sample sequence and corresponding quality values indicative of a calculated degree of confidence in the basecalls;
- a specimen processing module that assembles the plurality of sequence fragments to identify regions of basecall overlap between the sequence fragments; and
- a project processing module that calculates a consensus basecall and a consensus quality value for the regions of basecalling overlap in the assembled sequence fragments wherein the consensus quality values are determined using the quality values for the basecalls of the sequence fragments.
15. The system of claim 14, wherein the project processing module evaluates mixed-basecalls present in the sequence fragments during calculation of the consensus basecall.
16. The system of claim 14, wherein the project processing module further identifies the consensus basecall by distinguishing a noise-related component for at least one of the mixed-basecalls to thereby resolve the mixed-basecall into one or more constituent pure basecalls.
17. The system of claim 14, wherein the project processing module calculates the consensus basecall by applying a differential weight to each of the quality values for the basecalls of the aligned sequence fragments and thereafter assigns the consensus basecall as the basecall for the sequence fragments with the greatest overall quality value.
18. The system of claim 17, wherein the project processing module calculates the consensus basecall by identifying the intersection between one or more of the basecalls for the sequence fragments.
19. The system of claim 18, wherein the project processing module identifies the intersection between one or more of the basecalls for the sequence fragments to resolve mixed-basecalls into one or more constituent pure basecalls.
20. The system of claim 14, wherein the project processing module determines the consensus basecall by identifying an agreeing basecall for the aligned sequence fragments and thereafter assigns the consensus basecall as the agreeing basecall.
21. The system of claim 14, wherein the modules are used to identify heterozygosity between allelic variants of the polynucleotide sequence.
22. The system of claim 14, wherein the modules are used to identify single nucleotide polymorphisms contained within the polynucleotide sequence.
23. The system of claim 14, wherein the modules are used to identify heterogeneous polynucleotide sequence populations.
24. The system of claim 14, wherein the modules comprise a single unified sequence analysis module.
25. The system of claim 14, wherein the modules are integrated into a sequence analysis software application.
International Classification: G06F 19/00 (20060101);