Allelotyping Methods for Massively Parallel Sequencing
In one illustrative embodiment, an allelotyping method may include selecting a plurality of text strings that each represent a nucleotide sequence that was read by a massively parallel sequencing (MPS) instrument, where the nucleotide sequences represented by the selected plurality of text strings each correspond to a particular locus, comparing the selected plurality of text strings to one another to determine an abundance count for each unique text string included in the selected plurality of text strings, and determining one or more alleles for the particular locus by comparing the abundance count for each unique text string included in the selected plurality of text strings to an abundance threshold.
The instant application is a continuation of U.S. patent application Ser. No. 13/952,761, now U.S. Pat. No. 11,468,970, which was filed Jul. 29, 2013, and is hereby incorporated by reference in its entirety. The instant application also claims priority to pending U.S. patent application Ser. No. 14/489,198, which was filed on Sep. 17, 2014, and is hereby incorporated by reference in its entirety.
SEQUENCE LISTING
The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Oct. 11, 2022, is named 920006-373507_SL.txt and is 3,550 bytes in size.
TECHNICAL FIELD
The present disclosure relates, generally, to allelotyping methods and, more particularly, to allelotyping methods for nucleotide sequence data obtained using massively parallel sequencing (MPS).
BACKGROUND
Polymorphic tandem repeats of nucleotide sequences are found throughout the human genome, and the particular combinations of alleles identified by their multiple repeat sites are sufficiently unique to an individual that these repeating sequences can be used in human or other organism identification. These markers are also useful in genetic mapping and linkage analysis, where the tandem repeat sites may be useful for determining, for example, predisposition for disease. Tandem repeats can be used directly in human identity testing, such as in forensics analysis. There are many types of tandem repeats of nucleic acids, falling under the general term variable number tandem repeats (VNTR). For example, minisatellites and microsatellites are VNTRs, and microsatellites include short tandem repeats (STRs).
One application of tandem repeat analysis is in forensics or human identity testing. In current forensics analyses, highly polymorphic STRs are identified using a deoxyribonucleic acid (DNA) sample from an individual and DNA amplification steps, such as polymerase chain reaction (PCR), to provide amplified samples of partial DNA sequences, or amplicons, from the individual's DNA. The amplicons can then be matched by size (i.e., repeat numbers) to reference databases, such as the sequences stored in national or local DNA databases. For example, amplicons that originate from STR loci can be matched to reference STR databases, including the Federal Bureau of Investigation (FBI) Combined DNA Index System (CODIS) database in the United States, or the National DNA Database (NDNAD) in the United Kingdom, to identify the individual by matching to the STR alleles specific to that individual.
Forensic DNA analysis is about to cross a threshold where DNA samples will begin to be analyzed routinely by massively parallel sequencing (MPS), also sometimes referred to in the art as next-generation sequencing. The advent of routine MPS for forensic DNA analysis will create large quantities of nucleotide sequence data that may enable richer exploitation of DNA in forensic applications. Once information is generated on the genetic profile of an individual (e.g., for either forensic investigative purposes or confirmatory matching), the resulting nucleotide sequence data should be formatted for exchange among law enforcement entities. Moreover, forensic analysis requires the preservation of data, including raw data, for evidentiary purposes. Data files created by MPS workflows are typically larger than 1 GB, making them difficult to transmit or store. In addition, these files, while text-based, are not human-readable in any practical sense because of their large size. Thus, other human readable forms of the data from MPS workflows are needed.
SUMMARY
According to one aspect, an allelotyping method may comprise selecting a plurality of text strings that each represent a nucleotide sequence that was read by a massively parallel sequencing (MPS) instrument, where the nucleotide sequences represented by the selected plurality of text strings each correspond to a particular locus, comparing the selected plurality of text strings to one another to determine an abundance count for each unique text string included in the selected plurality of text strings, and determining one or more alleles for the particular locus by comparing the abundance count for each unique text string included in the selected plurality of text strings to an abundance threshold.
In some embodiments, determining the one or more alleles for the particular locus may comprise identifying one or more unique text strings that each represent a nucleotide sequence containing a short tandem repeat (STR). In other embodiments, determining the one or more alleles for the particular locus may comprise identifying one or more unique text strings that each represent a nucleotide sequence containing a single nucleotide polymorphism (SNP).
In some embodiments, comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold may comprise identifying the unique text string having a highest abundance count from among the selected plurality of text strings. Comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold may further comprise calculating whether a ratio of the abundance count for each unique text string compared to the highest abundance count exceeds the abundance threshold. The abundance threshold may be a percentage value in the range of 15% to 60%. The abundance threshold may be a percentage value that is configurable by a user.
In some embodiments, the allelotyping method may further comprise receiving a first text-based computer file comprising a plurality of text strings that each represent a nucleotide sequence that was read by the MPS instrument, prior to selecting the plurality of text strings for which the represented nucleotide sequences each correspond to the particular locus, and generating a second text-based computer file comprising each unique text string for which the ratio exceeds the abundance threshold, where a file size of the second text-based computer file is smaller than a file size of the first text-based computer file. The second text-based computer file may comprise one or more unique text strings that each represent a nucleotide sequence determined to be an allele for the particular locus. The file size of the second text-based computer file may be at least ten-thousand times smaller than the file size of the first text-based computer file.
In some embodiments, the steps of (i) selecting the plurality of text strings for which the represented nucleotide sequences each correspond to a particular locus, (ii) comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings, and (iii) determining one or more alleles for the particular locus by comparing the abundance counts to the abundance threshold may be performed for each of a plurality of loci present in a sample that was read by the MPS instrument. Selecting the plurality of text strings for which the represented nucleotide sequences each correspond to one of the plurality of loci may comprise determining whether each of a plurality of text strings generated by the MPS instrument when reading the sample includes text characters that represent a flanking nucleotide sequence associated with a particular locus and selecting a plurality of text strings that include the text characters that represent the flanking nucleotide sequence associated with the particular locus.
In some embodiments, the allelotyping method may further comprise removing the text characters that represent the flanking nucleotide sequence from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings. The allelotyping method may further comprise removing all text characters that do not represent a short tandem repeat (STR) from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings.
According to another aspect, a computer-readable medium may store a text-based computer file comprising one or more records. Each of the one or more records may include a first text line, a second text line comprising a text string representing a nucleotide sequence containing a single nucleotide polymorphism (SNP), a third text line comprising a human-readable allele designation for the SNP, and a fourth text line.
In some embodiments, the human-readable allele designation of the third text line may comprise a number of attribute-value pairs that specify a first SNP state, a second SNP state, an abundance count of the first SNP state, an abundance count of the second SNP state, and a strand of the nucleotide sequence represented in the second text line. The human-readable allele designation of the third text line may further comprise an attribute-value pair specifying a reference SNP identifier associated with the nucleotide sequence represented in the second text line.
In some embodiments, the first text line may comprise a unique sequence identifier created by a massively parallel sequencing (MPS) instrument when generating data related to the nucleotide sequence represented in the second text line. The first text line may further comprise forensic metadata specifying one or more of a file format, a unique case identifier, a unique sample identifier, a unique laboratory identifier, and a unique technician identifier. The fourth text line may comprise a text string representing quality scores associated with the nucleotide sequence represented in the second text line.
The concepts described in the present disclosure are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. The detailed description particularly refers to the accompanying figures in which:
While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the figures and will be described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but to the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory computer-readable storage medium, which may be read and executed by one or more processors. A computer-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a computing device (e.g., a volatile or non-volatile memory, a media disc, or other media device).
In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.
The present disclosure relates to methods of allelotyping loci using an approach based on matching and sorting of the reads generated by an MPS instrument when analyzing a sample. STR loci exist in multiple states, called alleles. STR alleles always differ from one another by nucleotide sequence. In addition, alleles often differ from one another by sequence length. Allelotyping is the process of identifying the one or more alleles present at a particular STR locus in a given sample (by way of example, for a sample from one human, the one allele where the locus is homozygous or the two alleles where the locus is heterozygous). The FBI and other law enforcement agencies around the world have selected about twenty-four specific STR loci for use in forensic DNA analysis applications. These STR loci have been selected because they are highly polymorphic, i.e., multiple alleles exist within the relevant populations. For example, the thirteen STR loci considered by the FBI as their “core” loci each exhibit eight to ten different common allelic forms.
A typical MPS process starts with a DNA sample, which is typically amplified using PCR (however, it will be appreciated that certain DNA samples can also be directly sequenced by MPS instruments without pre-amplification). The output of the MPS process typically takes the form of one or more computer files containing text strings representing the nucleotide sequences that were read by the MPS instrument. The parallel nature of MPS results in numerous replicates of the nucleotide sequence of each allele. This is particularly true when the DNA is pre-amplified by PCR, where the number of replicates of each allelic sequence may number in the tens of thousands. However, both the PCR pre-amplification process and the sequencing process itself have non-trivial error rates. Errors are realized as incorrect nucleotide sequences in many of the copies of the original DNA sequences, i.e., the “true” alleles that existed in the original DNA sample. These sequence errors may act to obscure the nucleotide sequences of the “true” alleles.
The diversity of unique nucleotide sequences generated by a PCR-MPS workflow can be illustrated using an experiment performed on a DNA sample from an anonymous human subject. This anonymous human subject was known to carry allele numbers 9 and 10 (with known sequences [AGAT]9 (SEQ ID NO: 1) and [AGAT]10 (SEQ ID NO: 2), respectively) at the CSF1PO locus. The MPS instrument returned approximately 20,000 replicate sequences for the two alleles at this locus (i.e., an average of 10,000 replicate sequences per allele). In the absence of error, only these two “true” allele nucleotide sequences would be returned in the data generated by the MPS instrument. However, because of error in the PCR amplification and MPS processes, a total of 598 unique nucleotide sequences were observed. Two correspond to the “true” alleles, and the remaining 596 correspond to method artifacts and sequencing errors.
Referring now to
The allelotyping method 100 begins with block 102 in which a number of text strings that each represent a nucleotide sequence that was read by an MPS instrument are received. In these text strings, each character represents one base of a nucleotide sequence read by the MPS instrument. As noted above, due to the parallel nature of the MPS process, one sample may result in tens or hundreds of thousands of reads and, hence, associated text strings. Many MPS instruments output this data in files that follow either the FASTA or FASTQ format. As such, in some embodiments of the allelotyping method 100, block 102 may involve receiving such a FASTA or FASTQ file and reading a number of text strings that each represent a nucleotide sequence from that file. In other embodiments, block 102 may involve receiving files in different formats (other than FASTA or FASTQ).
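By way of a non-limiting sketch, the reading step of block 102 might be implemented in Python as shown below. The function name read_fastq_sequences is hypothetical, and the sketch assumes a FASTQ file with strict four-line records (identifier, sequence, separator, quality) in which sequences are not wrapped across lines; files that depart from that layout would need a more tolerant parser.

```python
import io


def read_fastq_sequences(handle):
    """Yield the sequence line (second line) of each four-line FASTQ record.

    Assumes strict four-line records with no line wrapping.
    """
    for line_number, line in enumerate(handle):
        # Lines 1, 5, 9, ... (zero-based) hold the nucleotide text strings.
        if line_number % 4 == 1:
            yield line.strip().upper()


# Two made-up records standing in for the output of an MPS instrument.
example_file = io.StringIO(
    "@read1\nAGATAGAT\n+\nIIIIIIII\n"
    "@read2\nagatagac\n+\nIIIIIIII\n"
)
reads = list(read_fastq_sequences(example_file))
```

Normalizing to upper case is a choice made for this sketch so that later exact-match comparisons are not affected by letter case.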
After block 102, the allelotyping method 100 proceeds to block 104 in which a bioinformatic matching procedure is used to determine which reads from the MPS instrument (represented by the received text strings) correspond to one or more particular loci present in the sample. In the illustrative embodiment of block 104, each of the received text strings is evaluated to determine whether it includes characters that represent a known flanking nucleotide sequence associated with a particular locus. The nucleotide sequence of each STR locus is generally flanked, on both the 5′ and 3′ ends, by particular flanking nucleotide sequences. These complex nucleotide sequences can be rare or unique in the DNA analyzed by the MPS instrument. As such, the presence of one (or both) of these flanking nucleotide sequences may indicate the presence of an STR locus in the nucleotide sequence represented by the text string. It is contemplated that, in other embodiments, the text strings may be searched for other specific nucleotide sequences (e.g., the 5′ primer sequences used to PCR amplify each STR locus).
The flanking nucleotide sequences, or other specific nucleotide sequences, used in block 104 may be any length. Longer sequences are more likely to be unique, but increasing length should be balanced against the slower bioinformatic processing inherent to long-sequence matching routines. Moreover, the longer a sequence, the more likely that reads which should contain the sequence will contain an error, reducing the sequence's value as an identifier. In some embodiments, block 104 may reference an external, updatable library of flanking nucleotide sequences (or other specific nucleotide sequences). The text string matching routine used in block 104 can be stringent (e.g., exact character matching) or flexible (e.g., close, overall matching). Block 104 may use any number of bioinformatic approaches, including, but not limited to, string matching routines built into common programming languages (e.g., Java, C++, Perl or Python), regular expression routines built into some programming languages (e.g., Perl) or available as modules for other programming languages, or the like. In some embodiments of block 104, each text string determined to correspond to a particular locus may be grouped or segregated by locus. In other embodiments, each text string determined to correspond to a particular locus may be metadata tagged with that particular locus. In some embodiments, text strings that are not identified as corresponding to any locus (e.g., where the text string does not contain a known flanking nucleotide sequence) may be discarded.
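One way the stringent (exact character matching) variant of block 104 might look is sketched below in Python. The locus names and flanking sequences in FLANKS are short placeholders invented for illustration and are not real flanking sequences; the function name and the require_both option are likewise assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical, updatable flank library: locus -> (5' flank, 3' flank).
# These short sequences are placeholders, not real flanking sequences.
FLANKS = {
    "LOCUS_A": ("GGCTA", "TCCAG"),
    "LOCUS_B": ("CATTG", "GACCT"),
}


def assign_reads_to_loci(reads, flanks, require_both=True):
    """Group text strings by the locus whose flanks they contain.

    Text strings that match no locus are discarded.
    """
    grouped = defaultdict(list)
    for read in reads:
        for locus, (five_prime, three_prime) in flanks.items():
            has_five = five_prime in read
            has_three = three_prime in read
            if require_both:
                matched = has_five and has_three
            else:
                matched = has_five or has_three
            if matched:
                grouped[locus].append(read)
                break  # assign each read to the first matching locus only
    return dict(grouped)


example_reads = [
    "TTGGCTAAGATAGATTCCAGAA",  # contains both LOCUS_A flanks
    "CATTGTCTATCTAGACCT",      # contains both LOCUS_B flanks
    "ACGTACGT",                # matches nothing and is discarded
]
grouped_reads = assign_reads_to_loci(example_reads, FLANKS)
```

A flexible matching routine (for example, one that tolerates a mismatch in a flank) could replace the substring test without changing the grouping logic.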
After block 104, the allelotyping method 100 proceeds to block 106 in which the text strings representing nucleotide sequences associated with a particular locus are selected for further processing. In other words, blocks 106-114 of the allelotyping method 100 are performed on a locus-by-locus basis. As indicated in the illustrative embodiment of
After block 106, the allelotyping method 100 proceeds to block 108 in which each of the text strings selected in block 106 is trimmed to remove characters that do not represent the nucleotide sequence of interest. For instance, in some embodiments, all characters except those representing an STR may be removed from each of the selected text strings. In other embodiments, all characters except those representing a nucleotide sequence containing a SNP may be removed from each of the selected text strings. In the illustrative embodiment of block 108, the characters representing the 5′ and 3′ flanking nucleotide sequences (used to identify the text string in block 104), as well as any characters outside of the characters representing the 5′ and 3′ flanking nucleotide sequences, are removed from each of the selected text strings. After block 108, the text strings contain only the alleles proper, but remain identified with a particular locus (due to the grouping, metadata tagging, or other segregation procedure performed in block 104). It is contemplated that, in some embodiments of the allelotyping method 100, block 108 may be omitted, keeping the characters that do not represent the nucleotide sequence of interest but ignoring those characters in blocks 110-114.
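The trimming of block 108 can be sketched in Python as follows. The function name trim_to_allele and the example flanks are hypothetical, and the sketch assumes that the allele is bounded by the first occurrence of the 5′ flank and the first subsequent occurrence of the 3′ flank; an implementation might choose the flank occurrences differently.

```python
def trim_to_allele(read, five_prime, three_prime):
    """Return the characters between the 5' and 3' flanks, or None.

    The flanks themselves, and everything outside them, are removed.
    """
    start = read.find(five_prime)
    if start == -1:
        return None  # 5' flank not present
    start += len(five_prime)
    end = read.find(three_prime, start)
    if end == -1:
        return None  # 3' flank not present after the 5' flank
    return read[start:end]


# Placeholder flanks (not real flanking sequences) around an [AGAT]2 repeat.
trimmed = trim_to_allele("TTGGCTAAGATAGATTCCAGAA", "GGCTA", "TCCAG")
```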
After block 108, the allelotyping method 100 proceeds to block 110 in which each unique text string included in the selected text strings (associated with a particular locus) is grouped and counted. In other words, block 110 determines an abundance count for each nucleotide sequence represented by the selected text strings. In the illustrative embodiment, block 110 involves comparing the selected text strings to one another to determine each unique text string and its abundance in the group of selected text strings. Block 110 may be performed by applying a stringent text string matching routine (e.g., exact character matching) to the selected text strings. As noted above, it is contemplated that block 110 may use string matching routines built into common programming languages (e.g., Java, C++, Perl or Python), regular expression routines built into some programming languages (e.g., Perl) or available as modules for other programming languages, or the like. Block 110 may also involve sorting the unique text strings, each representing a unique nucleotide sequence read by the MPS instrument, into a list according to their abundance counts.
It will be appreciated that the exact-matching scheme in block 110 will result in an abundance count (i.e., a number of occurrences) for each even slightly different nucleotide sequence. For instance, where a text string represents a nucleotide sequence containing a SNP, the allelotyping method 100 will independently count this unique text string (and will not group it with a different text string representing a similar nucleotide sequence not exhibiting the SNP). This approach is quite different from prior alignment-based allelotyping methods, which attempt to identify which reference sequence each portion of nucleotide sequence data most resembles.
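The exact-match grouping, counting, and sorting of block 110 might be sketched in Python with the standard library's collections.Counter, as shown below. The function name abundance_counts and the toy read counts are assumptions made for illustration only.

```python
from collections import Counter


def abundance_counts(allele_strings):
    """Return (unique text string, abundance count) pairs, most abundant first.

    Counter treats two strings as the same key only if every character
    matches, so each slightly different sequence gets its own count.
    """
    return Counter(allele_strings).most_common()


# Toy data: two abundant strings plus one string with a single-base change.
selected = ["AGAT" * 9] * 5 + ["AGAT" * 10] * 4 + ["AGAT" * 8 + "AGAC"]
ranked = abundance_counts(selected)
```

Note that the string ending in AGAC is counted on its own and is not merged with the similar [AGAT]9 string, which is the behavior the exact-matching scheme calls for.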
After block 110, the allelotyping method 100 proceeds to block 112 in which the abundance count for each unique text string determined in block 110 is compared to one or more abundance thresholds. As noted, while the MPS workflow results in many unique nucleotide sequences for a particular locus, one or two of these nucleotide sequences will be the “true” alleles for that particular locus (in the case of a sample from one human). As explained further below, comparison of each of the abundance counts determined in block 110 to an abundance threshold may allow determination of the one or more alleles for that particular locus. In the illustrative embodiment, block 112 involves identifying the unique text string having the highest abundance count from among the selected text strings. The nucleotide sequence represented by this text string will be one allele for the associated locus, because despite the non-trivial error rates, PCR and MPS workflows exhibit sufficient fidelity to generate correct sequences in abundance. The illustrative embodiment of block 112 also involves calculating a ratio of the abundance count for each unique text string as compared to that highest abundance count. It may then be determined whether the ratio calculated for each unique text string exceeds the abundance threshold(s). In some embodiments of block 112, each abundance threshold may be a predetermined percentage value. For instance, the abundance threshold may be a percentage value in the range of 15% to 60%, as explained further below. In other embodiments, the abundance threshold may be a percentage value that is configurable by a user of the allelotyping method 100.
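The ratio comparison of block 112 can be sketched in Python as follows. The function name, the labels in the example dictionary, and the abundance counts are hypothetical; the threshold is expressed as a fraction (0.15 for 15%) purely as a convention of this sketch.

```python
def strings_exceeding_threshold(counts, threshold):
    """Return (text string, count, ratio) for each ratio above the threshold.

    counts: mapping of unique text string -> abundance count.
    threshold: fraction of the highest abundance count, e.g. 0.15 for 15%.
    """
    if not counts:
        return []
    highest = max(counts.values())
    kept = [
        (text, count, count / highest)
        for text, count in counts.items()
        if count / highest > threshold
    ]
    # Most abundant first; the top string always has a ratio of 1.0.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up counts: two alleles, one stutter-like artifact, one noise string.
example_counts = {"allele_9": 10000, "allele_10": 9000, "stutter": 900, "noise": 50}
at_15_percent = [text for text, _, _ in strings_exceeding_threshold(example_counts, 0.15)]
at_4_percent = [text for text, _, _ in strings_exceeding_threshold(example_counts, 0.04)]
```

With these made-up counts, the 15% threshold keeps only the two most abundant strings, while the 4% threshold also keeps the string at 9% of the highest abundance count.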
After block 112, the allelotyping method 100 proceeds to block 114 in which the one or more alleles for the particular locus being considered are determined based on the comparisons performed in block 112. By way of example, where the sample analyzed by the MPS instrument is known to contain DNA from a single human source, the abundance threshold used in block 112 may be set to a percentage value in the range of about 50% to about 60%. It is known that, for sister alleles at a heterozygous locus, the lesser abundant allele will typically have a ratio of about 50-60% or greater to the more abundant allele (i.e., the allele represented by the text string with the highest abundance count). As such, if a second text string is determined in block 112 to have an abundance count exceeding about 50-60% of the highest abundance count, block 114 may conclude that the locus is heterozygous and that the second text string represents the lesser abundant sister allele. Alternatively, if no text string is determined in block 112 to have an abundance count exceeding about 50-60% of the highest abundance count, block 114 may conclude that the locus is homozygous and that the text string with the highest abundance count represents the only allele at that particular locus. As another example, in cases where it is unknown if the sample analyzed by the MPS instrument contained DNA from multiple sources, the abundance threshold used in block 112 may be set to a percentage value of about 15% (or higher). This abundance threshold may capture alleles from a secondary source that is present in a lower amount in the sample, while avoiding detection of artifacts caused by PCR stutter (which tend to be about 4-15% of the corresponding allele abundance count). In yet another embodiment, the abundance threshold might be set to a percentage value of about 4% to capture PCR stutter artifacts but avoid background noise (which tends to be <4% of the highest abundance count).
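For the single-source case described above, the determination of block 114 might be sketched in Python as shown below. The function name and the example counts are hypothetical. The "unresolved" outcome for more than two candidate strings is an assumption of this sketch only; the disclosure does not specify how such a case would be handled.

```python
def call_single_source_locus(counts, threshold=0.5):
    """Return (zygosity, alleles) for one locus of a single-source sample.

    counts: mapping of unique text string -> abundance count.
    threshold: fraction of the highest abundance count (about 0.5 to 0.6).
    """
    if not counts:
        raise ValueError("no text strings were selected for this locus")
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    top_count = ranked[0][1]
    alleles = [text for text, count in ranked if count / top_count > threshold]
    if len(alleles) == 1:
        zygosity = "homozygous"
    elif len(alleles) == 2:
        zygosity = "heterozygous"
    else:
        # Assumption of this sketch: flag the locus rather than guess.
        zygosity = "unresolved"
    return zygosity, alleles


# Made-up counts for two loci of a single-source sample.
het_call = call_single_source_locus({"AGAT" * 9: 10000, "AGAT" * 10: 9000, "AGAT" * 8: 900})
hom_call = call_single_source_locus({"AGAT" * 9: 10000, "AGAT" * 8: 900})
```

For a sample that may contain DNA from multiple sources, the same comparison could be run with a lower threshold (about 0.15 in the example given above), in which case more than two strings exceeding the threshold would be an expected outcome rather than an exceptional one.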
Referring now to
The method 400 is intended for use in the forensic DNA analysis process to create an SEF file 410 that may serve as an evidentiary record. As shown in
The SEF file creator 408 may perform processing on the raw nucleotide sequence data ingested from the FASTQ file 404 when generating the SEF file 410. In the illustrative embodiment of the method 400, the SEF file creator 408 performs the allelotyping method 100 on the nucleotide sequence data in the FASTQ file 404. In this embodiment, block 102 of the allelotyping method 100 involves receiving the FASTQ file 404, which contains a number of text strings that each represent a nucleotide sequence read by the MPS instrument 402. Upon completion of the allelotyping method 100, the SEF file creator 408 may write each unique text string determined in block 112 of the allelotyping method 100 to exceed the abundance threshold to the SEF file 410. In some embodiments, the SEF file creator 408 may write each unique text string determined in block 114 of the allelotyping method 100 to represent an allele for a particular locus to the SEF file 410.
It will be appreciated that, as a result of the allelotyping method 100, the SEF file 410 generated by the SEF file creator 408 may have a file size that is significantly smaller than that of the FASTQ file 404. While the FASTQ file 404 contains an individual, four-line record for every read performed by the MPS instrument 402, the SEF file 410 contains only one four-line record for each “true” allele in the sample (and any other nucleotide sequences of interest). As such, using the allelotyping method 100, the SEF file creator 408 is able to “compress” the FASTQ file 404 into the SEF file 410 while retaining all of the information important for forensic analysis. Table 1 below sets forth five examples of FASTQ files 404 that were processed by the SEF file creator 408 to generate SEF files 410. As can be seen from these five examples, the allelotyping method 100 can produce an SEF file 410 with a file size that is at least 10,000 times smaller than the file size of the source FASTQ file 404.
Several records of one illustrative embodiment of an SEF file 410 are illustrated in
Similar to FASTQ files, the SEF file 410 of
In some embodiments, the forensic metadata included in the first text line of each record of the SEF file 410 may be in the form of one or more attribute-value pairs 500. In the illustrative embodiment of
In the illustrative embodiment of
While certain illustrative embodiments have been described in detail in the figures and the foregoing description, such an illustration and description is to be considered as exemplary and not restrictive in character, it being understood that only illustrative embodiments have been shown and described and that all changes and modifications that come within the spirit of the disclosure are desired to be protected. There are a plurality of advantages of the present disclosure arising from the various features of the methods, systems, and articles described herein. It will be noted that alternative embodiments of the methods, systems, and articles of the present disclosure may not include all of the features described yet still benefit from at least some of the advantages of such features. Those of ordinary skill in the art may readily devise their own implementations of the methods, systems, and articles that incorporate one or more of the features of the present disclosure.
Claims
1. An allelotyping method comprising:
- selecting a plurality of text strings that each represent a nucleotide sequence that was read by a massively parallel sequencing (MPS) instrument, wherein the nucleotide sequences represented by the selected plurality of text strings each correspond to a particular locus;
- comparing the selected plurality of text strings to one another to determine an abundance count for each unique text string included in the selected plurality of text strings; and
- determining one or more alleles for the particular locus by comparing the abundance count for each unique text string included in the selected plurality of text strings to an abundance threshold.
2. The allelotyping method of claim 1, wherein determining the one or more alleles for the particular locus comprises identifying one or more unique text strings that each represent a nucleotide sequence containing a short tandem repeat (STR).
3. The allelotyping method of claim 1, wherein determining the one or more alleles for the particular locus comprises identifying one or more unique text strings that each represent a nucleotide sequence containing a single nucleotide polymorphism (SNP).
4. The allelotyping method of claim 1, wherein comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold comprises identifying the unique text string having a highest abundance count from among the selected plurality of text strings.
5. The allelotyping method of claim 4, wherein comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold further comprises calculating whether a ratio of the abundance count for each unique text string compared to the highest abundance count exceeds the abundance threshold.
6. The allelotyping method of claim 5, wherein the abundance threshold is a percentage value in the range of 15% to 60%.
7. The allelotyping method of claim 5, wherein the abundance threshold is a percentage value configurable by a user.
8. The allelotyping method of claim 5, further comprising:
- receiving a first text-based computer file comprising a plurality of text strings that each represent a nucleotide sequence that was read by the MPS instrument, prior to selecting the plurality of text strings for which the represented nucleotide sequences each correspond to the particular locus; and
- generating a second text-based computer file comprising each unique text string for which the ratio exceeds the abundance threshold, wherein a file size of the second text-based computer file is smaller than a file size of the first text-based computer file.
9. The allelotyping method of claim 8, wherein the second text-based computer file comprises one or more unique text strings that each represent a nucleotide sequence determined to be an allele for the particular locus.
10. The allelotyping method of claim 8, wherein the file size of the second text-based computer file is at least ten-thousand times smaller than the file size of the first text-based computer file.
11. The allelotyping method of claim 1, wherein the steps of (i) selecting the plurality of text strings for which the represented nucleotide sequences each correspond to a particular locus, (ii) comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings, and (iii) determining one or more alleles for the particular locus by comparing the abundance counts to the abundance threshold are performed for each of a plurality of loci present in a sample that was read by the MPS instrument.
12. The allelotyping method of claim 11, wherein selecting the plurality of text strings for which the represented nucleotide sequences each correspond to one of the plurality of loci comprises:
- determining whether each of a plurality of text strings generated by the MPS instrument when reading the sample includes text characters that represent a flanking nucleotide sequence associated with a particular locus; and
- selecting a plurality of text strings that include the text characters that represent the flanking nucleotide sequence associated with the particular locus.
13. The allelotyping method of claim 12, further comprising removing the text characters that represent the flanking nucleotide sequence from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings.
14. The allelotyping method of claim 1, further comprising removing all text characters that do not represent a short tandem repeat (STR) from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings.
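By way of illustration only, the steps recited in claims 1, 4, 5, 12, and 13 may be sketched in Python as shown below. This is a minimal sketch and not the implementation of the disclosure. The function names, the flanking sequences, the example reads, and the 30% default threshold (chosen from within the 15% to 60% range of claim 6) are all assumptions introduced for this example.

```python
from collections import Counter


def select_locus_reads(reads, left_flank, right_flank):
    """Select reads that contain both flanking sequences (claim 12) and
    keep only the text between the flanks (claim 13).

    Hypothetical helper: the flanks are assumed to be matched exactly.
    """
    selected = []
    for read in reads:
        left = read.find(left_flank)
        if left == -1:
            continue  # read does not belong to this locus
        start = left + len(left_flank)
        end = read.find(right_flank, start)
        if end == -1:
            continue
        selected.append(read[start:end])
    return selected


def call_alleles(strings, threshold=0.30):
    """Count each unique string (claim 1), find the highest count (claim 4),
    and keep strings whose count relative to the highest count exceeds the
    threshold (claim 5). The 0.30 default is an assumed value."""
    counts = Counter(strings)
    if not counts:
        return {}
    highest = max(counts.values())
    return {s: n for s, n in counts.items() if n / highest > threshold}


# Made-up reads for one hypothetical locus with invented flanks.
LEFT, RIGHT = "ACGT", "TTGA"
reads = (
    ["GG" + LEFT + "AGAT" * 3 + RIGHT + "CC"] * 5    # abundant allele
    + ["GG" + LEFT + "AGAT" * 4 + RIGHT + "CC"] * 4  # second allele
    + ["GG" + LEFT + "AGAT" * 2 + RIGHT + "CC"] * 1  # low-abundance string
    + ["CCCCCCCCCCCC"]                               # no flanks, ignored
)
alleles = call_alleles(select_locus_reads(reads, LEFT, RIGHT))
```

In this example the two abundant strings have ratios of 1.0 and 0.8 to the highest count and are retained, while the string seen once has a ratio of 0.2 and falls below the assumed threshold.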
15. A computer-readable medium storing a text-based computer file, the text-based computer file comprising:
- one or more records, each of the one or more records including: a first text line; a second text line comprising a text string representing a nucleotide sequence containing a single nucleotide polymorphism (SNP); a third text line comprising a human-readable allele designation for the SNP; and a fourth text line.
16. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the human-readable allele designation of the third text line comprises a number of attribute-value pairs that specify a first SNP state, a second SNP state, an abundance count of the first SNP state, an abundance count of the second SNP state, and a strand of the nucleotide sequence represented in the second text line.
17. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the human-readable allele designation of the third text line further comprises an attribute-value pair specifying a reference SNP identifier associated with the nucleotide sequence represented in the second text line.
18. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the first text line comprises a unique sequence identifier created by a massively parallel sequencing (MPS) instrument when generating data related to the nucleotide sequence represented in the second text line.
19. The computer-readable medium of claim 18, wherein, for each of the one or more records of the text-based file, the first text line further comprises forensic metadata specifying one or more of a file format, a unique case identifier, a unique sample identifier, a unique laboratory identifier, and a unique technician identifier.
20. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the fourth text line comprises a text string representing quality scores associated with the nucleotide sequence represented in the second text line.
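By way of illustration only, one possible in-memory representation of the four-line record recited in claims 15 to 20 is sketched below in Python. The claims do not specify a concrete syntax, so the class name, the attribute names (`rsid`, `state1`, `state2`, `count1`, `count2`, `strand`), the `;` and `=` separators, and all example values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class SnpRecord:
    """One four-line record (claim 15). Field names are assumptions."""
    sequence_id: str   # first line: instrument-created identifier (claim 18)
    sequence: str      # second line: sequence containing the SNP
    designation: dict  # third line: attribute-value pairs (claims 16, 17)
    quality: str       # fourth line: quality scores (claim 20)

    def to_text(self):
        # Hypothetical serialization: "key=value" pairs joined by ";".
        pairs = ";".join(f"{k}={v}" for k, v in self.designation.items())
        return "\n".join([self.sequence_id, self.sequence, pairs, self.quality])


def parse_record(text):
    """Parse the four text lines back into a SnpRecord."""
    first, second, third, fourth = text.split("\n")
    designation = dict(pair.split("=", 1) for pair in third.split(";"))
    return SnpRecord(first, second, designation, fourth)


# Placeholder values only; none of these come from real data.
record = SnpRecord(
    sequence_id="read-0001",
    sequence="ACGTTGCA",
    designation={
        "rsid": "rs_placeholder",
        "state1": "A",
        "state2": "G",
        "count1": "120",
        "count2": "98",
        "strand": "+",
    },
    quality="IIIIIIII",
)
text = record.to_text()
```

Because the designation is stored as plain text, all attribute values are kept as strings here, so that a record written with `to_text` and read back with `parse_record` compares equal to the original.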
Type: Application
Filed: Oct 11, 2022
Publication Date: Jun 22, 2023
Inventors: Brian A. Young (Columbus, OH), Angela T. Minard-Smith (Marysville, OH), Esley M. Heizer, Jr. (Galloway, OH), Daniel M. Bornman (Powell, OH), Mark E. Hester (Reynoldsburg, OH), Boyu Yang (Charlottesville, VA)
Application Number: 17/963,732