Systems and Methods for Detection of Aneuploidy

Info

Publication number: 20180173846
Type: Application
Filed: Feb 2, 2018
Publication Date: Jun 21, 2018
Applicant: Natera, Inc. (San Carlos, CA)
Inventors: Styrmir SIGURJONSSON (San Jose, CA), Naresh VANKAYALAPATI (San Francisco, CA), Allison RYAN (Belmont, CA), Zachary DEMKO (San Francisco, CA), Milena BANJEVIC (Los Altos Hills, CA)
Application Number: 15/887,914

Abstract

Provided herein are improved methods for detecting aneuploidy in a sample. The methods in certain embodiments are used for the analysis of circulating DNA in serum samples, such as circulating fetal DNA or circulating tumor DNA. In certain embodiments, chromosomes or chromosome segments of interest are used to set a bias model and/or a control value for a z-score determination, in illustrative examples without the use of a control chromosome.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. Utility application Ser. No. 14/732,632, filed on Jun. 5, 2015. U.S. Utility application Ser. No. 14/732,632 claims the benefit of and priority to U.S. Provisional Application Ser. No. 62/008,235, filed Jun. 5, 2014; U.S. Provisional Application Ser. No. 62/032,785, filed Aug. 4, 2014; and U.S. Provisional Application Ser. No. 62/079,257, filed Nov. 13, 2014. The entireties of all these applications are each hereby incorporated by reference for the teachings therein.

FIELD OF THE INVENTION

The present invention generally relates to molecular biology methods and systems, and more specifically to methods and systems for detecting ploidy of a chromosome segment.

BACKGROUND

Measurement of the number of copies of a chromosome or chromosome segment in a cell of interest is an important technique in molecular biology. The technique has wide applicability in fields such as prenatal diagnosis and the analysis of cancer cells. Older techniques such as karyotyping are being supplanted by techniques employing high levels of DNA sequencing. For example, such techniques can be used to detect copy number variation (CNV).

Copy number variation (CNV) has been identified as a major cause of structural variation in the genome, involving both duplications and deletions of sequences that typically range in length from 1,000 base pairs (1 kb) to 20 megabases (Mb). Deletions and duplications of chromosome segments or entire chromosomes are associated with a variety of conditions, such as susceptibility or resistance to disease.

CNVs are often assigned to one of two main categories, based on the length of the affected sequence. The first category includes copy number polymorphisms (CNPs), which are common in the general population, occurring with an overall frequency of greater than 1%. CNPs are typically small (most are less than 10 kilobases in length), and they are often enriched for genes that encode proteins important in drug detoxification and immunity. A subset of these CNPs is highly variable with respect to copy number. As a result, different human chromosomes can have a wide range of copy numbers (e.g., 2, 3, 4, 5, etc.) for a particular set of genes. CNPs associated with immune response genes have recently been associated with susceptibility to complex genetic diseases, including psoriasis, Crohn's disease, and glomerulonephritis.

The second class of CNVs includes relatively rare variants that are much longer than CNPs, ranging in size from hundreds of thousands of base pairs to over 1 million base pairs in length. In some cases, these CNVs may have arisen during production of the sperm or egg that gave rise to a particular individual, or they may have been passed down for only a few generations within a family. These large and rare structural variants have been observed disproportionately in subjects with mental retardation, developmental delay, schizophrenia, and autism. Their appearance in such subjects has led to speculation that large and rare CNVs may be more important in neurocognitive diseases than other forms of inherited mutations, including single nucleotide substitutions.

Gene copy number can be altered in cancer cells. For instance, duplication of Chr1p is common in breast cancer, and the EGFR copy number can be higher than normal in non-small cell lung cancer. Cancer is one of the leading causes of death; thus, early diagnosis and treatment of cancer is important, since it can improve the patient's outcome (such as by increasing the probability of remission and the duration of remission). Early diagnosis can also allow the patient to undergo fewer or less drastic treatment alternatives. Many of the current treatments that destroy cancerous cells also affect normal cells, resulting in a variety of possible side-effects, such as nausea, vomiting, low blood cell counts, increased risk of infection, hair loss, and ulcers in mucous membranes. Thus, early detection of cancer is desirable since it can reduce the amount and/or number of treatments (such as chemotherapeutic agents or radiation) needed to eliminate the cancer.

Copy number variation has also been associated with severe mental and physical handicaps, and idiopathic learning disability. Non-invasive prenatal testing (NIPT) using cell-free DNA (cfDNA) can be used to detect abnormalities, such as fetal trisomies 13, 18, and 21, triploidy, and sex chromosome aneuploidies. Subchromosomal microdeletions, which can also result in severe mental and physical handicaps, are more challenging to detect due to their smaller size. Eight of the microdeletion syndromes have an aggregate incidence of more than 1 in 1000, making them nearly as common as fetal autosomal trisomies.

In addition, a higher copy number of CCL3L1 has been associated with lower susceptibility to HIV infection, and a low copy number of FCGR3B (the CD16 cell surface immunoglobulin receptor) can increase susceptibility to systemic lupus erythematosus and similar inflammatory autoimmune disorders.

Thus, improved methods are needed to detect deletions and duplications of chromosome segments or entire chromosomes. Preferably, these methods can be used to more accurately diagnose disease or an increased risk of disease, such as cancer or CNVs in a gestating fetus.

In many clinical trials concerning a diagnostic that employs molecular biology, for example for detecting CNVs, a protocol with a number of parameters is set, and then the same protocol is executed with the same parameters for each of the patients in the trial. In the case of determining the ploidy status of a fetus gestating in a mother using sequencing as a method to measure genetic material, one pertinent parameter is the number of reads. The number of reads may refer to the number of actual reads, the number of intended reads, fractional lanes, full lanes, or full flow cells on a sequencer. In these studies, the number of reads is typically set at a level that will ensure that all or nearly all of the samples achieve the desired level of accuracy. Sequencing is currently an expensive technology, with a cost of roughly $200 per 5 million mappable reads, and while the price is dropping, any method which allows a sequencing-based diagnostic to operate at a similar level of accuracy but with fewer reads will necessarily save a considerable amount of money.

Accordingly, there is a need for new improved techniques for the determination of aneuploidy in a chromosome or chromosome segment of interest, especially by employing DNA sequencing in a more accurate and cost-effective manner by reducing the required number of reads. This will bring down the cost of such molecular diagnostics, resulting in better diagnostics that are available to more people. The improved techniques would, for example, be particularly valuable in the analysis of cell free DNA derived from fetal cells or tumor cells to provide improved prenatal and cancer diagnostics.

SUMMARY

Provided herein in one embodiment are methods and systems for determining the copy number, or detecting aneuploidy of a chromosome or chromosome segment of interest in a cell of interest that are performed using the chromosome or chromosome segment of interest to set a bias model, that is to set test parameters, using samples analyzed in the same parallel analysis, that are identified as diploid samples with high confidence, for the analysis of aneuploidy for the same chromosome or chromosome segment of interest of other sample(s) in the set of on-test samples. Accordingly, in one example of this embodiment, provided herein is a method for determining a presence or absence of aneuploidy of a chromosome or chromosome segment of interest in a test sample, that includes the following steps:

- a) obtaining genetic data for the chromosome or chromosome segment of interest from each sample of a set of samples that includes the test sample and at least one diploid sample, wherein the genetic data is obtained from a parallel analysis of the set of samples;
- b) setting a bias model using the genetic data for the chromosome or chromosome segment of interest in the diploid sample determined to be disomic for the chromosome or chromosome segment of interest;
- c) adjusting the genetic data for the chromosome or chromosome segment of interest for the test sample using the bias model; and
- d) establishing the presence or absence of aneuploidy for the chromosome or chromosome segment of interest in the test sample using the normalized data.
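Steps a) through d) can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the method of the disclosure: it assumes the genetic data are per-locus read counts plus a total read count per sample, models the bias as the average per-locus read fraction across the diploid samples, and flags a deviation with an arbitrary tolerance. The names `fit_bias_model`, `normalize`, `call_aneuploidy`, and `tolerance`, and all numbers, are hypothetical.

```python
from statistics import mean


def locus_fractions(locus_counts, total_reads):
    """Depth of read at each locus, normalized over the sample's total reads."""
    return [count / total_reads for count in locus_counts]


def fit_bias_model(diploid_samples):
    """Step b: expected read fraction per locus, averaged over diploid samples.

    diploid_samples is a list of (locus_counts, total_reads) pairs taken from
    the same parallel run as the test sample (an assumption of this sketch).
    """
    fractions = [locus_fractions(counts, total) for counts, total in diploid_samples]
    # zip(*fractions) regroups the values locus by locus.
    return [mean(per_locus) for per_locus in zip(*fractions)]


def normalize(locus_counts, total_reads, bias_model):
    """Step c: observed fraction divided by the fraction the bias model expects."""
    observed = locus_fractions(locus_counts, total_reads)
    return [o / e for o, e in zip(observed, bias_model)]


def call_aneuploidy(normalized, tolerance=0.05):
    """Step d: flag the sample if the mean normalized depth departs from 1."""
    return abs(mean(normalized) - 1.0) > tolerance


# Toy data: three loci on the segment of interest, 10,000 total reads per sample.
diploid = [([100, 200, 50], 10_000), ([110, 190, 50], 10_000)]
model = fit_bias_model(diploid)

test_disomic = normalize([105, 195, 50], 10_000, model)
test_gain = normalize([116, 214, 55], 10_000, model)

print(call_aneuploidy(test_disomic))
print(call_aneuploidy(test_gain))
```

Because the bias model is fit on the same loci in the same run, locus-specific effects (for example, amplification efficiency) cancel in the ratio. In practice the size of the expected deviation depends on the fraction of DNA from the target cell, which this sketch ignores.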

In certain illustrative examples of this embodiment, the at least one diploid sample is determined to be disomic for the chromosome or chromosome segment of interest by analyzing the genetic data from the parallel analysis. In certain illustrative examples, the diploid sample is determined to be disomic (i.e., selected as being disomic) for the chromosome or chromosome segment of interest without using a control chromosome or control chromosome segment.

In certain examples of this embodiment of the invention, one or two maximum likelihood analyses are used to carry out the method. As disclosed above, the first maximum likelihood method can be used to identify diploid samples in the set of samples and to determine a first probability that the other samples in the set of samples are aneuploid. Accordingly, in certain embodiments, one or more or all of the chromosome(s) or chromosome segment(s) of interest are determined to be disomic using a first maximum likelihood method. The method includes the following steps: creating, for each sample in the set of samples, a plurality of first hypotheses wherein each first hypothesis is associated with a specific copy number for the chromosome or chromosome segment of interest; determining a first probability value for each first hypothesis, wherein the first probability value indicates the likelihood that the sample has the number of copies of the chromosome or chromosome segment that is associated with the first hypothesis, wherein the first probability values are derived from the genetic data associated with the sample; and selecting at least one diploid sample by selecting those one or more samples that most closely match a disomic copy number hypothesis for the chromosome or chromosome segment of interest, with at least a minimum level of confidence. That is, the diploid samples are selected as those samples that yield the highest probability of being disomic for the chromosome or chromosome segment of interest.
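The selection of diploid samples by a first maximum likelihood method can be illustrated with a short sketch. The likelihood model here is an assumption made for illustration only: each sample is summarized by a single normalized depth ratio, the expected ratio under a copy number n is taken to be 1 + f(n - 2)/2 for a target-cell DNA fraction f, and measurement noise is treated as Gaussian with a fixed, hypothetical standard deviation `sigma`. The function names, the flat prior, and the 0.99 confidence threshold are likewise hypothetical.

```python
import math


def hypothesis_probabilities(ratio, fraction, sigma=0.01, copy_numbers=(1, 2, 3)):
    """Probability of each copy number hypothesis for one sample.

    ratio    -- normalized depth of read for the segment of interest
    fraction -- fraction of DNA from the target cell (e.g. fetal fraction)
    """
    log_likelihoods = {}
    for n in copy_numbers:
        expected = 1.0 + fraction * (n - 2) / 2.0
        z = (ratio - expected) / sigma
        log_likelihoods[n] = -0.5 * z * z  # Gaussian log-likelihood up to a constant
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_likelihoods.values())
    weights = {n: math.exp(ll - top) for n, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def select_diploid(samples, min_confidence=0.99):
    """Keep samples whose disomy hypothesis reaches the minimum confidence."""
    selected = []
    for name, ratio, fraction in samples:
        if hypothesis_probabilities(ratio, fraction)[2] >= min_confidence:
            selected.append(name)
    return selected


# Toy run of three samples: (name, normalized depth ratio, target-cell fraction).
run = [("s1", 1.001, 0.10), ("s2", 1.048, 0.10), ("s3", 0.998, 0.08)]
diploid_names = select_diploid(run)
print(diploid_names)
```

With these toy numbers, s2 sits close to the ratio expected for three copies and is excluded, while s1 and s3 are retained as high-confidence diploid samples that could then be used to set the bias model.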

In certain embodiments, the method includes at least two maximum likelihood analyses, and the presence or absence of aneuploidy, or the number of copies of the chromosome or chromosome segment of interest, is determined by creating a plurality or set of second hypotheses, also called ploidy hypotheses herein, wherein each second hypothesis is associated with a specific copy number of the chromosome or chromosome segment of interest in the target cell. The models are then used to test how well the genetic data from each patient fits each second hypothesis. The goodness of fit for each second hypothesis is determined. A second probability value is calculated for each second hypothesis, wherein the second probability value indicates the likelihood that the genome of the target cell has the number of chromosomes or chromosome segments that is specified by the second hypothesis. Thus, by selecting the second hypothesis with the maximum likelihood, one may determine the copy number for the chromosome or chromosome segment in the genome of the target cell. Such first and second hypotheses can be considered in combination to increase the confidence of the aneuploidy determination.

In another embodiment, provided herein is a method for determining a presence or absence of aneuploidy for a first chromosome or chromosome segment of interest in a test sample from a test subject, that includes the following steps: obtaining genetic sequencing data from a parallel analysis of the first chromosome or chromosome segment of interest from cell free DNA from each sample in a set of liquid samples comprising the test sample, wherein the set of liquid samples comprises at least 3 samples and wherein the genetic sequencing data determines an amount of DNA corresponding to each locus in a first set of loci present on the first chromosome or chromosome segment of interest respectively; selecting a diploid subset of samples from the set of liquid samples, wherein the diploid subset of samples are samples that are initially determined to be disomic for the first chromosome or chromosome segment of interest using an initial bias model, wherein the subset of samples comprises at least 2 samples; setting a confirmatory bias model from the genetic data from the first chromosome or chromosome segment of interest from the diploid subset of patients; adjusting the genetic data for the test subject using the confirmatory bias model, to give normalized genetic data for the test subject; and determining, using the normalized data, whether genetic data from the test subject is indicative of an aneuploidy in the first chromosome or chromosome segment of interest.

In another embodiment, a method of the invention includes both a non-allelic z-score based quantitative method and a maximum likelihood method based on allelic or non-allelic data. Accordingly, provided herein is a method for detecting a presence or absence of aneuploidy of a chromosome or chromosome segment of interest in a test sample, that includes the following steps: obtaining genetic data for the chromosome or chromosome segment of interest from each sample in a set of samples comprising the test sample, wherein the genetic data is obtained from a parallel analysis of the samples; determining whether aneuploidy is present in the test sample by a first method comprising:

- a. determining a depth of reads or a proportion of reads that map to the chromosome or chromosome segment of interest;
- b. calculating a z-score for the depth of reads or the proportion of reads that map to the chromosome or chromosome segment of interest; and
- c. determining whether the test sample is aneuploid at the chromosome or chromosome segment of interest based on the z-score, thereby providing a first result;

and determining whether aneuploidy is present in the test sample by a second method comprising:

- d. creating a plurality of ploidy hypotheses wherein each ploidy hypothesis is associated with a specific copy number for the chromosome or chromosome segment of interest,
- e. determining a ploidy probability value for each ploidy hypothesis, wherein the ploidy probability value indicates the likelihood that the test sample has the specific copy number for the chromosome or chromosome segment of interest that is associated with the ploidy hypothesis, and
- f. determining which ploidy hypothesis is most likely to be correct by selecting the ploidy hypothesis with the maximum likelihood, thereby providing a second result; and

detecting the aneuploidy by considering the first result and the second result.
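The two-method determination in steps a through f can be sketched as follows. This is an illustration only, under several assumptions not stated in the text: the control value for the z-score is taken as the mean and sample standard deviation of the read proportions of diploid samples from the same run, the z-score cutoff of 3 is arbitrary, the ploidy probabilities are supplied as a ready-made dictionary, and the rule for "considering" both results (report a call only when they agree) is a hypothetical choice. All function names are invented for the example.

```python
from statistics import mean, stdev


def z_score(test_proportion, diploid_proportions):
    """Steps a-b: z-score of the test sample's read proportion for the segment."""
    mu = mean(diploid_proportions)
    sd = stdev(diploid_proportions)
    return (test_proportion - mu) / sd


def first_result(z, z_cutoff=3.0):
    """Step c: True if the z-score suggests aneuploidy."""
    return abs(z) > z_cutoff


def second_result(ploidy_probabilities):
    """Steps d-f: True if the maximum likelihood hypothesis is not two copies."""
    best = max(ploidy_probabilities, key=ploidy_probabilities.get)
    return best != 2


def combine(first, second):
    """Consider both results; this sketch only makes a call when they agree."""
    if first and second:
        return "aneuploid"
    if not first and not second:
        return "disomic"
    return "no call"


# Toy read proportions for the segment of interest in diploid samples of the run.
diploid_props = [0.0300, 0.0302, 0.0298, 0.0301, 0.0299]

z_test = z_score(0.0310, diploid_props)
call = combine(first_result(z_test), second_result({1: 0.00, 2: 0.02, 3: 0.98}))
print(round(z_test, 2), call)
```

Requiring agreement trades sensitivity for specificity; other combination rules, such as multiplying probabilities from the two methods, would fit the same description.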

In certain illustrative examples of the above embodiments, the sample is a liquid sample, such as a serum sample. The genetic data, in these examples, can be derived from circulating DNA, such as circulating fetal DNA or circulating tumor DNA.

In certain examples of any of the above embodiments, the method further includes estimating a fetal fraction for each sample in the set of samples, wherein the fetal fraction is used in selecting the diploid subset of samples and/or in determining whether the genetic data from the test subject is indicative of an aneuploidy.

BRIEF DESCRIPTION OF THE DRAWINGS

The presently disclosed embodiments will be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.

FIG. 1 is a flow chart of a method according to one embodiment of the invention.

FIG. 2 shows an example system architecture 200 useful for performing embodiments of the present invention.

FIG. 3 illustrates an example computer system for performing embodiments of the present invention.

While the above-identified drawings set forth presently disclosed embodiments, other embodiments are also contemplated, as noted in the discussion. This disclosure presents illustrative embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art which fall within the scope and spirit of the principles of the presently disclosed embodiments.

DETAILED DESCRIPTION

Some embodiments of the present invention utilize the fact that, in a typical analysis for aneuploidy, a set of blood samples, for example a set of blood samples from pregnant mothers in an NIPT assay or from cancer patients in an analysis of circulating free tumor DNA (cfDNA), is run in parallel in the same set of assays, wherein much and probably most of the DNA in the samples originates from diploid cells. Thus, on-test samples can be identified that are diploid with high confidence and used as controls for analysis of other samples in the set of samples, without the need for running additional control samples. Moreover, in many embodiments of the present invention, the chromosome or chromosome segment of interest in test samples that are initially identified as diploid with high confidence is used as a control for subsequent methods that analyze the rest of the set of samples for aneuploidy at that same chromosome(s) or chromosome segment(s) of interest. This substantially reduces the cost of such analysis and substantially improves the confidence in the analysis, since the on-test and control data come from the same parallel analysis of the same chromosome or same chromosome segment of interest.

Accordingly, provided herein are numerous methods and systems for determining the copy number, or detecting aneuploidy of a chromosome or chromosome segment of interest in a cell of interest that are performed using the chromosome or chromosome segment of interest to set a bias model, that is to set test parameters, or to set a threshold cutoff value, using samples analyzed in the same parallel analysis, that are identified as diploid samples with high confidence, for the analysis of aneuploidy for the same chromosome or chromosome segment of interest of other sample(s) in the set of on-test samples. The subject methods and systems can employ high throughput DNA sequencers (capable of sequencing large numbers of DNA templates in parallel) so as to produce quantitative information about the amount of various DNA sequences of interest in a set of samples obtained from a test subject. This quantitative sequence information can be used to determine the copy number of a chromosome or chromosome segment of interest in a cell of interest, e.g., a cell of a developing fetus or a tumor cell.

The term “depth of read” as used herein refers to the number of sequencing reads that map to a given locus. The depth of read may be normalized over the total number of reads. When “depth of read” refers to a sample, it may mean the average depth of read over the targeted loci. When “depth of read” refers to a locus, it may refer to the number of reads measured by the sequencer mapping to that locus. In general, the greater the depth of read of a locus, the closer the ratio of alleles at the locus will tend to be to the ratio of alleles in the original sample of DNA.

The term “relative fraction” can also be used to express a similar concept as depth of read. Depth of read can be expressed in a variety of different ways, including but not limited to the percentage or proportion. Thus, for example, in a highly parallel DNA sequencer such as an Illumina HISEQ, which for example would produce sequence of 1 million clones, the sequencing of one locus 3,000 times would result in a depth of read of 3,000 reads at that locus. The proportion of reads at that locus would be 3,000 divided by 1 million total reads, or 0.3% of the total reads.
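The arithmetic in the example above, together with the sample-level meaning of depth of read (the average over the targeted loci), can be written out as a small sketch. The function names and the toy per-locus counts are hypothetical.

```python
from statistics import mean


def read_proportion(reads_at_locus, total_reads):
    """Depth of read at one locus expressed as a proportion of all reads."""
    return reads_at_locus / total_reads


def sample_depth_of_read(reads_per_locus):
    """Depth of read for a sample: the average depth over the targeted loci."""
    return mean(reads_per_locus)


# The example from the text: one locus sequenced 3,000 times out of 1 million reads.
proportion = read_proportion(3_000, 1_000_000)
print(f"{proportion:.1%}")

# Toy targeted panel of four loci (hypothetical counts).
average_depth = sample_depth_of_read([3_000, 2_500, 3_500, 3_000])
print(average_depth)
```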

The term “allelic data” as used herein means a quantitative measurement indicative of the number of copies of a specific allele of a polymorphic locus. Typically, quantitative measurements will be obtained for all possible alleles of the polymorphic locus of interest. In some embodiments, the polymorphic locus is a SNP, and the SNP is dimorphic, and the allelic data will comprise the quantity of each of the two alleles observed at that locus. In some embodiments, the polymorphic locus is a SNP, and the SNP is trimorphic or tetramorphic, and the allelic data will comprise the quantity of each of the three or four alleles observed at that locus. The allelic data may be obtained using a variety of well-known molecular biology techniques such as DNA sequencing or real-time PCR. High throughput DNA sequencing, in which the number of individual reads of a given locus is obtained, can be used to obtain allelic data. When the allelic data is measured using high-throughput sequencing, the allelic data will typically comprise the number of reads of each allele mapping to the locus of interest.

The term “non-allelic data” as used herein means a quantitative measurement indicative of the number of copies of a specific locus. The locus may be polymorphic or non-polymorphic. If the locus is non-polymorphic, the non-allelic data will not contain information about the relative or absolute quantity of the individual alleles that may be present at that locus. Typically, quantitative measurements will be obtained for all possible alleles of the polymorphic locus of interest. The allelic data may be obtained using a variety of well-known molecular biology techniques such as DNA sequencing or real-time PCR. High throughput DNA sequencing, in which the number of individual reads of a given locus is obtained, can be used to obtain allelic data. Non-allelic data for a polymorphic locus may be obtained by summing the quantitative allelic data for each allele at that locus. When the allelic data is measured using high-throughput sequencing, the non-allelic data will typically comprise the number of reads mapping to the locus of interest. The sequencing measurements could indicate the relative and/or absolute number of each of the alleles present at the locus, and the non-allelic data would comprise the sum of the reads, regardless of the allelic identity, mapping to the locus. Note that it is possible to measure the DNA at a plurality of loci, for example using high throughput sequencing, to yield allelic data; it is then possible, by summing the number of reads that correspond to each allele, at each locus, to produce non-allelic data. In some embodiments, the same set of measurements can be used to yield both allelic data and non-allelic data. In some embodiments, the produced allelic data can be used as part of a method to determine copy number at a chromosome of interest, and the produced non-allelic data can be used as part of a method to determine copy number at a chromosome of interest, where the two methods are statistically orthogonal.
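The relationship between the two kinds of data can be shown with a brief sketch: allelic data are held as per-allele read counts at each locus, and non-allelic data are produced by summing over alleles. The locus identifiers and counts are hypothetical, as is the function name.

```python
def to_non_allelic(allelic_data):
    """Sum the reads of every allele at each locus, ignoring allelic identity."""
    return {locus: sum(counts.values()) for locus, counts in allelic_data.items()}


# Allelic data: reads of each allele mapping to each locus (hypothetical values).
allelic_data = {
    "snp_hypothetical_1": {"A": 120, "G": 80},  # heterozygous-looking locus
    "snp_hypothetical_2": {"C": 95, "T": 0},    # homozygous-looking locus
}

non_allelic_data = to_non_allelic(allelic_data)
print(non_allelic_data)
```

The same measurements therefore feed both an allele-ratio analysis (using `allelic_data`) and a depth-based analysis (using `non_allelic_data`).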

The term “chromosomal abnormality” as used herein refers to any deviation in the copy number of a specific chromosome or chromosome segment from the most common number of copies of that segment or chromosome; for example, in a human somatic cell, any deviation from 2 copies could be regarded as a chromosomal abnormality.

The term “obtaining genetic data” as used herein refers to both, unless specifically or implicitly indicated otherwise by context, (1) acquiring DNA sequence information by laboratory techniques, e.g., use of an automated high throughput DNA sequencer, and (2) acquiring information that had been previously obtained by laboratory techniques, wherein the information is electronically transmitted, e.g., by computer over the Internet, by electronic transfer from the sequencing device, etc.

The term “target cell” as used herein refers to the cell (or cell type) that contains the chromosomes or chromosome segments that are to be quantitatively measured as a result of the subject methods. Examples of target cells include fetal cells and tumor cells. As the cells of most individuals contain a nearly identical set of nuclear DNA, the term “target cell” may be used interchangeably with the term “individual.”

The term “non-target cell” as used herein refers to a cell (or cell type) that supplies DNA that is analyzed in the process of performing the subject methods, but is not the cell that contains the chromosomes or chromosome segments that are required to be quantitatively measured as a result of the subject methods. In some embodiments, the “non-target cell” may be closely related to the “target cell”; for example, if a prostate tumor cell is the target cell, a noncancerous prostate cell from the same individual may (although not necessarily) be used as a “non-target cell”. Alternatively, in the case where the measurements are made on a mixture of cfDNA taken from a pregnant woman, the target cell could be from the placenta of a fetus gestating in the mother, and the non-target cells could be from the mother of the fetus. Typically, non-target cells are euploid, though this is not required.

Methods for measuring chromosome copy number in fetal cells based on counting the number of DNA sequence-based reads that map to a given chromosome or chromosome segment are conveniently referred to as “counting methods” or “quantitative methods” for analyzing chromosome copy number or chromosome segment copy number. Examples of such methods can be found, among other places, in published patent application US 2013/0172211 A1; U.S. Pat. No. 8,008,018; U.S. Pat. No. 8,467,976 B2; and US published patent application US 2012/0003637 A1. Such methods typically involve creation of a reference value (cut-off value) for the number of DNA sequence reads mapping to a specific chromosome, wherein a number of reads in excess of the value is indicative of a specific genetic abnormality.

Confidence refers to the statistical likelihood that the called SNP, allele, set of alleles, ploidy call, or determined number of chromosome segment copies correctly represents the real genetic state of the individual.

Ploidy Calling, also “Chromosome Copy Number Calling,” or “Copy Number Calling” (CNC), refers to the act of determining the quantity and chromosomal identity of one or more chromosomes present in a cell.

Aneuploidy refers to the state where the wrong number of chromosomes is present in a cell. In the case of a somatic human cell it refers to the case where a cell does not contain 22 pairs of autosomal chromosomes and one pair of sex chromosomes. In the case of a human gamete, it refers to the case where a cell does not contain one of each of the 23 chromosomes. In the case of a single chromosome, it refers to the case where more or less than two homologous but non-identical chromosomes are present, and where each of the two chromosomes originates from a different parent.

Ploidy State refers to the quantity and chromosomal identity of one or more chromosomes in a cell.

Allelic Data refers to a set of genotypic data concerning a set of one or more alleles. It may refer to the phased, haplotypic data. It may refer to SNP identities, and it may refer to the sequence data of the DNA, including insertions, deletions, repeats and mutations. It may include the parental origin of each allele.

Allelic Distribution refers to the distribution of the set of alleles observed at a set of loci. An allelic distribution for one locus is an allele ratio.

Allelic Distribution Pattern refers to a set of different allele distributions for different parental contexts. Certain allelic distribution patterns may be indicative of certain ploidy states.

Allelic Bias refers to the degree to which the measured ratio of alleles at a heterozygous locus is different from the ratio that was present in the original sample of DNA. The degree of allelic bias at a particular locus is equal to the observed allelic ratio at that locus, as measured, divided by the ratio of alleles in the original DNA sample at that locus. Allelic bias may be defined to be greater than one, such that if the calculation of the degree of allelic bias returns a value, x, that is less than 1, then the degree of allelic bias may be restated as 1/x.
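The definition of the degree of allelic bias can be expressed as a short calculation. The sketch below assumes both ratios are strictly positive; the function name and the example ratios are hypothetical.

```python
def allelic_bias(observed_ratio, original_ratio):
    """Degree of allelic bias: observed allele ratio over the original ratio.

    Values below 1 are restated as their reciprocal, so the result is >= 1.
    Both ratios are assumed to be strictly positive.
    """
    x = observed_ratio / original_ratio
    return 1.0 / x if x < 1.0 else x


# A heterozygous locus with a true 1:1 allele ratio, measured at 1.2 and at 0.8.
bias_over = allelic_bias(1.2, 1.0)
bias_under = allelic_bias(0.8, 1.0)
print(bias_over, bias_under)
```

Restating values below 1 as reciprocals makes the measure symmetric: over-representing one allele by a given factor yields the same bias as over-representing the other allele by that factor.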

Haplotype refers to a combination of alleles at multiple loci that are transmitted together on the same chromosome. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. Haplotype can also refer to a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated.

Haplotypic Data, also “Phased Data” or “Ordered Genetic Data,” refers to data from a single chromosome in a diploid or polyploid genome, i.e., either the segregated maternal or paternal copy of a chromosome in a diploid genome.

Phasing refers to the act of determining the haplotypic genetic data of an individual given unordered, diploid (or polyploid) genetic data. It may refer to the act of determining which of two genes at an allele, for a set of alleles found on one chromosome, are associated with each of the two homologous chromosomes in an individual.

Phased Data refers to genetic data where the haplotype has been determined.

Target Individual refers to the individual whose genetic data is being determined. In one context, only a limited amount of DNA is available from the target individual. In one context, the target individual is a fetus. In some embodiments, there may be more than one target individual. In some embodiments, each fetus that originated from a pair of parents may be considered to be target individuals.

Child is used interchangeably with the terms embryo, blastomere, and fetus. Note that in the presently disclosed embodiments, the concepts described apply equally well to individuals who are a born child, a fetus, an embryo or a set of cells therefrom. The use of the term child may simply be meant to connote that the individual referred to as the child is the genetic offspring of the parents.

Parental Context refers to the genetic state of a given SNP, on each of the two relevant chromosomes for each of the two parents of the target.

Primary Genetic Data refers to the analog intensity signals that are output by a genotyping platform. In the context of SNP arrays, primary genetic data refers to the intensity signals before any genotype calling has been done. In the context of sequencing, primary genetic data refers to the analog measurements, analogous to the chromatogram, that come off the sequencer before the identity of any base pairs has been determined, and before the sequence has been mapped to the genome.

Secondary Genetic Data refers to processed genetic data that are output by a genotyping platform. In the context of a SNP array, the secondary genetic data refers to the allele calls made by software associated with the SNP array reader, wherein the software has made a call whether a given allele is present or not present in the sample. In the context of sequencing, the secondary genetic data refers to data in which the base pair identities of the sequences have been determined, and possibly the sequences have also been mapped to the genome.

Joint Distribution Model refers to a model that defines the probability of events defined in terms of multiple random variables, given a plurality of random variables defined on the same probability space, where the probabilities of the variable are linked.

Methods for Determining Aneuploidy by using Data for a Chromosome of Interest from a Diploid Sample(s) to Set a Bias Model for Other Samples in a Parallel Analysis

Provided herein in one embodiment are methods and systems for determining the copy number, or detecting aneuploidy of a chromosome or chromosome segment of interest in a cell of interest that are performed using the chromosome or chromosome segment of interest to set a bias model, that is to set test parameters, using samples analyzed in the same parallel analysis, that are identified as diploid samples with high confidence, for the analysis of aneuploidy for the same chromosome or chromosome segment of interest of other sample(s) in the set of on-test samples. Accordingly, in one example of this embodiment, provided herein is a method for determining a presence or absence of aneuploidy of a chromosome or chromosome segment of interest in a test sample, that includes the following steps:

- a) obtaining genetic data for the chromosome or chromosome segment of interest from each sample of a set of samples that includes the test sample and at least one diploid sample, wherein the genetic data is obtained from a parallel analysis of the set of samples;
- b) setting a bias model using the genetic data for the chromosome or chromosome segment of interest in the diploid sample determined to be disomic for the chromosome or chromosome segment of interest;
- c) adjusting the genetic data for the chromosome or chromosome segment of interest for the test sample using the bias model; and
- d) establishing the presence or absence of aneuploidy for the chromosome or chromosome segment of interest in the test sample using the adjusted genetic data.
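By way of a non-limiting illustration, steps a) through d) can be sketched as follows. The sketch assumes that the genetic data for each sample is reduced to a single number, namely the fraction of that sample's reads mapping to the chromosome of interest, and that the bias model is simply the mean and standard deviation of that fraction across the diploid samples. The function names, the numeric values, and the cutoff of 3.0 are hypothetical and are not taken from this disclosure.

```python
from statistics import mean, stdev

def set_bias_model(diploid_fractions):
    """Step b: summarize the chromosome-of-interest read fraction in diploid samples.

    Requires at least two diploid samples so that a standard deviation exists.
    """
    return {"mean": mean(diploid_fractions), "sd": stdev(diploid_fractions)}

def adjust(test_fraction, model):
    """Step c: express the test value relative to the diploid expectation (a z-score)."""
    return (test_fraction - model["mean"]) / model["sd"]

def call_aneuploidy(z, cutoff=3.0):
    """Step d: flag the sample when the adjusted value exceeds a hypothetical cutoff."""
    return abs(z) > cutoff

# Step a: hypothetical read fractions for the chromosome of interest,
# all obtained from the same parallel analysis.
diploid_fractions = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132, 0.0128]
test_fraction = 0.0138

model = set_bias_model(diploid_fractions)
z = adjust(test_fraction, model)
print(round(z, 2), call_aneuploidy(z))
```

Because the mean and standard deviation come from the chromosome of interest itself in samples from the same run, this sketch does not rely on a control chromosome.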

In the present embodiment of the invention for determining the presence or absence of aneuploidy, the set of samples comprises the test sample and a subset of high probability diploid samples that includes at least one diploid sample. The subset of samples can include, for example, 1-1,056 samples. In illustrative methods the set or subset can be made up of 2, 3, 4, 5, 10, 20, 25, 30, 40, 50, 95, 96, 100, 150, 200, 250, 500, 750, 959, 960, 1046, 1050, 1055, 1056, or 1500 samples on the low end of the range, 3, 4, 5, 10, 20, 25, 30, 40, 50, 95, 96, 100, 150, 200, 250, 500, 750, 959, 960, 1046, 1050, 1051, 1055, 1056, or 1150, 1500, 2000, or 2500 samples on the high end of the range. The set is at least 1 sample more than the subset, and can be 2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 47, 50, 95, 100, 150, 200, 250, 500, 750, or 1000 samples more in certain embodiments.

In certain examples of the invention, at least one sample known to be diploid is used as a control, and run alongside one or more target samples. For example, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40 or 50 control samples identified in advance of the run to be diploid, can be run alongside on-test samples. Any of the analytical methods disclosed herein can then be used to determine the presence or absence of aneuploidy in one or more test samples.

In certain illustrative examples of this embodiment, the at least one diploid sample is determined to be disomic for the chromosome or chromosome segment of interest by analyzing the genetic data from the parallel analysis. In certain examples, an initial or first analytical technique identifies samples that are disomic for one or more chromosome regions with high confidence. The identities of these diploid samples, which are disomic at all of the chromosomes or chromosome segments of interest, are then used to set a bias model for a different analytical technique, or a second run of the same analytical technique. Thus, for examples where a sample is a serum sample, the present invention provides an advantage in that fewer sequencing reads, and accordingly lower cost, are associated with performing the method. This is because, for serum samples in methods analyzing circulating free DNA, especially circulating fetal DNA or circulating tumor DNA, many if not most of the samples in a parallel run contain DNA originating from only diploid cells. In illustrative methods of the present invention, at least some of these samples are identified in an initial analysis, and their identities are used in the analysis of the other samples in the set of samples being analyzed.

Some embodiments of the invention employ the step of selecting, determining or identifying a subset of patients from a larger set of patients. The original set of patients is used as the source of target samples containing DNA from target cells and non-target samples containing DNA from non-target cells for analysis. A skilled artisan will understand that numerous methods are known in the art for obtaining genetic data for a chromosome or chromosome segment of interest from a set of samples in a parallel analysis.

In some embodiments of the invention, the DNA samples obtained are modified using standard molecular biology techniques in order to be sequenced on a DNA sequencer. In some embodiments the technique will involve forming a genetic library containing priming sites for the DNA sequencing procedure. In some embodiments, a plurality of loci may be targeted for site specific amplification. In some embodiments the targeted loci are polymorphic loci, e.g., single nucleotide polymorphisms. In embodiments employing the formation of genetic libraries, libraries may be encoded using a DNA sequence that is specific for the patient, e.g. barcoding, thereby permitting multiple patients to be analyzed in a single flow cell (or flow cell equivalent) of a high throughput DNA sequencer. Although the samples are mixed together in the DNA sequencer flow cell, the determination of the sequence of the barcode permits identification of the patient source that contributed the DNA that had been sequenced.

Methods are known in the art for obtaining genetic data from a sample. Typically this involves amplification of DNA in the sample, a process which transforms a small amount of genetic material to a larger amount of genetic material that contains a similar set of genetic data. This can be done by a wide variety of methods, including, but not limited to, Polymerase Chain Reaction (PCR), ligation-mediated PCR, degenerate oligonucleotide primed PCR, Multiple Displacement Amplification, allele-specific amplification techniques, Molecular Inversion Probes (MIP), padlock probes, other circularizing probes, and combinations thereof. Many variants of the standard protocol can be used, for example increasing or decreasing the times of certain steps in the protocol, increasing or decreasing the temperature of certain steps, increasing or decreasing the amounts of various reagents, etc. The DNA amplification transforms the initial sample of DNA into a sample of DNA that is similar in the set of sequences, but of much greater quantity. In some cases, amplification may not be required. Detailed teachings about isolation and amplification of DNA from a sample are provided in the sample preparation section herein.

The genetic data of the target individual and/or of the related individual can be transformed from a molecular state to an electronic state by measuring the appropriate genetic material using tools and/or techniques taken from a group including, but not limited to: genotyping microarrays, and high throughput sequencing. Some high throughput sequencing methods and systems include Sanger DNA sequencing, pyrosequencing, the ILLUMINA SOLEXA platform, ILLUMINA's GENOME ANALYZER, ILLUMINA's HISEQ or MISEQ, APPLIED BIOSYSTEM's SOLiD platform, ION TORRENT'S PGM or PROTON platforms, HELICOS's TRUE SINGLE MOLECULE SEQUENCING platform, HALCYON MOLECULAR's electron microscope sequencing method, or any other sequencing method. All of these methods physically transform the genetic data stored in a sample of DNA into a set of genetic data that is typically stored in a memory device en route to being processed.

Any relevant individual's genetic data can be obtained from the following: the individual's bulk diploid tissue, one or more diploid cells from the individual, one or more haploid cells from the individual, one or more blastomeres from the target individual, extra-cellular genetic material found on the individual, extra-cellular genetic material from the individual found in maternal blood or the blood of a cancer patient, cells from the individual found in maternal blood, one or more embryos created from (a) gamete(s) from the related individual, one or more blastomeres taken from such an embryo, extra-cellular genetic material found on the related individual, genetic material known to have originated from the related individual, and combinations thereof. In illustrative embodiments, methods provided herein are used to analyze free DNA originating from the genome of a target sample from a target cell, such as fetal cell or tumor cell.

It will be appreciated by those of ordinary skill in the art that in those embodiments of the invention in which the target DNA is not enriched for specific loci, the entire genome may be sequenced, although assembly of the sequence into a complete genome is not required for use of the subject methods. Allelic data about specific loci may be readily determined from whole genome sequencing. Cell free DNA may be conveniently analyzed in commercially available high throughput DNA sequencers. Such high throughput DNA sequencers may also be used in embodiments of the invention employing the targeted amplification of loci of interest, including polymorphic loci.

The term “cell free DNA” as used herein refers to DNA that is available for analysis without requiring the step of lysing cells. Cell free DNA can be found in blood or other bodily fluids. Cell free DNA may be obtained from a variety of tissues. Such tissues may be tissues that are in liquid form such as blood, lymph, ascites fluid, cerebral spinal fluid, and the like. Cell free DNA may be from a variety of cellular sources. In some cases the cell free DNA will be comprised of DNA derived from fetal cells. The cell free DNA may be a mixture of DNA derived from target cells and non-target cells. In the case of the analysis of DNA for fetal aneuploidies, in some embodiments the cell free DNA may be obtained from the blood of the pregnant woman, wherein the cell free DNA comprises a mixture of maternally derived cell free DNA and fetally derived cell free DNA. In other embodiments, the cell free DNA may be derived from a cancerous tumor cell. The cell free DNA may comprise a mixture of cell free DNA derived from the tumor cell and cell free DNA derived from non-tumor cells elsewhere in the body.

Genetic data, e.g., DNA sequence data, can be obtained from a mixture of DNA comprising DNA derived from one or more target cells and DNA derived from one or more non-target cells. The target cells and non-target cells differ with respect to one another at the genomic level, as well as by virtue of other criteria. The term “derived” is used to indicate that the cells are the ultimate source of the DNA. Thus, for example, cell-free DNA obtained from maternal blood of a pregnant woman is derived from cells from the placenta of the fetus, which are typically genetically identical to the fetus itself, and from the mother's cells. The method employs a set of patients.

The genetic data is obtained from each member of the patient set typically in a parallel biochemical analysis (i.e. a single assay run). Each patient in the set of patients is analyzed using essentially the same method of nucleic sequence analysis, e.g., the same amplification and sequencing reagents analyzed at the same time on the same run of the same instruments. In some embodiments, all of the samples in the set of samples are mixed together and analyzed so that the analysis conditions will be essentially identical; the analysis of the mixed samples may be termed an experiment, or a sample run or a parallel analysis. In some embodiments, the samples may be mixed prior to amplification. In some embodiments the samples may be mixed after some amplification steps but before other amplification steps. In some embodiments, the samples may be mixed after the amplification steps, but before the sequencing step. Methods for barcoding, known in the art and further discussed herein, help to facilitate simultaneous analysis of multiple samples because the identity of a sample can be determined by a barcode sequence associated with nucleic acids derived from that sample.
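As a minimal illustration of how a barcode sequence can be used to attribute pooled reads to their sample of origin, consider the following sketch. It assumes, purely for illustration, that each read begins with a fixed-length barcode that must match a known barcode exactly; real barcoding schemes may place the barcode elsewhere or tolerate mismatches. The function name, barcodes, and sample labels are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group pooled reads by sample using a leading, exact-match barcode."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is not None:  # reads with an unknown barcode are discarded
            by_sample[sample].append(insert)
    return dict(by_sample)

# Hypothetical pooled reads from one parallel analysis.
pooled_reads = ["ACGTTTGA", "TTAGCCAT", "ACGTGGCA", "GGGGAAAA"]
barcodes = {"ACGT": "patient_1", "TTAG": "patient_2"}
reads_by_sample = demultiplex(pooled_reads, barcodes, barcode_len=4)
print(reads_by_sample)
```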

In some embodiments, especially those involving methods that provide likelihoods of a ploidy state using hypothesis testing, genetic information is obtained at a plurality of loci. In some embodiments, at least some, and possibly all of the loci are polymorphic. In some embodiments, all of the loci are non-polymorphic. The same loci are analyzed in both the target and non-target cells. A number of sequence reads is obtained for each locus. In some embodiments the number of each allele at a given locus is quantitated. The quantitative data obtained can be from a combination of the loci from the target cell and the non-target cell genomes. Accordingly, in some embodiments, the genetic data provides an amount of DNA corresponding to each locus in a set of loci wherein the loci are present on the chromosome or chromosome segment of interest. In illustrative examples, each chromosome or chromosome segment of interest can include 10, 15, 20, 25, 30, 40, 50, 100, 250, 500, 1000, 1500, 2000, 2500, 5000, or 10,000 loci on the low end of the range, and 15, 20, 25, 30, 40, 50, 100, 250, 500, 1000, 1500, 2000, 2500, 5000, 10,000 or 25,000 loci on the high end of the range.

The amount of each locus detected by sequencing preparations of a DNA obtained from target and non-target cells can vary from locus to locus for reasons other than the starting quantity of the locus in the initial sample material prior to preparation for sequencing, e.g. prior to an amplification step such as PCR. Variables such as PCR primer binding efficiency, amplicon length, GC content, and the like can cause variations in the representation of individual loci in a preparation for sequencing or during sequencing. Factors such as these can result in a locus specific bias causing the overrepresentation or underrepresentation of one locus relative to another. In addition, bias can result from sample-specific inconsistencies. For example, due to a pipetting error or other measurement error during physical processing of the samples, one sample can have more DNA than another sample in the set of samples. In illustrative embodiments, these sample-specific biases are taken into account by sample-specific parameters. Certain sample-specific parameters, such as alpha, in the QMM section herein, can be identified based on observing certain properties of data. Another sample-specific parameter illustrated in the QMM section herein is the pair of factors c_s and Ts, which are constant per sample and represent, for example, the initial quantity of DNA and the total number of sequence reads. These can be thought of as the sample-specific amplification factor.

Independent of the specific method used to produce the genetic information, the amount of genetic sequence information from each locus is dependent upon the relative quantity of the copy numbers of the loci in the original sample. Loci that are believed to be on the same chromosomal segment, or in some embodiments the same chromosome, are presumed to have the same starting amount. Thus, for example, the multiple loci present on chromosome 21 in the genome of the target cell (or the genome of the non-target cell) are presumed to be present in approximately equal amounts in the genomic DNA. Thus differences in the amount of observed genetic information between loci on the same chromosome are the result of locus specific bias. For example, if SNP1 and SNP2 are located on the same chromosome and assumed to have the same copy number on the same chromosome, and SNP1 is found to have a depth of read of 0.1% and SNP2 is found to have a depth of read of 0.4%, this may be explained by a quantifiable locus specific bias favoring the production of DNA sequence from SNP2 over SNP1. This bias may be additionally normalized by virtue of considering the distribution of possible sampling outcomes for the two different SNPs. Thus, methods of the present invention analyze bias and provide a bias model, as discussed more fully herein. In illustrative embodiments of the invention, bias models are created from chromosomes and chromosome segments of interest, in certain embodiments without the use of a control chromosome or chromosome segment.
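The SNP1 and SNP2 example can be made concrete with a short sketch. It assumes, for illustration only, that locus specific bias is expressed as each locus's depth of read divided by the mean depth of read across loci presumed to share the same copy number; the function name is hypothetical and this is not necessarily how the bias model of the disclosure is parameterized.

```python
def relative_locus_bias(depth_fractions):
    """Scale each locus's share of reads by the mean share across the loci."""
    avg = sum(depth_fractions.values()) / len(depth_fractions)
    return {locus: depth / avg for locus, depth in depth_fractions.items()}

# Depth of read of 0.1% for SNP1 and 0.4% for SNP2, as in the text above.
bias = relative_locus_bias({"SNP1": 0.001, "SNP2": 0.004})

# With equal copy number assumed, SNP2 yields about four times the reads of SNP1.
ratio = bias["SNP2"] / bias["SNP1"]
print(bias, ratio)
```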

Accordingly, the selected high confidence diploid subset of patients is, in certain examples, used to set a bias model. Small variations in reaction conditions mean that samples run at different times experience slightly different conditions, resulting in different relative rates of enrichment and measurement for different molecules of DNA. Various parameters, including reaction-specific parameters, sample-specific parameters and target locus-specific parameters, can be set as part of the bias model, allowing normalization of the differing relative rates of enrichment and measurement.

Examples of such biases include amplification bias, sequencing bias, processing bias, enrichment bias, measurement bias, and combinations thereof. The nature of such biases may vary in accordance with the specific amplification technology, sequencing technology, processing, enrichment technology, and particular conditions present for a specific reaction, etc. selected for implementation of the specific embodiment. For example, the diploid sample subset can be used to calculate a per-sample constant of normalization that reflects the overall number of reads in the sample, e.g. the percentage of reads in a sample. In another embodiment, the diploid sample subset can be used to calculate a per-locus constant of normalization that reflects the overall number of reads in the sample, e.g. the percentage of reads in a sample. In some embodiments, the relative amount of DNA from each sample that is present in the experiment can be calculated, and used to normalize other sample data parameters. In some embodiments, the proportion of DNA mapping to a chromosome of interest can be calculated, and the proportion of DNA mapping to the chromosome of interest from the selected subset of patients can be used to calculate a per-experiment constant of normalization that reflects the proportion or overall amount of DNA from the chromosome of interest that is expected for a normal sample, e.g. the percentage of reads in a sample that map to the chromosome of interest. In some embodiments, the relative amount of DNA mapping to each of a plurality of targeted loci can be calculated in the selected subset of patients, and this can be used to calculate a per-experiment, per-locus constant of normalization that reflects the amplification and/or measurement bias for each locus.
In some embodiments, the bias model could be used to create a noise parameter that aggregates amplification bias and various possible errors such as transcription error rates, contamination rates, and/or sequencing error rates. In certain examples, a bias model, or a portion thereof, such as an allele-specific amplification bias, that was calculated from data from a prior run can be used by the method that initially analyzes the genetic data to identify diploid samples.
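One possible way to compute such constants of normalization is sketched below. The sketch assumes that the data is a table of read counts per sample and per locus, that the per-sample constant is the sample's share of all reads in the experiment, and that the per-locus constant is the mean within-sample read fraction of the locus across the diploid subset, with each sample's total taken over all targeted loci in the run. All names and counts are hypothetical.

```python
def per_sample_share(counts):
    """Per-sample constant: each sample's share of all reads in the experiment."""
    totals = {s: sum(loci.values()) for s, loci in counts.items()}
    grand_total = sum(totals.values())
    return {s: t / grand_total for s, t in totals.items()}

def per_locus_constant(diploid_counts):
    """Per-locus constant: mean within-sample read fraction over diploid samples."""
    loci = list(next(iter(diploid_counts.values())))
    n = len(diploid_counts)
    return {
        locus: sum(c[locus] / sum(c.values()) for c in diploid_counts.values()) / n
        for locus in loci
    }

def normalize_sample(sample_counts, locus_const):
    """Divide each observed locus fraction by the diploid expectation for that locus."""
    total = sum(sample_counts.values())
    return {l: (c / total) / locus_const[l] for l, c in sample_counts.items()}

# Hypothetical read counts for two high-confidence diploid samples and one test sample.
diploid = {
    "D1": {"L1": 100, "L2": 300, "L3": 600},
    "D2": {"L1": 220, "L2": 580, "L3": 1200},
}
test_sample = {"L1": 105, "L2": 295, "L3": 600}

shares = per_sample_share(diploid)
locus_const = per_locus_constant(diploid)
normalized = normalize_sample(test_sample, locus_const)
print(shares, locus_const, normalized)
```

In this toy data the test sample matches the diploid expectation at every locus, so each normalized value is approximately 1; a sustained departure from 1 across the loci of one chromosome would be the kind of signal that a downstream ploidy analysis examines.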

In some examples of this embodiment of the invention, quantitative allelic and non-allelic data are both analyzed so as to produce an identification of the number of chromosomes or chromosome segments of interest with a higher level of confidence than using the allelic data or non-allelic data alone. The data can be from the same set of loci and in fact the same data, analyzed separately for different alleles or as a combined sum for all alleles of a locus or a haplotype.

In certain illustrative examples of this embodiment, quantitative allelic information is used to determine the copy number of the chromosome of interest or the chromosome segment of interest without relying on a cut-off value. Polymorphic loci, e.g., from SNPs that are heterozygous between the target cell and the non-target cell, e.g. a fetus and its mother, can be used to determine the copy number of chromosomal or chromosomal segment based on quantitative allelic data from the polymorphic loci. Provided herein is an exemplary allele-based maximum likelihood method called the “heterozygote method” or “het rate method” of determining chromosome or chromosome segment copy number. Polymorphic loci that are heterozygous between the target cell and the non-target cell, e.g., a fetus and its mother, can be used to determine the relative amounts of target cell DNA and non-target cell DNA in the sample for analysis. The quantity of genetic information from the polymorphic loci is dependent upon the amount of genetic starting material and the relative amounts of DNA from the target cells and the non-target cells. The ratio of alleles at a plurality of polymorphic loci can be determined and tested against models corresponding to predicted allele ratios for various chromosome copy number (or chromosome segment copy number) hypotheses. The effects on predicted data for differing ratios of target cell DNA and non-target cell DNA are included in such models. For example, in the case of testing cell free DNA in the blood of a pregnant woman, the potential different fetal fractions (ratio of fetal DNA to total DNA; also referred to herein as child fraction) can be modeled.
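The dependence of predicted allele ratios on the copy number hypothesis and on the fetal fraction can be illustrated with a simplified sketch. It assumes a biallelic locus with alleles A and B, no amplification bias or sequencing error, and a fetal fraction defined as the fraction of genome equivalents (cells) contributed by the fetus, so that a trisomic fetal genotype contributes three allele copies per genome equivalent. These are illustrative assumptions, and the function name is hypothetical.

```python
def expected_b_fraction(maternal, fetal, fetal_fraction):
    """Expected fraction of B-allele reads in a maternal/fetal DNA mixture.

    maternal and fetal are genotype strings such as "AA", "AB" or "ABB";
    the string length is the copy number at the locus.
    """
    m, f = 1.0 - fetal_fraction, fetal_fraction
    b_copies = m * maternal.count("B") + f * fetal.count("B")
    all_copies = m * len(maternal) + f * len(fetal)
    return b_copies / all_copies

f = 0.10  # hypothetical fetal fraction
disomy = expected_b_fraction("AA", "AB", f)         # f / 2
trisomy_abb = expected_b_fraction("AA", "ABB", f)   # 2f / (2 + f)
trisomy_aab = expected_b_fraction("AA", "AAB", f)   # f / (2 + f)
print(disomy, trisomy_abb, trisomy_aab)
```

Measured allele ratios at many such loci can then be compared with these predictions for each hypothesis and each candidate fetal fraction.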

In certain embodiments, diploid samples are determined and/or aneuploidy of the chromosome or chromosome segment of interest is established using one or two algorithms that provide maximum likelihoods. In these methods the collected data is typically tested against a plurality of copy number hypotheses. The copy number hypotheses can be created for the number of copies of a chromosome or number of copies of a chromosome segment of the target cell. Each hypothesis is tested against the genetic data obtained from the loci. The testing of a hypothesis against the genetic data results in the calculation of a probability value that the copy number hypothesis is correct (or conversely incorrect). In some embodiments wherein the genetic data is obtained from cell free DNA obtained from the blood of a pregnant woman, the hypothesis can include a condition that the mother is carrying multiple fetuses, e.g., twins.

The probability value is used to select a subset of patients consisting of those patients that are the source of genetic data that is found to match a specific copy number hypothesis with a specified level of confidence. In essence, a subset of patients is selected, wherein the selected subset of patients matches the selected hypothesis with a high level of confidence, the high level of confidence being specified for the specific embodiment. In illustrative examples of this embodiment of the invention, for example, the hypothesis could be that chromosome 21 has 2 copies, chromosome 13 has 2 copies, and chromosome 18 has 2 copies. Samples meeting this hypothesis with high confidence in an NIPT analysis are considered diploid samples in this embodiment. These diploid samples are then used to set a bias model. In some embodiments, the bias model is used by the same analysis technique to reassess the samples that were not included in the diploid sample subset. In other embodiments, a second analytical technique is used to analyze one or more samples in the set of samples that were not identified in the initial analysis as members of the diploid subset.
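As a simple illustration of this selection, the sketch below splits a set of samples into a high confidence diploid subset and a remainder to be reassessed. It assumes that a probability of disomy has already been computed for each sample and each chromosome of interest, and it uses a hypothetical confidence level of 0.99; the function name, sample labels, and probability values are likewise illustrative.

```python
def split_by_diploid_confidence(disomy_probs, chromosomes=("13", "18", "21"), min_conf=0.99):
    """Return (diploid subset, remaining samples).

    A sample enters the diploid subset only if every chromosome of interest
    is disomic with at least the specified confidence.
    """
    diploid, remaining = [], []
    for sample, probs in disomy_probs.items():
        if all(probs[c] >= min_conf for c in chromosomes):
            diploid.append(sample)
        else:
            remaining.append(sample)
    return diploid, remaining

# Hypothetical per-chromosome disomy probabilities from an initial analysis.
disomy_probs = {
    "S1": {"13": 0.999, "18": 0.998, "21": 0.9995},
    "S2": {"13": 0.999, "18": 0.999, "21": 0.40},
    "S3": {"13": 0.995, "18": 0.995, "21": 0.995},
}
diploid_subset, to_reassess = split_by_diploid_confidence(disomy_probs)
print(diploid_subset, to_reassess)
```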

In some embodiments, a set of at least one ploidy state hypothesis can be created for each of the chromosomes of interest of the target individual. Each of the ploidy state hypotheses may refer to one possible ploidy state of the chromosome or chromosome segment of the target individual. The set of hypotheses may include some or all of the possible ploidy states that the chromosome of the target individual may be expected to have. Some of the possible ploidy states may include nullsomy, monosomy, disomy, uniparental disomy, euploidy, trisomy, matching trisomy, unmatching trisomy, maternal trisomy, paternal trisomy, tetrasomy, balanced (2:2) tetrasomy, unbalanced (3:1) tetrasomy, other aneuploidy, and they may additionally involve unbalanced translocations, balanced translocations, Robertsonian translocations, recombinations, deletions, insertions, crossovers, and combinations thereof.

In some embodiments, the knowledge of the determined ploidy state may be used to make a clinical decision. This knowledge, typically stored as a physical arrangement of matter in a memory device, may then be transformed into a report. The report may then be acted upon. For example, the clinical decision may be to terminate the pregnancy; alternately, the clinical decision may be to continue the pregnancy. In some embodiments the clinical decision may involve an intervention designed to decrease the severity of the phenotypic presentation of a genetic disorder, or a decision to take relevant steps to prepare for a special needs child.

Some of the math in the presently disclosed embodiments makes hypotheses concerning a limited number of states of aneuploidy. In some cases, for example, only zero, one or two chromosomes are expected to originate from each parent. In some embodiments of the present disclosure, the mathematical derivations can be expanded to take into account other forms of aneuploidy, such as quadrosomy, where three chromosomes originate from one parent, pentasomy, hexasomy etc., without changing the fundamental concepts of the present disclosure. At the same time, it is possible to focus on a smaller number of ploidy states, for example, only trisomy and disomy. Note that ploidy determinations that indicate a non-whole number of chromosomes may indicate mosaicism in a sample of genetic material.

In some embodiments, the genetic abnormality is a type of aneuploidy, such as Down syndrome (or trisomy 21), Edwards syndrome (trisomy 18), Patau syndrome (trisomy 13), Turner Syndrome (45X0), Klinefelter's syndrome (a male with 2 X chromosomes), Prader-Willi syndrome, and DiGeorge syndrome. Congenital disorders, such as those listed in the prior sentence, are commonly undesirable, and the knowledge that a fetus is afflicted with one or more phenotypic abnormalities may provide the basis for a decision to terminate the pregnancy, to take necessary precautions to prepare for the birth of a special needs child, or to take some therapeutic approach meant to lessen the severity of a chromosomal abnormality.

In certain embodiments of the invention, one or two maximum likelihood methods are used. As disclosed above, the first maximum likelihood method can be used to identify diploid samples in the set of samples and to determine a first probability that the other samples in the set of samples are aneuploid. Accordingly, in certain embodiments, one or more or all of the chromosome(s) or chromosome segment(s) of interest are determined to be disomic using a first maximum likelihood method. The method includes the following steps:

- a) creating, for each sample in the set of samples, a plurality of first hypotheses wherein each first hypothesis is associated with a specific copy number for the chromosome or chromosome segment of interest,
- b) determining a first probability value for each first hypothesis, wherein the first probability value indicates the likelihood that the sample has the number of copies of the chromosome or chromosome segment that is associated with the first hypothesis, wherein the first probability values are derived from the genetic data associated with the sample, and
- c) selecting at least one diploid sample by selecting those one or more samples that most closely match a disomic copy number hypothesis for the chromosome or chromosome segment of interest, with at least a minimum level of confidence, that is, by selecting those samples that yield the highest probability of being disomic for the chromosome or chromosome segment of interest.
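The three steps above can be sketched as follows. The sketch assumes that a log-likelihood of the genetic data has already been computed for each sample under each copy number hypothesis, and that the hypotheses are given equal prior weight, so that the probability values are obtained by normalizing the likelihoods. The hypothesis labels, log-likelihood values, and the minimum confidence of 0.99 are hypothetical.

```python
import math

def hypothesis_probabilities(log_likelihoods):
    """Turn per-hypothesis log-likelihoods into probabilities (equal priors assumed)."""
    peak = max(log_likelihoods.values())  # subtract the peak for numerical stability
    weights = {h: math.exp(ll - peak) for h, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def select_diploid_samples(samples, min_conf=0.99):
    """Keep samples whose best hypothesis is disomy with at least min_conf probability."""
    selected = []
    for name, log_likelihoods in samples.items():
        probs = hypothesis_probabilities(log_likelihoods)
        best = max(probs, key=probs.get)
        if best == "disomy" and probs["disomy"] >= min_conf:
            selected.append(name)
    return selected

# Hypothetical log-likelihoods for one chromosome of interest.
samples = {
    "S1": {"monosomy": -120.0, "disomy": -100.0, "trisomy": -110.0},
    "S2": {"monosomy": -130.0, "disomy": -100.0, "trisomy": -99.0},
    "S3": {"monosomy": -130.0, "disomy": -100.0, "trisomy": -102.0},
}
diploid_samples = select_diploid_samples(samples)
print(diploid_samples)
```

Sample S3 is most likely disomic but falls short of the minimum confidence in this toy data, so it is left out of the diploid subset rather than being used to set the bias model.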

In certain examples using a maximum likelihood allelic method, the method is performed by analyzing a second chromosome or chromosome segment of interest and a third chromosome or chromosome segment of interest in the parallel analysis, wherein the diploid samples are identified by a method comprising comparing genetic data from the first, second, and third chromosome or chromosome segments of interest for each sample of the set of samples.

In these embodiments wherein the method includes at least two maximum likelihood analyses, the presence or absence of aneuploidy, or the number of copies of the chromosome or chromosome segment of interest, is determined by creating a plurality or set of second ploidy hypotheses, wherein each second ploidy hypothesis is associated with a specific copy number of the chromosome or chromosome segment of interest in the target cell. The models are then used to test how well the genetic data from each patient fits each second hypothesis. The goodness of fit for each second hypothesis is determined. A second probability value is calculated for each second hypothesis wherein the second probability value indicates the likelihood that the genome of the target cell has the number of chromosomes or chromosome segments that is specified by the second hypothesis. Thus by selecting the second hypothesis with the maximum likelihood, one may determine the copy number for the chromosome or chromosome segment in the genome of the target cell. Such first and second hypotheses can be considered in combination to increase the confidence of the aneuploidy determination, as discussed more fully herein.

In this embodiment of the invention, the subset of samples that is selected because they are identified as diploid samples with high confidence, is used to create a bias model. The bias model is created using the genetic data for the chromosome or chromosome segment of interest. In certain illustrative embodiments, high confidence diploid samples are identified, and/or the bias model is created without using a control chromosome or control chromosome segment.

In some embodiments the first hypotheses are the same as the second hypotheses.

In certain embodiments, determining a first probability value for each first hypothesis includes the following:

- a. determining an initial probability of each first hypothesis for each grid point using a uniform hypothesis prior on a 2d grid of fetal fraction and the second bias model;
- b. determining a parameter distribution for each chromosome or chromosome segment of interest based on the initial probability;
- c. determining a composite parameter distribution from the parameter distribution for each chromosome or chromosome segment of interest;
- d. determining a posterior probability of each first hypothesis based on the composite parameter distribution; and
- e. repeating steps (a)-(d) using the posterior probability as a new initial probability for each iteration until convergence is reached.

In certain embodiments, determining a ploidy probability value for each ploidy hypothesis comprises:

- a. determining an initial probability of each ploidy hypothesis for each grid point using a uniform hypothesis prior on a 2d grid of fetal fraction and the first bias model;
- b. determining a parameter distribution for each chromosome or chromosome segment of interest based on the initial probability;
- c. determining a composite parameter distribution from the parameter distribution for each chromosome or chromosome segment of interest;
- d. determining a posterior probability of each ploidy hypothesis based on the composite parameter distribution; and
- e. repeating steps (a)-(d) using the posterior probability as a new initial probability for each iteration until convergence is reached.

In these embodiments, the second bias model can be the noise parameter discussed herein. The above embodiments that analyze grid points can be used, in certain examples, with a quantitative allelic method. Further disclosure related to the above grid point hypothesis testing is found in the het rate section herein.
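
The iterative grid point procedure in steps (a) through (e) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the likelihood table `lik[c][h][g]` (chromosome `c`, hypothesis `h`, grid point `g`, where the grid points flatten a 2D grid of fetal fraction and a bias or noise parameter) is assumed to have been computed elsewhere, and the function names are hypothetical. The composite distribution is taken here to be the normalized product of the per-chromosome distributions, which is one plausible reading of step (c).

```python
def normalize(values):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def iterate_posteriors(lik, max_iter=100, tol=1e-9):
    """Iterate hypothesis probabilities until they stop changing.

    lik[c][h][g] is an assumed, precomputed likelihood of the data for
    chromosome c under hypothesis h at grid point g.
    Returns (prob, composite): prob[c][h] and the composite grid distribution.
    """
    n_chrom, n_hyp, n_grid = len(lik), len(lik[0]), len(lik[0][0])
    # (a) start from a uniform hypothesis prior for every chromosome
    prob = [[1.0 / n_hyp] * n_hyp for _ in range(n_chrom)]
    composite = [1.0 / n_grid] * n_grid
    for _ in range(max_iter):
        # (b) parameter distribution per chromosome, weighting hypotheses by prob
        per_chrom = [
            normalize([sum(prob[c][h] * lik[c][h][g] for h in range(n_hyp))
                       for g in range(n_grid)])
            for c in range(n_chrom)
        ]
        # (c) composite distribution: product across chromosomes, renormalized
        # (a production version would likely work in log space to avoid underflow)
        composite = [1.0] * n_grid
        for dist in per_chrom:
            composite = [x * y for x, y in zip(composite, dist)]
        composite = normalize(composite)
        # (d) posterior of each hypothesis, averaging over the composite distribution
        post = [
            normalize([sum(lik[c][h][g] * composite[g] for g in range(n_grid))
                       for h in range(n_hyp)])
            for c in range(n_chrom)
        ]
        # (e) reuse the posterior as the next starting probability until convergence
        delta = max(abs(post[c][h] - prob[c][h])
                    for c in range(n_chrom) for h in range(n_hyp))
        prob = post
        if delta < tol:
            break
    return prob, composite
```

Because the composite distribution pools evidence from every chromosome of interest, a chromosome whose own data are ambiguous about fetal fraction can still be evaluated at the grid points favored by the others.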

In some embodiments the genetic data obtained from the target and non-target cells identifies the alleles of polymorphic loci and the number of reads of each allele is quantitatively measured. Each first hypothesis is tested against a model specifying a specific distribution of quantitative allelic data at the plurality of polymorphic loci. Probability values are determined by calculating for each hypothesis the fit between the expected genetic data and the obtained, i.e. measured, genetic data. The probabilities can be weighted for the biological probability that a given genetic event is likely to occur.

In some embodiments the genetic data obtained from the target and non-target cells identifies the alleles of polymorphic loci and the number of reads of each is quantitatively measured without regard for the identity of the specific alleles. Each first hypothesis, or in illustrative embodiments second hypothesis, is tested against a model specifying a specific distribution of quantitative allelic data at the plurality of loci analyzed. Probability values are determined by calculating for each hypothesis the fit between the expected genetic data and the obtained, i.e. measured, genetic data.

In some embodiments, the genetic data comprises quantitative genetic data from a plurality of non-polymorphic loci, in which each second hypothesis specifies an expected distribution of quantitative data at the plurality of non-polymorphic loci, and wherein the second probability values are determined by calculating, for each second hypothesis, the goodness of fit between the expected genetic data and the normalized genetic data. In these embodiments, a test statistic, as disclosed herein for the QMM method, or a z-score could be determined.

In one embodiment of the present disclosure, where the method is used to determine the ploidy state of a fetus, the method further includes taking into account the fraction of fetal DNA in the sample. In one embodiment of the present disclosure, the method involves calculating the percent of DNA in a sample that is fetal or placental in origin. In one embodiment of the present disclosure, the threshold for calling aneuploidy is adaptively normalized based on the calculated percent fetal DNA. In some embodiments, the method for estimating the percentage of DNA that is of fetal origin in a mixture of DNA comprises obtaining a mixed sample that contains genetic material from the mother and genetic material from the fetus, obtaining a genetic sample from the father of the fetus, measuring the DNA in the mixed sample, measuring the DNA in the father sample, and calculating the percentage of DNA that is of fetal origin in the mixed sample using the DNA measurements of the mixed sample and of the father sample.

In one embodiment of the present disclosure, the fraction of fetal DNA, or the percentage of fetal DNA in the mixture, can be measured. In some embodiments the fraction can be calculated using only the genotyping measurements made on the maternal plasma sample itself, which is a mixture of fetal and maternal DNA. In some embodiments the fraction may be calculated also using the measured or otherwise known genotype of the mother and/or the measured or otherwise known genotype of the father. In some embodiments the percent fetal DNA may be calculated using the measurements made on the mixture of maternal and fetal DNA along with the knowledge of the parental contexts. In one embodiment the fraction of fetal DNA may be calculated using population frequencies to adjust the model of the probability of particular allele measurements.
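
As one illustration of using parental genotypes to calculate the fetal fraction, consider loci where the mother is homozygous for one allele and the father is homozygous for the other. The fetus is then heterozygous at those loci, so the paternal allele is expected in about half of the fetal DNA, and the fetal fraction is roughly twice the pooled proportion of paternal-allele reads. The sketch below is a simplified assumption, not the estimator of the disclosure: it ignores sequencing error and allele-specific bias, and the function name and input layout are hypothetical.

```python
def fetal_fraction_from_paternal_alleles(loci):
    """Estimate fetal fraction from loci where mother is AA and father is BB.

    loci: list of (maternal_allele_reads, paternal_allele_reads) pairs measured
    in the mixed sample. The fetus is AB at such loci, so the expected
    proportion of paternal-allele reads is fetal_fraction / 2.
    """
    total_reads = sum(a + b for a, b in loci)
    paternal_reads = sum(b for _, b in loci)
    return 2.0 * paternal_reads / total_reads


# Two hypothetical loci with 5% paternal-allele reads each
estimate = fetal_fraction_from_paternal_alleles([(950, 50), (1900, 100)])
```

Pooling reads across loci before taking the ratio weights each locus by its depth, which keeps low-coverage loci from dominating the estimate.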

The accuracy of a ploidy determination is typically dependent on a number of factors, including the number of reads and the fraction of fetal DNA in the mixture. The accuracy is typically higher when the fraction of fetal DNA in the mixture is higher. At the same time, the accuracy is typically higher if the number of reads is greater. It is possible to have a situation with two cases where the ploidy state is determined with comparable accuracies wherein the first case has a lower fraction of fetal DNA in the mixture than the second, and more reads were sequenced in the first case than the second. It is possible to use the estimated fraction of fetal DNA in the mixture as a guide in determining the number of reads necessary to achieve a given level of accuracy.
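
The trade-off between fetal fraction and number of reads can be made concrete with a rough calculation. The sketch below assumes a simple read-counting model with binomial sampling noise only: a chromosome that normally receives a proportion `p` of reads is shifted by about `p * f / 2` under a fetal trisomy at fetal fraction `f`, and the number of reads is chosen so that this shift equals a target number of standard deviations. The default proportion of 0.013 and the target of 3 standard deviations are illustrative assumptions, and the function name is hypothetical.

```python
import math


def reads_needed(fetal_fraction, chrom_proportion=0.013, target_z=3.0):
    """Approximate reads required for a trisomy shift to reach target_z sigmas.

    Assumes binomial sampling noise only, with no amplification or other bias.
    """
    # Expected increase in the proportion of reads under fetal trisomy
    shift = chrom_proportion * fetal_fraction / 2.0
    # Variance contributed by a single read under the binomial model
    variance_per_read = chrom_proportion * (1.0 - chrom_proportion)
    # Solve shift / sqrt(variance_per_read / N) = target_z for N
    return math.ceil(target_z ** 2 * variance_per_read / shift ** 2)


reads_at_10_percent = reads_needed(0.10)
reads_at_5_percent = reads_needed(0.05)
```

Under this model the required number of reads scales with the inverse square of the fetal fraction, so halving the fetal fraction calls for about four times as many reads to reach comparable accuracy.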

In an embodiment of the present disclosure, a set of samples can be run where different samples in the set are sequenced to different read depths, wherein the number of reads run on each of the samples is chosen to achieve a given level of accuracy given the calculated fraction of fetal DNA in each mixture. In one embodiment of the present disclosure, this may entail making a measurement of the mixed sample to determine the fraction of fetal DNA in the mixture; this estimation of the fetal fraction may be done with sequencing, it may be done with TaqMan, it may be done with another qPCR method, it may be done with SNP arrays, or it may be done with any method that can distinguish different alleles at a given locus. The need for a fetal fraction estimate may be eliminated by including hypotheses that cover all or a selected set of fetal fractions in the set of hypotheses that are considered when comparing to the actual measured data. After the fraction of fetal DNA in the mixture has been determined, the number of sequences to be read for each sample may be determined.

Accordingly, certain examples of the method for determining aneuploidy further include estimating a fetal fraction for each sample in the set of samples, wherein the fetal fraction is used to select the diploid subset of samples and/or to determine whether the genetic data from the test subject is indicative of an aneuploidy. That is, fetal fraction can be used in one or both methods used in a method for determining aneuploidy wherein a first method is used to identify diploid samples and a second method uses those diploid samples to determine whether another sample in a set of samples is an aneuploid sample.

FIG. 1 is a non-limiting example of a method 100 of the invention that includes the use of a first method for identifying a subset of diploid samples and a second method to increase the accuracy and/or confidence of detection of aneuploidy in NIPT. The method starts at block 102, where genetic data is obtained from a mixture of target DNA and non-target DNA for each sample from a set of samples, one for each patient in a set of patients, by running a plurality of samples from pregnant mothers in parallel. That is, the samples are analyzed together at the same time typically using the same common reagents and instruments. In this example, the set of samples are barcoded, mixed, and amplified in the same reaction. Then, the set of samples are amplified in parallel using the same or nearly the same conditions. Next, the set of samples are sequenced on the same sequencing flow cell using the same conditions.

In block 104, a method (block 105) is used to make an initial determination of the copy number of the chromosome of interest in each of the samples from the set of samples. In one example, the initial determination is made by a method that relies on allelic data, for example, the het rate method 108 provided herein. In other examples, the initial determination is made by a quantitative method 106 that relies on non-allelic data.

In block 110, a subset of samples from the plurality of samples is selected where the likelihood is very high that each of the chromosomes in the subset of samples are normally represented (i.e. diploid). In one example, one could choose only those samples for inclusion into the subset where a non-allelic quantitative method 106, such as a method that determines a depth of read, is used and the absolute value of the z-score is less than 0.5, 1, 1.5, or 2, for example, or where the z-score is indicative of disomy with at least a minimum level of confidence (e.g. 90%, 95%, 99%, 99.5%, 99.6%, 99.7%, 99.8% or 99.9%, for example). Alternately, one could choose only those samples for inclusion into the subset where a het rate method 108 is used that calculates a confidence, and where the calculated confidence that the chromosome is disomic is greater than 0.9, 0.95, 0.990, 0.9995, 0.9996, 0.997, 0.998, or 0.9999, for example. Alternately, one could simply choose a fixed size subset by choosing those samples with the highest confidence or the z-score closest to zero. For example, one could analyze 96 samples and choose the 24 samples with the highest confidence of disomy for inclusion into the subset.
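
The two selection strategies described for block 110, a threshold on the absolute z-score and a fixed-size subset ranked by confidence, can be sketched as follows. The function names, sample identifiers, and numeric values are hypothetical and are used only for illustration.

```python
def select_by_z_threshold(z_scores, max_abs_z=1.0):
    """Keep samples whose absolute z-score is below the chosen threshold."""
    return [sample for sample, z in z_scores.items() if abs(z) < max_abs_z]


def select_top_k_by_confidence(confidences, k):
    """Keep the k samples with the highest calculated confidence of disomy."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:k]


# Hypothetical first-pass results for four samples
z_scores = {"s1": 0.2, "s2": -1.4, "s3": 0.9, "s4": 3.1}
confidences = {"s1": 0.9991, "s2": 0.93, "s3": 0.998, "s4": 0.02}

reference_by_z = select_by_z_threshold(z_scores, max_abs_z=1.0)
reference_by_rank = select_top_k_by_confidence(confidences, k=2)
```

A threshold yields a subset whose size varies from run to run, while the fixed-size choice guarantees a known number of reference samples at the cost of admitting whatever confidence the k-th sample happens to have.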

Once the subset of samples has been chosen, it is possible to use the genetic data measured on the chromosome of interest in the subset of samples as a reference set of samples for a secondary analysis of the samples in the plurality of samples that are not in the subset of samples. For example, the conditions in any given amplification reaction and sequencing run are slightly different, resulting in slightly different profiles of amplification rates and biases for each locus and for each sample. If a plurality of samples are run using the same conditions, and especially if those samples are run in one homogenous mixture, then it is reasonable to believe that the relative amplification and other processing biases will be minimized. A second analytical method (block 117) can then use the reference samples to create a model of the biases (block 112), either on a per-SNP basis (i.e. what is the relative amplification and processing bias for each SNP), a per-region basis (i.e. what is the relative amplification and processing bias for each region of DNA), or on a per-chromosome basis (i.e. what is the relative amplification and other processing bias for each chromosome). In certain examples not illustrated in FIG. 1, the first analytical method (block 105) can use a bias model as well, which can be built using data from the run or which can be built from previous data.

In block 114, the genetic data for all patients in the set of patients from the parallel analysis is then normalized according to the bias model from block 112, to correct bias errors where appropriate.

In block 116, once the data is normalized, samples with unknown ploidy states can then be analyzed a second time to determine the copy number, by comparing a set of copy number hypotheses to the normalized genetic data. This may be done using the same or a different method as the initial determination in block 104. For example, this second analytical method 117 may be performed using a quantitative method 118, such as the QMM method 118 as discussed below, a het rate method 120, as also discussed below, or samples with unknown ploidy could be analyzed by both an allelic method such as a het rate method 120 and a quantitative non-allelic method 118 such as QMM. For methods that generate a maximum likelihood, such as QMM and het rate, a copy number probability value is then determined in block 122 for each copy number hypothesis.

In another example, blood is drawn from 96 pregnant women who want to know if their fetuses have Down syndrome, or trisomy 21. These 96 samples are then all processed and biochemically analyzed together, or in parallel. In other examples, the number of samples run in parallel could be, for example, at least 3, 8, 24, 36, 48, 72, 108, 144, 288, or 396. In certain examples, no more than 396 samples are analyzed in parallel. The DNA from each of the samples then has a barcode attached, and then all of the samples are pooled and amplified. The amplified DNA is then sequenced using a high throughput sequencer (e.g. block 102). Then a het rate method (e.g. block 108) is used to analyze each of the samples. The 24 samples with the highest confidence for disomy at chromosome 21 are selected (i.e. identified) as a subset of samples that could act as a reference subset (e.g. block 110). Alternately, a quantitative method could be used to analyze each of the samples to give a preliminary estimate of the proportion of DNA mapping to chromosome 21 (e.g. block 106). The 24 samples with the z-score closest to zero could be selected as a subset of samples that could act as the reference subset (e.g. block 110).

Once the reference or control subset is chosen (e.g. block 110), a second analytical method (117) can make the assumption that these cases are disomic, and then estimate the per-SNP bias, that is, the experiment-specific amplification and other processing bias for each locus using these diploid samples. Then, the second method (117) can use this experiment-specific bias estimate to correct the bias in the measurements of genetic data (e.g. sequencing reads) of the chromosome 21 loci, and for other chromosome loci (e.g. chromosomes 13, 18, X, and Y) as appropriate, for the 72 samples that are not part of the subset where disomy was assumed for chromosome 21 (e.g. block 114).

Once the reference (i.e. control) diploid samples have been selected (i.e. identified) (110), the data from the 72 samples with unknown ploidy state can then be analyzed a second time using the same or a different method (117) to determine whether the fetuses are afflicted with trisomy 21. The reference diploid subset of samples is used to set a bias model (112) that is used by a second method (117) to normalize the genetic data from the samples that were not selected as members of the high confidence diploid subset. For example, a quantitative method could be used on the remaining 72 samples, and a z-score could be calculated using the corrected measured genetic data on chromosome 21 (e.g. block 118).
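
The z-score calculation in this example can be sketched as follows, using the high confidence diploid subset from the same run, rather than a control chromosome, to supply the control mean and standard deviation for chromosome 21. The input layout (a mapping from sample identifier to the corrected proportion of reads on chromosome 21), the function name, and the numeric values are illustrative assumptions.

```python
import statistics


def z_scores_against_reference(proportions, reference_ids):
    """Score non-reference samples against the reference subset of the same run.

    proportions: sample id -> corrected proportion of reads on the chromosome
    of interest. reference_ids: samples assumed to be disomic.
    """
    reference_values = [proportions[s] for s in reference_ids]
    control_mean = statistics.mean(reference_values)
    control_sd = statistics.stdev(reference_values)
    return {
        sample: (value - control_mean) / control_sd
        for sample, value in proportions.items()
        if sample not in reference_ids
    }


# Hypothetical proportions: three reference samples and two test samples
proportions = {"r1": 0.0130, "r2": 0.0131, "r3": 0.0129, "t1": 0.0130, "t2": 0.0137}
z = z_scores_against_reference(proportions, reference_ids={"r1", "r2", "r3"})
```

In this toy input the sample `t2` lies several standard deviations above the reference mean, while `t1` sits at the mean.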

In certain embodiments, the bias correction or normalization of the genetic data is done as part of the second analysis. As part of the preliminary estimate of the ploidy state of chromosome 21, a fetal fraction, in certain examples, is calculated. The proportion of corrected reads that would be expected in the case of a disomy (the disomy hypothesis), and the proportion of corrected reads that would be expected in the case of a trisomy (the trisomy hypothesis), are calculated for a case with that fetal fraction. Alternately, if the fetal fraction was not measured previously, a set of disomy and trisomy hypotheses are generated for different fetal fractions. For each case, an expected distribution of the proportion of corrected reads is calculated given expected statistical variation in the selection and measurement of the various DNA loci. The observed corrected proportion of reads is compared to the distribution of the expected proportion of corrected reads, and a likelihood ratio is calculated for the disomy and trisomy hypotheses, for each of the 72 samples. The ploidy state associated with the hypothesis with the highest calculated likelihood, for each of the 72 samples, in this example, is selected as the correct ploidy state. In another embodiment, the corrected genetic data for the remaining 72 samples is analyzed using a plurality of orthogonal methods, and the resulting likelihoods are then combined to give a combined likelihood which is used to determine the actual ploidy state of each of the fetuses. In one embodiment, an allelic maximum likelihood method, such as the het rate method, and a quantitative method, such as the QMM method, are each used to determine the likelihood of disomy and trisomy in the fetus, and these likelihoods are combined or considered together in a set of rules that provide an output of whether a sample exhibits aneuploidy in any of the chromosome or chromosome segments of interest.
It will be apparent to an ordinary person skilled in the art how any of the approaches disclosed herein could be used for other types of whole chromosome abnormalities. Furthermore, it will be apparent to an ordinary person skilled in the art how any of the approaches disclosed herein could be used for other types of partial chromosomal abnormalities, for example, a microdeletion, a microduplication, or an unbalanced translocation.
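
The disomy versus trisomy comparison described above for the 72 samples can be sketched as follows. This is a simplified illustration, not the QMM or het rate method: it assumes a normal approximation to binomial sampling noise, a known disomic proportion of corrected reads (`p_disomy`), and a known fetal fraction. All function names and numeric values are hypothetical.

```python
import math


def expected_proportion(p_disomy, fetal_fraction, fetal_copies):
    """Expected proportion of corrected reads on the chromosome of interest."""
    # Relative dosage of the chromosome: maternal DNA has 2 copies,
    # fetal DNA has fetal_copies (2 for disomy, 3 for trisomy)
    dosage = 1.0 + fetal_fraction * (fetal_copies - 2) / 2.0
    return p_disomy * dosage / (1.0 - p_disomy + p_disomy * dosage)


def log_likelihood(observed, expected, n_reads):
    """Normal approximation to the binomial likelihood of the observed proportion."""
    variance = expected * (1.0 - expected) / n_reads
    return -0.5 * math.log(2.0 * math.pi * variance) - (observed - expected) ** 2 / (2.0 * variance)


def call_ploidy(observed, n_reads, fetal_fraction, p_disomy):
    """Return the maximum likelihood hypothesis and the trisomy:disomy log ratio."""
    copies = {"disomy": 2, "trisomy": 3}
    scores = {
        name: log_likelihood(observed, expected_proportion(p_disomy, fetal_fraction, c), n_reads)
        for name, c in copies.items()
    }
    best = max(scores, key=scores.get)
    return best, scores["trisomy"] - scores["disomy"]


# Hypothetical sample: 2 million reads, 10% fetal fraction
call, log_ratio = call_ploidy(0.01364, 2_000_000, 0.10, p_disomy=0.013)
```

When the fetal fraction has not been measured, the same functions could be evaluated over a range of fetal fractions to generate the set of disomy and trisomy hypotheses mentioned in the text.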

In some embodiments the target cells are fetal cells and non-target cells are from the mother of the fetus. In some embodiments the invention is directed to non-invasive prenatal diagnosis, and the target cells may be fetal cells and the non-target cells may be maternal cells. In some embodiments of the invention an example of a hypothesis that may be used to select the subset of samples is the hypothesis that a specific chromosome or chromosome segment is diploid i.e. present in 2 copies. Examples of chromosomes for analysis include chromosomes 13, 18, 21, X and Y, including segments thereof. For example, the subset of samples may be chosen on the basis of having the highest likelihood that all or nearly all of the DNA in the sample originated from cells with precisely two copies of the chromosome of interest. In certain embodiments, the chromosomes that are analyzed are chromosomes 13, 18, and 21.

In some embodiments, the chromosome segment(s) that is analyzed for copy number is at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or all segments selected from the group consisting of chromosome 22q11.2, chromosome 1p36, chromosome 15q11-q13, chromosome 4p16.3, chromosome 5p15.2, chromosome 17p13.3, chromosome 22q13.3, chromosome 2q37, chromosome 3q29, chromosome 9q34, chromosome 17q21.31, and the terminus of a chromosome.

Note that it has been demonstrated that DNA that originated from cancer that is living in a host can be found in the blood of the host. In the same way that genetic diagnoses can be made from the measurement of mixed DNA found in maternal blood, genetic diagnoses can equally well be made from the measurement of mixed DNA found in host blood. The genetic diagnoses may include aneuploidy states, or gene mutations. Any claim in this patent that reads on determining the ploidy state or genetic state of a fetus from the measurements made on maternal blood can equally well read on determining the ploidy state or genetic state of a cancer from the measurements on host blood.

In some embodiments, the method may allow one to determine the ploidy status of a cancer, the method comprising obtaining a mixed sample that contains genetic material from the host and genetic material from the cancer, measuring the DNA in the mixed sample, calculating the fraction of DNA that is of cancer origin in the mixed sample, and determining the ploidy status of the cancer using the measurements made on the mixed sample and the calculated fraction. In some embodiments, the method may further comprise administering a cancer therapeutic based on the determination of the ploidy state of the cancer. In some embodiments, the method may further comprise administering a cancer therapeutic based on the determination of the ploidy state of the cancer, wherein the cancer therapeutic is taken from the group comprising a pharmaceutical, a biologic therapeutic, an antibody-based therapy, and combinations thereof.

Accordingly, in some embodiments the target cell is a tumor cell and the non-target cell is a non-tumor cell. In some embodiments the cell free DNA comprises DNA that has been released by apoptosis. In some embodiments the target cell is a malignant tumor cell.

In certain embodiments, the chromosome or chromosome segment of interest is known to exhibit CNV in cancer (see for example, Liu et al., Oncotarget. 2013 November; 4(11): 1868-188, incorporated by reference in its entirety, and Beroukhim et al., Nature. 2010 Feb. 18; 463(7283): 899-905, incorporated by reference in its entirety). For example, the chromosome or chromosome segment of interest, in certain embodiments, is a chromosome or chromosome segment comprising at least 1, 2, 3, 4, 5, 10, 15, 20 or more of the following genes: ERBB2, EGFR, MYC, PIK3CA, IGF1R, FGFR1/2, KRAS, CDK4, CCND1, MDM2, MET, CDK6 (in certain embodiments, chromosome or chromosome segments that include these genes are assayed for amplification), and RB1, PTEN, CDKN2A/B, ARID1A, MAP2K4, NF1, SMAD4, BRCA1/2, MSH2/6, DCC, CDH1 (in certain embodiments, chromosome or chromosome segments that include these genes are assayed for deletion). In some embodiments wherein at least some of the genetic data is derived from circulating tumor cells, the chromosome segment, or chromosome from which the segment originates, is 1p, 2p, 2q, 3p, 3q, 4p, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 12p, 12q, 13q, 14q, 15q, 16p, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q, or 22q (see Beroukhim et al., Nature. 2010 Feb. 18; 463 Supp FIG. 6).

The following provides a non-limiting example of a method for the detection of aneuploidy (i.e. copy number variation “CNV”) in circulating tumor DNA in a blood sample from an individual at high risk of having cancer, using a method of the invention that includes the use of a first method for identifying a subset of samples having a normal copy number for one or more target chromosome regions, and a second method to increase the accuracy of detection of CNV at the one or more target chromosome regions. Accordingly, a blood sample is collected from each of 48 patients at high risk for breast cancer, of which, for example, eleven actually have breast cancer. Blood samples are centrifuged, plasma separated, and the DNA isolated from the plasma. The isolated DNA is then amplified using a non-specific amplification for six to ten cycles in one example, or it is amplified for four to sixteen cycles in another example. The DNA is then amplified using a targeted PCR protocol that targets a plurality of loci located across one or more chromosomal regions where amplification or deletion of the chromosomal regions are indicative of cancer; the regions may be focal regions or they may be entire arms of chromosomes, or entire chromosomes. The regions may be commonly observed to be deleted or amplified in specific cancers, they may be directly believed to affect oncogenesis, and/or they may be driver mutations. The targeted amplification or deletion may simultaneously amplify or delete single nucleotide variants implicated in oncogenesis, or correlated with the presence of a tumor. Each region may contain at least 20 loci, at least 50 loci, at least 100 loci, at least 200 loci, at least 500 loci, or at least 1,000 loci. The loci may be comprised of polymorphic loci, for example SNPs. The loci may also be comprised of non-polymorphic loci. The amplified DNA may be measured using high throughput sequencing.

The data from each of the samples is analyzed to determine which of the samples have a high likelihood of not having any CNVs. This analysis can involve analysis of allelic data, it can involve analysis of quantitative data, or it may involve analysis of both allelic and quantitative data. The determination that certain samples are most likely to not have any CNVs (i.e. a normal sample) is based in this example, on selecting the samples with the lowest fraction of tumor DNA, selecting the samples with the z-score closest to zero, selecting the samples where the data fits the hypothesis corresponding to no CNVs with the highest confidence or likelihood, selecting the samples known to be normal, selecting the samples from individuals with the lowest likelihood of having cancer (e.g. having a low age, being a male when screening for breast cancer, having no family history, etc.), selecting the samples with the highest input amount of DNA, selecting the samples with the highest signal to noise ratio, selecting samples based on other criteria believed to be correlated to the likelihood of having cancer, or selecting samples using some combination of criteria.

A subset of the 48 samples with a sufficiently low likelihood of having cancer in this example are selected to act as a control set of samples. The subset can be a fixed number or percent of samples, or it can be a variable number that is based on choosing only those samples that fall below a threshold. For example, the 25, 20, 15, 10, or 5% of samples with the lowest likelihood of aneuploidy or lowest absolute value z-score can be selected as the control subset. Alternatively, the 25, 20, 15, 10 or 5 samples with the lowest likelihood of aneuploidy or lowest absolute value z-score can be selected as the control subset. The quantitative data from the subset of samples can be combined, averaged, or combined using a weighted average where the weighting is based on the likelihood of the sample being normal. The quantitative data may be used to determine the per-locus bias for the amplification and sequencing of samples, as well as for sample biases and other biases disclosed herein as part of a bias model, in the instant batch of 48 samples. The per-locus bias can also include data from other batches of samples. The per-locus bias can indicate the relative over- or under-amplification that was observed for that locus compared to other loci, making the assumption that the subset of samples do not contain any CNVs, and that any observed over- or under-amplification is due to amplification and/or sequencing or other bias. The per-locus bias can take into account the GC content of the amplicon. The loci can be grouped into groups of loci for the purpose of calculating a per-locus bias. Once the per-locus bias has been calculated for each locus in the plurality of loci, the sequencing data for one or more of the samples that are not in the subset of the samples, and optionally one or more of the samples that are in the subset of samples, can be corrected by adjusting the quantitative measurements for each locus to remove the effect of the bias at that locus.
For example, if SNP 1 was observed, in the subset of patients, to have a depth of read that is twice as great as the average, the adjustment can involve replacing the number of reads corresponding to SNP 1 with a number that is half as great. If the locus in question is a SNP, the adjustment can involve cutting the number of reads corresponding to each of the alleles at that locus in half.
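
The per-locus bias estimate and correction described above, including the SNP 1 example, can be sketched as follows. This minimal illustration assumes an unweighted average across the control subset and ignores GC content and locus grouping; the function names and read counts are hypothetical.

```python
def per_locus_bias(control_depths):
    """Estimate relative over- or under-representation of each locus.

    control_depths: one list of per-locus read counts for each control sample.
    Each sample is first scaled by its own mean depth so that differences in
    total sequencing depth between samples cancel out.
    """
    n_loci = len(control_depths[0])
    scaled = []
    for depths in control_depths:
        sample_mean = sum(depths) / n_loci
        scaled.append([d / sample_mean for d in depths])
    # Average the scaled depth of each locus across the control samples
    return [sum(s[j] for s in scaled) / len(scaled) for j in range(n_loci)]


def correct_depths(depths, bias):
    """Remove the estimated bias by dividing each locus count by its bias."""
    return [d / b for d, b in zip(depths, bias)]


# Two hypothetical control samples; the first locus (SNP 1) is read at twice the average
controls = [[200, 100, 50, 50], [400, 200, 100, 100]]
bias = per_locus_bias(controls)

# A sample outside the control subset: the SNP 1 count is halved by the correction
corrected = correct_depths([300, 90, 60, 40], bias)
```

For a SNP, the same division would be applied to the read count of each allele at the locus, which leaves the allele ratio unchanged while rescaling the total depth.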

Once the sequencing data for each of the loci in one or more samples has been normalized, it is analyzed using at least one method, and in illustrative embodiments at least two methods, for the purpose of detecting the presence of a CNV at one or more chromosomal regions. The method can be a quantitative maximum likelihood method that uses only quantitative non-allelic data, it can be an allelic maximum likelihood method that only uses allelic data, including allele ratios or allele distributions, it may be a method that uses both quantitative non-allelic and allelic data, or it may be a method that uses other types of data. The likelihood of a CNV is calculated using such a method. The likelihoods produced for a plurality of hypotheses by more than one method are combined; if the methods are not orthogonal, that is, if the likelihoods generated have some correlation, a correction may be applied when combining the likelihoods.
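
Combining likelihoods from two methods can be sketched as follows. If the methods are orthogonal, their likelihoods multiply, which corresponds to adding log-likelihoods. The text does not specify the form of the correction for correlated methods, so the `correlation_discount` parameter below, which simply down-weights the second method, is a hypothetical stand-in and not the correction of the disclosure. The hypothesis names and values are likewise illustrative.

```python
import math


def combine_log_likelihoods(method_a, method_b, correlation_discount=1.0):
    """Combine per-hypothesis log-likelihoods from two methods.

    correlation_discount = 1.0 treats the methods as orthogonal; values below
    1.0 are an assumed, simplistic way to temper a correlated second method.
    Returns normalized probabilities under a flat prior over the hypotheses.
    """
    combined = {
        hypothesis: method_a[hypothesis] + correlation_discount * method_b[hypothesis]
        for hypothesis in method_a
    }
    # Normalize with the log-sum-exp trick for numerical stability
    peak = max(combined.values())
    total = sum(math.exp(value - peak) for value in combined.values())
    return {h: math.exp(value - peak) / total for h, value in combined.items()}


# Hypothetical log-likelihoods from a quantitative and an allelic method
quantitative = {"no_cnv": -10.0, "deletion": -8.0}
allelic = {"no_cnv": -5.0, "deletion": -4.0}

orthogonal = combine_log_likelihoods(quantitative, allelic)
tempered = combine_log_likelihoods(quantitative, allelic, correlation_discount=0.5)
```

With the discount applied, the second method contributes less evidence, so the combined confidence in the favored hypothesis is lower than it would be under the orthogonality assumption.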

For example, sample A, a mixture of amplified DNA originating from a mixture of normal and cancerous cells, is analyzed using a quantitative method: a region of the q arm on chromosome 22 is found to have only 90% as much DNA mapping to that region as expected; a focal region corresponding to the HER2 gene is found to have 150% as much DNA mapping to that region as expected; and the p-arm of chromosome 5 is found to have 105% as much DNA mapping to it as expected. A clinician can infer that the sample has a deletion of a region on the q arm on chromosome 22, and a duplication of the HER2 gene. The clinician can infer that, since 22q deletions are common in breast cancer, and since cells with a deletion of the 22q region on both chromosomes usually do not survive, approximately 20% of the DNA in the sample came from cells with a 22q deletion on one of the two chromosomes. The clinician may also infer that if the DNA from the mixed sample that originated from tumor cells originated from a set of tumor cells that were genetically homogeneous at the HER2 and 22q regions, then the cells contained a five-fold duplication of the HER2 region. Of course tumors tend to be heterogeneous, so this may not be an appropriate assumption.
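
The arithmetic behind these inferences can be sketched as follows, assuming a two-component mixture of normal cells with two copies of each region and a genetically homogeneous tumor population. The function names are hypothetical. Under these assumptions, the 90% depth at 22q gives a tumor-derived fraction of about 20%, and the 150% depth at HER2 then corresponds to about seven copies per tumor cell, i.e. five beyond the normal two, which is one way to read the five-fold duplication mentioned above.

```python
def tumor_fraction_from_het_deletion(relative_depth):
    """Fraction of DNA from tumor cells carrying a one-copy deletion.

    Normal cells contribute 2 copies and tumor cells 1 copy, so
    relative_depth = (2 * (1 - f) + 1 * f) / 2 = 1 - f / 2.
    """
    return 2.0 * (1.0 - relative_depth)


def tumor_copy_number(relative_depth, tumor_fraction):
    """Copies per tumor cell implied by the observed relative depth.

    relative_depth = (2 * (1 - f) + c * f) / 2, solved for c.
    """
    return (2.0 * relative_depth - 2.0 * (1.0 - tumor_fraction)) / tumor_fraction


fraction = tumor_fraction_from_het_deletion(0.90)   # 22q region at 90% of expected
her2_copies = tumor_copy_number(1.50, fraction)     # HER2 region at 150% of expected
```

The same formulas show why the homogeneity assumption matters: if only a subclone carried the HER2 amplification, the copy number per amplified cell would have to be higher to produce the same 150% signal.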

In this example, sample A is also analyzed using an allelic method: the two haplotypes on the same region on the q arm on chromosome 22 are observed to be present in a ratio of 4:5; the two haplotypes in a focal region corresponding to the HER2 gene are found to be present in ratios of 1:2; and the two haplotypes in the p-arm of chromosome 5 are observed in ratios of 20:21. All other assayed regions of the genome are found to have no statistically significant excess of either haplotype. A clinician can infer that the sample contains DNA from a tumor with a CNV in the 22q region, the HER2 region, and the 5p arm. Based on the knowledge that 22q deletions are very common in breast cancer, and/or the quantitative analysis showing an under-representation of the amount of DNA mapping to the 22q region of the genome, the clinician can infer the existence of a tumor with a 22q deletion. Based on the knowledge that HER2 amplifications are very common in breast cancer, and/or the quantitative analysis showing an over-representation of the amount of DNA mapping to the HER2 region of the genome, the clinician can infer the existence of a tumor with a HER2 amplification. Based on the inferences, the clinician may decide to pursue additional diagnostic testing such as a tumor biopsy. Based on these inferences, the clinician can perform a mammogram or an ultrasound. Based on these inferences, the clinician can perform a lumpectomy, a mastectomy, or otherwise excise the tumor. Based on these inferences, the clinician can choose a course of radiation therapy, chemotherapy, immunotherapy or other cancer therapy. It is also possible to run other genetic assays in parallel or in the same assay, for example, testing for the presence of one or more SNVs.
The clinician can choose the form of therapy, or combination of therapies, based on the genetic footprint, that is, the particular combination of CNVs and other mutations such as SNVs that are observed in the sample, combined with any other data such as clinical data or phenotypic data. It should be apparent to an ordinary person skilled in the art how any of the approaches discussed herein could be used for other types of cancer.
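
The haplotype ratios reported for sample A at the 22q and HER2 regions are consistent with the quantitative analysis under a simple mixture model, which can be checked as follows. The sketch assumes that normal cells carry one copy of each haplotype, that about 20% of the DNA is tumor-derived, and that the tumor carries either a one-haplotype deletion (22q) or an amplification confined to a single haplotype (HER2, with six copies of one haplotype and one of the other). These copy numbers and the function name are illustrative assumptions, not values stated in the disclosure.

```python
def expected_haplotype_ratio(tumor_fraction, tumor_copies_hap1, tumor_copies_hap2):
    """Expected ratio of haplotype 1 to haplotype 2 in the mixed sample.

    Normal cells are assumed to contribute one copy of each haplotype.
    """
    hap1 = (1.0 - tumor_fraction) * 1.0 + tumor_fraction * tumor_copies_hap1
    hap2 = (1.0 - tumor_fraction) * 1.0 + tumor_fraction * tumor_copies_hap2
    return hap1 / hap2


# 22q: tumor cells have lost one haplotype, giving a ratio of about 4:5
ratio_22q = expected_haplotype_ratio(0.2, 0, 1)

# HER2: tumor cells carry six copies of one haplotype, giving a ratio of about 1:2
ratio_her2 = expected_haplotype_ratio(0.2, 1, 6)
```

Agreement between the allelic ratios and the depth-based estimates is the kind of cross-check that motivates combining allelic and quantitative methods.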

Allelic Joint Distribution Methods

In certain embodiments, methods of the invention include determining whether the distribution of observed allele measurements is indicative of a euploid or an aneuploid sample, such as a fetus or circulating tumor cell, using a joint distribution model. The use of a joint distribution model provides certain advantages over methods that determine heterozygosity rates by treating polymorphic loci independently in that the resultant determinations are of significantly higher accuracy. Without being bound by any particular theory, it is believed that one reason they are of higher accuracy is that the joint distribution model takes into account the linkage between SNPs, and the likelihood of crossovers occurring. Another reason it is believed that they are of higher accuracy is that they can take into account alleles where the total number of reads is low, and the allele ratio method would produce disproportionately weighted stochastic noise. The het rate method provided herein is an example of an allelic joint distribution method that can be used to carry out many of the embodiments provided herein.

In certain embodiments provided herein, methods of the invention include determining whether the distribution of observed allele measurements is indicative of a euploid or an aneuploid sample using a maximum likelihood technique. The use of a maximum likelihood technique has certain advantages over methods that use a single hypothesis rejection technique in that the resultant determinations will be made with significantly higher accuracy. One reason is that single hypothesis rejection techniques set cutoff thresholds based on only one measurement distribution rather than two, meaning that the thresholds are usually not optimal. Another reason is that the maximum likelihood technique allows the optimization of the cutoff threshold for each individual sample instead of determining a cutoff threshold to be used for all samples regardless of the particular characteristics of each individual sample. Another reason is that the use of a maximum likelihood technique allows the calculation of a confidence for each ploidy call.

In certain embodiments provided herein, the method includes determining whether the distribution of observed allele measurements is indicative of a euploid or an aneuploid sample without comparing the distribution of observed allele measurements on a suspect chromosome to a distribution of observed allele measurements on a reference chromosome that is expected to be disomic. This is a significant improvement over methods that require the use of a reference chromosome to determine whether a suspect chromosome is euploid or aneuploid. One example of where a ploidy calling technique that requires a reference chromosome would make an incorrect call is the case of a 69,XXX (triploid) fetus, which would be called euploid since there is no disomic reference chromosome, while the method described herein would be able to determine that the fetus carries three copies of the suspect chromosome.

In certain embodiments provided herein, the method involves using algorithms that analyze the distribution of alleles that have different parental contexts, and comparing the observed allele distributions to the expected allele distributions for different ploidy states for the different parental contexts (different parental genotypic patterns). Such algorithms differ from methods that do not utilize allele distribution patterns for alleles from a plurality of different parental contexts because they allow the use of significantly more genetic measurement data from a set of sequence data in the ploidy determination, resulting in a more accurate determination. In certain embodiments provided herein, the method includes determining whether the distribution of observed allele measurements is indicative of a euploid or an aneuploid fetus using observed allelic distributions measured at loci where the mother is heterozygous. This allows the use of about twice as much genetic measurement data from a set of sequence data in the ploidy determination as methods that do not use observed allelic distributions at such loci, resulting, in some instances, in a more accurate determination.

In certain embodiments provided herein, genetic data is obtained from DNA that is isolated using selective enrichment techniques that preserve the allele distributions that are present in the original sample of DNA. In some embodiments the amplification and/or selective enrichment technique may involve targeted amplification, hybrid capture, or circularizing probes. In some embodiments, methods for amplification or selective enrichment may involve using probes where the hybridizing region on the probe is separated from the variable region of the polymorphic allele by a small number of nucleotides. This separation results in lower amounts of allelic bias. This is an improvement over methods that involve using probes where the hybridizing region on the probe is designed to hybridize at the base pair directly adjacent to the variable region of the polymorphic allele. This is also an improvement over other methods that involve amplification and/or selective enrichment methods that do not preserve well the allele distributions that are present in the original sample of DNA. Low allelic bias is critical for ensuring that the measured genetic data is representative of the original sample in methods that involve either calculating allele ratios or allele measurement distributions. Since prior methods did not focus on polymorphic regions of the genome, or on the allele distributions, it was not obvious that techniques that preserved the allele distributions would result in more accurate ploidy state determinations. Since prior methods did not focus on using allelic distributions to determine ploidy state, it was not obvious that a composition where a plurality of loci were preferentially enriched with low allelic bias would be particularly valuable for determining a ploidy state of a fetus.

The methods described herein are particularly advantageous when used on samples where a small amount of DNA is available, or where the percent of circulating fetal or tumor DNA is low. This is due to the correspondingly higher allele dropout rate that occurs when only a small amount of DNA is available, or when the percent of fetal or tumor DNA is low. A high allele dropout rate, meaning that a large percentage of the alleles were not measured for the target individual, results in inaccurate fetal fraction calculations and inaccurate ploidy determinations. Since the method disclosed herein uses a joint distribution model that takes into account the linkage in inheritance patterns between SNPs, significantly more accurate ploidy determinations may be made.

In embodiments related to NPD, the parental context may refer to the genetic state of a given SNP on each of the two relevant chromosomes for each of the two parents of the target. Note that in one embodiment, the parental context does not refer to the allelic state of the target; rather, it refers to the allelic state of the parents. The parental context for a given SNP may consist of four base pairs, two paternal and two maternal; they may be the same or different from one another. It is typically written as “m₁m₂|f₁f₂,” where m₁ and m₂ are the genetic state of the given SNP on the two maternal chromosomes, and f₁ and f₂ are the genetic state of the given SNP on the two paternal chromosomes. In some embodiments, the parental context may be written as “f₁f₂|m₁m₂.” Note that subscripts “1” and “2” refer to the genotype, at the given allele, of the first and second chromosome; also note that the choice of which chromosome is labeled “1” and which is labeled “2” is arbitrary.

Note that in this disclosure, A and B are often used to generically represent base pair identities; A or B could equally well represent C (cytosine), G (guanine), A (adenine) or T (thymine). For example, if, at a given allele, the mother's genotype was T on one chromosome, and G on the homologous chromosome, and the father's genotype at that allele is G on both of the homologous chromosomes, one may say that the target individual's allele has the parental context of AB|BB; it could also be said that the allele has the parental context of AB|AA. Note that, in theory, any of the four possible nucleotides could occur at a given allele, and thus it is possible, for example, for the mother to have a genotype of AT, and the father to have a genotype of GC at a given allele. However, empirical data indicate that in most cases only two of the four possible base pairs are observed at a given allele. In this disclosure the discussion assumes that only two possible base pairs will be observed at a given allele, although the embodiments disclosed herein could be modified to take into account the cases where this assumption does not hold.

When considering which alleles to target, one may consider the likelihood that some parental contexts are likely to be more informative than others. For example, AA|BB and the symmetric context BB|AA are the most informative contexts, because the fetus is known to carry an allele that is different from the mother. For reasons of symmetry, both AA|BB and BB|AA contexts may be referred to as AA|BB. Another set of informative parental contexts are AA|AB and BB|AB, because in these cases the fetus has a 50% chance of carrying an allele that the mother does not have. For reasons of symmetry, both AA|AB and BB|AB contexts may be referred to as AA|AB. A third set of informative parental contexts are AB|AA and AB|BB, because in these cases the fetus is carrying a known paternal allele, and that allele is also present in the maternal genome. For reasons of symmetry, both AB|AA and AB|BB contexts may be referred to as AB|AA. A fourth parental context is AB|AB, where the fetus has an unknown allelic state, and whatever the allelic state, it is one in which the mother has the same alleles. The fifth parental context is AA|AA, where the mother and father are both homozygous for the same allele.
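The ranking of parental contexts above follows from ordinary Mendelian inheritance, and can be checked with a minimal sketch. The function names below are hypothetical, and the sketch assumes a disomic fetus that inherits one allele from each parent with equal probability; it is an illustration only, not the allelic method of this disclosure.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def fetal_genotype_probs(context):
    """Fetal genotype probabilities for a context written 'mother|father'.

    Assumes disomy: one allele drawn uniformly from each parent.
    """
    mother, father = context.split("|")
    counts = Counter("".join(sorted(m + f)) for m, f in product(mother, father))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def prob_non_maternal_allele(context):
    """Probability that the fetus carries an allele absent from the mother."""
    mother, _ = context.split("|")
    probs = fetal_genotype_probs(context)
    return sum(p for g, p in probs.items() if set(g) - set(mother))


for ctx in ["AA|BB", "AA|AB", "AB|AA", "AB|AB", "AA|AA"]:
    print(ctx, fetal_genotype_probs(ctx), prob_non_maternal_allele(ctx))
```

Under these assumptions the AA|BB context always yields a fetal allele that the mother lacks, the AA|AB context does so half of the time, and the remaining contexts never do, which matches the ordering given in the text.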

In some examples of an embodiment of the invention for detecting a presence or absence of aneuploidy or for measuring the number of copies of a chromosome or chromosome segment of interest, quantitative non-allelic genetic information can be used to determine the copy number of the chromosome or chromosomal segment of interest in the target cells. For example, a quantitative non-allelic z-score method can be used to identify at least one diploid sample in the set of samples that is disomic for the chromosome or chromosome segment of interest. In such embodiments, each sample in the set of samples can be analyzed in the following manner:

determine a proportion of reads that map to the chromosome or chromosome segment of interest; calculate a z-score for the proportion of reads that map to the chromosome or chromosome segment of interest; and
select one or more samples where the absolute value of the z-score is below a threshold value as a diploid sample, or where the z-score indicates disomy with at least a minimum level of confidence (e.g. 90, 95, 96, 97, 98, 99, 99.5, or 99.9%), or select the 20, 15, 10, or 5% of samples or the 50, 25, 20, 15, 10, 5, 4, 3, 2, or 1 sample(s) with the lowest absolute value z-score for the set of samples.
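The selection steps above can be sketched as follows. The text does not fix how the z-score is computed, so this sketch assumes it is taken relative to the mean and standard deviation of the read proportions across the co-analyzed set; the function names and the read counts are hypothetical.

```python
from statistics import mean, stdev


def read_proportions(target_reads, total_reads):
    """Proportion of each sample's reads that map to the region of interest."""
    return [t / n for t, n in zip(target_reads, total_reads)]


def z_scores(proportions):
    """z-score of each sample relative to the whole set (assumed definition)."""
    mu = mean(proportions)
    sigma = stdev(proportions)
    return [(p - mu) / sigma for p in proportions]


def select_disomic(z, threshold=None, keep_n=None):
    """Return indices of samples treated as disomic.

    Keep every sample with |z| below `threshold` if one is given;
    otherwise keep the `keep_n` samples with the lowest |z|.
    """
    if threshold is not None:
        return [i for i, s in enumerate(z) if abs(s) < threshold]
    ranked = sorted(range(len(z)), key=lambda i: abs(z[i]))
    return sorted(ranked[:keep_n])


# Hypothetical read counts for five samples processed in parallel.
target_reads = [130, 131, 129, 130, 140]
total_reads = [10000] * 5
z = z_scores(read_proportions(target_reads, total_reads))
print(select_disomic(z, threshold=1.0))
print(select_disomic(z, keep_n=2))
```

A robust location and scale estimate (for example the median and median absolute deviation) could be substituted for the mean and standard deviation if aneuploid samples are expected to distort the set statistics; that choice is left open here.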

As another non-limiting example, a quantitative non-allelic threshold method can be used to identify the presence or absence of aneuploidy in the test sample. Such a method can be performed in the following manner for each sample in the set of samples:

determine a proportion of reads that map to the chromosome or chromosome segment of interest; calculate a z-score for the proportion of reads that map to the chromosome or chromosome segment of interest; and
output whether the data for the sample yields an absolute value of the z-score above a threshold value, wherein a z-score with an absolute value above the threshold is indicative of aneuploidy in the sample, or whether the data for the sample yields a z-score indicative of aneuploidy with at least a minimum level of confidence.

In related examples, non-allelic data can be used to calculate or determine a sequencing depth of read for one or more loci, or in some embodiments a depth of read for an entire chromosome or segment of a chromosome. The depth of read refers to the number of DNA fragments corresponding to the locus, chromosome segment or chromosome of interest. The number of DNA fragments may be measured using a sequencing methodology, and may refer to amplified or unamplified DNA fragments. This non-allelic depth of read information can then be compared to a threshold value (i.e., a cutoff value) relating the depth of sequencing reads from a specific chromosome or specific chromosome segment to a predicted chromosome copy number or chromosome segment copy number. In another embodiment, this non-allelic depth of read information can be used to calculate a z-score with a likelihood that a particular chromosome or chromosome segment has a particular copy number. For example, a z-score can be associated with a 70, 75, 80, 90, 95, 96, 97, 98, 99, or 99.9% confidence of a disomic or an aneuploid state for a chromosome of interest in the test sample.
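One conventional way to relate a z-score to a confidence level is through the standard normal distribution. The sketch below assumes that the z-score of a disomic sample is approximately standard normal, which the text does not state; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_coverage(z):
    """Fraction of a standard normal distribution lying within +/- |z|."""
    return math.erf(abs(z) / math.sqrt(2.0))


def z_cutoff_for_confidence(confidence):
    """|z| cutoff whose two-sided normal coverage equals `confidence` (0 to 1)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


for level in (0.90, 0.95, 0.99, 0.999):
    print(level, round(z_cutoff_for_confidence(level), 3))
```

Under this assumption, a higher confidence requirement translates directly into a larger absolute z-score cutoff.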

In further examples non-allelic quantitative threshold methods are used to determine the copy number count of a chromosome or chromosome segment in an individual, for example as part of NPD, where the target individual is the fetus (i.e. the target cells come from the placenta), and where the related individual is the mother (i.e. the non-target cells come from the mother). In this situation, cfDNA from the maternal plasma may be amplified in a targeted or untargeted (random) fashion, and sequenced. The copy number of the chromosome of interest in the target individual may be inferred by comparing the absolute or relative number of sequence reads, or sequence tags, mapping to the chromosome of interest to the number of sequence reads, or sequence tags, mapping to one or a plurality of reference chromosomes. In certain illustrative examples, the reference chromosome is the same as the chromosome of interest for aneuploidy. In other examples, a reference chromosome or set of reference chromosomes that is different from the chromosome of interest may be used. In certain illustrative examples, a subset of samples are determined to be diploid from an initial analysis of data during a parallel analysis, for example, all samples that have a z-score with an absolute value below a threshold such as 3, 2.5, 2, 1.5, 1, or 0.5. If the number of sequence reads mapping to the chromosome of interest for the remaining samples (those that were not determined to be diploid in the initial analysis) is disproportionately higher than would be expected given the number of sequence reads mapping to one or a plurality of reference chromosomes, then a fetal trisomy may be inferred. If the number of sequence reads mapping to the chromosome of interest is disproportionately lower than would be expected given the number of sequence reads mapping to one or a plurality of reference chromosomes, then a fetal monosomy may be inferred.
If the number of sequence reads mapping to the chromosome of interest is proportionate to what would be expected given the number of sequence reads mapping to the reference chromosome, then disomy may be inferred. There are many ways to determine what number of sequence reads mapping to the chromosome of interest is proportionate, or disproportionate, to what would be expected given the number of sequence reads mapping to the reference chromosome, including normalization based on representation in the genome, and also including GC-bias correction, in which the expected number of reads is normalized based on the fact that GC-rich regions of the genome may not amplify at an equivalent rate to non-GC-rich regions of the genome.
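The comparison against reference chromosomes described above can be sketched with a simple normalization by genome representation. The function names, the `tolerance` parameter, and the numeric values are hypothetical, and GC-bias correction is not modeled in this sketch.

```python
def expected_reads(reference_reads, target_share, reference_share):
    """Reads expected on the chromosome of interest if it is disomic.

    `target_share` and `reference_share` are the fractions of the assayed
    genome represented by the chromosome of interest and by the reference
    chromosome(s), respectively.
    """
    return reference_reads * target_share / reference_share


def infer_state(observed, expected, tolerance):
    """Label the observed/expected read ratio.

    `tolerance` is a hypothetical band around 1.0; in practice it would
    depend on fetal fraction and measurement noise.
    """
    ratio = observed / expected
    if ratio > 1.0 + tolerance:
        return "trisomy may be inferred"
    if ratio < 1.0 - tolerance:
        return "monosomy may be inferred"
    return "disomy may be inferred"


# Hypothetical counts: 1,000,000 reads on references covering half of the
# assayed genome; the chromosome of interest covers 1.5% of it.
exp = expected_reads(1_000_000, 0.015, 0.5)
print(round(exp), infer_state(31_500, exp, tolerance=0.02))
```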

In a related embodiment, a method of the invention includes both a non-allelic z-score based quantitative method and a maximum likelihood method based on allelic or non-allelic data. Accordingly, provided herein is a method for detecting a presence or absence of aneuploidy of a chromosome or chromosome segment of interest in a test sample, that includes the following steps:

obtaining genetic data for the chromosome or chromosome segment of interest from each sample in a set of samples comprising the test sample, wherein the genetic data is obtained from a parallel analysis of the samples;

determining whether aneuploidy is present in the test sample by a first method comprising:
- a. determining a depth of reads or a proportion of reads that map to the chromosome or chromosome segment of interest;
- b. calculating a z-score for the depth of reads or the proportion of reads that map to the chromosome or chromosome segment of interest; and
- c. determining whether the test sample is aneuploid at the chromosome or chromosome segment of interest based on the z-score, thereby providing a first result;

determining whether aneuploidy is present in the test sample by a second method comprising:
- d. creating a plurality of ploidy hypotheses wherein each ploidy hypothesis is associated with a specific copy number for the chromosome or chromosome segment of interest,
- e. determining a ploidy probability value for each ploidy hypothesis, wherein the ploidy probability value indicates the likelihood that the test sample has the specific copy number for the chromosome or chromosome segment of interest that is associated with the ploidy hypothesis, and
- f. determining which ploidy hypothesis is most likely to be correct by selecting the ploidy hypothesis with the maximum likelihood, thereby providing a second result; and

detecting the aneuploidy by considering the first result and the second result.
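The two-method procedure above can be sketched as follows, using non-allelic data for both methods. The sketch assumes a Gaussian noise model on the read proportion, equal prior weight on each hypothesis, an expected proportion that scales as 1 + f(c - 2)/2 for fetal fraction f and fetal copy number c, and a simple concordance rule for combining the two results. None of these choices is specified in the text, and all names are hypothetical.

```python
import math


def ploidy_probabilities(observed, disomic_mean, sigma, fetal_fraction,
                         copy_numbers=(1, 2, 3)):
    """Normalized likelihood of each copy-number hypothesis (steps d and e)."""
    log_l = {}
    for c in copy_numbers:
        expected = disomic_mean * (1.0 + fetal_fraction * (c - 2) / 2.0)
        log_l[c] = -0.5 * ((observed - expected) / sigma) ** 2
    top = max(log_l.values())  # subtract the maximum to avoid underflow
    weights = {c: math.exp(v - top) for c, v in log_l.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def call_sample(observed, disomic_mean, sigma, fetal_fraction, z_threshold=3.0):
    """Combine a z-score call with a maximum likelihood call."""
    # First method (steps a to c): z-score against the disomic reference.
    z = (observed - disomic_mean) / sigma
    first_result = abs(z) > z_threshold

    # Second method (steps d to f): pick the most likely hypothesis.
    probs = ploidy_probabilities(observed, disomic_mean, sigma, fetal_fraction)
    best = max(probs, key=probs.get)
    second_result = best != 2

    # Hypothetical combination rule: report a call only when both agree.
    if first_result and second_result:
        status = "aneuploid"
    elif not first_result and not second_result:
        status = "disomic"
    else:
        status = "discordant"
    return {"z": z, "best_copy_number": best,
            "probabilities": probs, "status": status}


print(call_sample(0.01365, disomic_mean=0.0130, sigma=0.0001, fetal_fraction=0.10))
```

Working with log-likelihoods and subtracting the maximum before exponentiating keeps the normalization stable when one hypothesis is overwhelmingly favored.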

The z-score based on a non-allelic quantitative threshold or cutoff value can be determined in a variety of ways. For example, an average depth of read (normalized for the length of the specific chromosome) can be obtained from a chromosome or chromosome segment, i.e., a reference chromosome or chromosome segment, that is assumed or proven to have a specific copy number with a high degree of certainty (e.g., chromosome 2 in a developing fetus can safely be assumed to be diploid). In examples of this embodiment of the invention, the cutoff value is based on a reference chromosome or chromosome segment that is the same as the chromosome or chromosome segment having the copy number that is being measured, and in certain illustrative examples, without the use of a sample known, in advance of an assay, to be diploid. In embodiments of the invention where the cutoff value is based on a reference chromosome or chromosome segment that is the same as the chromosome or chromosome segment having the copy number that is being measured, sets of patients (test subjects) can be co-analyzed in a run of a high throughput DNA sequencer, so as to produce a reference value (cutoff value). This reference value can be indicative of the number of copies of a given chromosome or chromosome segment in a patient. For example, if the amount of total DNA sequence information obtained from a specific chromosome exceeds the cutoff value, it may be possible to determine that the target cell contains a trisomy on a specific chromosome with a high degree of confidence of a correct determination. This probability of a specific chromosome copy number or chromosome segment copy number can be modified using a second probability value, wherein the second probability value is determined from allelic data.

When sequencing is used for ploidy calling of a fetus in the context of non-invasive prenatal diagnosis, there are a number of ways to analyze the sequence data to determine the ploidy of the fetus. In one method that is used in some embodiments provided herein, a non-allelic threshold method is used. In one example of such a method, the sequence data is used by counting the number of reads that map to a given chromosome. For example, consider an example where the goal is to determine the ploidy state of chromosome 21 on the fetus where the DNA in the sample is comprised of 10% DNA of fetal origin, and 90% DNA of maternal origin. In this case, one could identify disomic samples as samples that initially yield a z-score below a threshold, and compare reads obtained for chromosome 21 for a test sample to average reads of chromosome 21 for the diploid samples. If the on-test fetus were euploid, one would expect the amount of DNA per unit of genome to be about equal in chromosome 21 from a disomic sample and in chromosome 21 in a sample from the euploid on-test fetus. If the fetus were trisomic at chromosome 21, on the other hand, then one would expect there to be slightly more DNA per genetic unit from chromosome 21 from the on-test sample than for the disomic sample(s). Another method that could be used to detect aneuploidy is similar to that above, except that parental contexts could be taken into account.
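The size of the "slightly more DNA" effect in the 10% fetal, 90% maternal example can be worked out directly. The sketch below assumes a disomic mother and expresses chromosome 21 dosage relative to a fully disomic sample; the function name is hypothetical.

```python
def relative_dosage(fetal_fraction, fetal_copies, maternal_copies=2):
    """Chromosome dosage in the mixture relative to a fully disomic sample."""
    mixed = (1.0 - fetal_fraction) * maternal_copies + fetal_fraction * fetal_copies
    return mixed / 2.0


# 10% fetal DNA, as in the example above.
print(relative_dosage(0.10, fetal_copies=2))  # euploid fetus
print(relative_dosage(0.10, fetal_copies=3))  # fetus trisomic at chromosome 21
```

With a 10% fetal fraction, a trisomic fetus shifts the chromosome 21 dosage by about 5% relative to the disomic samples, which is why the excess is described as slight and why a well-matched disomic reference matters.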

Methods for Determining the Number of Copies of the Chromosome or Chromosome Segment Employing a Reference Value Derived from a Subset of Patients

One embodiment of the invention is a method for determining the number of copies of a chromosome or chromosome segment of interest in the genome of a target cell, such as a fetal cell or tumor cell. Genetic data, e.g., DNA sequence data, can be obtained from a mixture of DNA comprising DNA derived from one or more target cells and DNA derived from one or more non-target cells. The target cells and non-target cells differ with respect to one another at the genomic level, as by virtue of other criteria. The term “derived” is used to indicate that the cells are the ultimate source of the DNA. Thus, for example, cell-free DNA obtained from the maternal blood of a pregnant woman is derived from fetal cells and the mother's cells. The method employs a set of patients. The genetic data is obtained from each member of the patient set. Each patient in the set of patients is analyzed using essentially the same method of nucleic acid sequence analysis, e.g., the same amplification and sequencing reagents. Genetic information is obtained at a plurality of loci. In some embodiments, at least some, and possibly all, of the loci are polymorphic. In some embodiments, all of the loci could be non-polymorphic. In some embodiments, the same loci are analyzed in both the target and non-target cells. In other embodiments the loci comprise non-polymorphic loci and also polymorphic loci; in this case, methods that utilize allelic data can be used with the allelic data measured on the polymorphic loci as input, and other methods that utilize non-polymorphic data can be used with the non-polymorphic data measured on non-polymorphic loci as input, optionally including additional non-polymorphic data that can be produced by summing the allelic quantities from each of the alleles at one or more of the polymorphic loci. A number of sequence reads is obtained for each locus. In some embodiments the number of each allele at a given locus is quantitated.
The quantitative data obtained can be from a combination of the loci from the target cell and the non-target cell genomes. A depth of sequencing reference value is derived from the genetic data obtained from this set of patients or in some embodiments, the depth of sequencing reference value is derived from a subset of the original set of patients. The genetic data derived from the specific chromosome or chromosome segment of interest from a selected patient in the set of patients is compared to the reference value, wherein the comparison indicates the copy number of the specific chromosome or chromosome segment of interest from the selected patient.

In some embodiments, the genetic data is obtained by sequencing. The sequencing may be performed on a high throughput parallel DNA sequencer.

In some embodiments, genetic data is obtained by simultaneously sequencing a mixture comprising DNA derived from one or more target cells and DNA derived from one or more non-target cells to give genetic data at the set of loci from each member of the set of patients.

In some embodiments the target cells are fetal cells and non-target cells are from the mother of the fetus.

In some embodiments directed to non-invasive prenatal diagnosis, the target cells may be fetal cells and the non-target cells may be maternal cells.

In some embodiments of the invention, an example of a hypothesis that may be used to select a subset of samples is the hypothesis that a specific chromosome or chromosome segment is diploid, i.e., present in two copies. Examples of chromosomes for analysis include chromosomes 13, 18, 21, X and Y, including segments thereof. For example, the subset of samples may be chosen on the basis of having the highest likelihood that all or nearly all of the DNA in the sample originated from cells with precisely two copies of the chromosome of interest.

In some embodiments, the chromosome segment that is analyzed for copy number is selected from the group consisting of chromosome 22q11.2, chromosome 1p36, chromosome 15q11-q13, chromosome 4p16.3, chromosome 5p15.2, chromosome 17p13.3, chromosome 22q13.3, chromosome 2q37, chromosome 3q29, chromosome 9q34, chromosome 17q21.31, and the terminus of a chromosome.

In some embodiments, the set of loci are present on a selected region of a chromosome. In some embodiments, the method is performed independently for different chromosomes or chromosome segments. The only upper limit on the number of patients in the set of patients is imposed by the DNA sequence generating capacity of the specific DNA sequencing technology selected (including the patient multiplexing technology, e.g. barcoding, compatible with that sequencing technology). In general there will be at least 10 patients in a patient set. In some embodiments there will be at least 24 patients in the patient set, in other embodiments there will be at least 48 patients, and in other embodiments there will be at least 96 patients.

In some embodiments the target cell is a tumor and the non-target cell is a non-tumor cell.

Methods for analyzing genetic data for aneuploidy using a threshold or cutoff method are known in the art. U.S. Pat. No. 7,888,017, incorporated herein by reference, provides a method for determining fetal aneuploidy by counting the number of reads that map to a suspect chromosome and comparing it to the number of reads that map to a reference chromosome, and using the assumption that an overabundance of reads on the suspect chromosome corresponds to a trisomy in the fetus at that chromosome. Teachings provided therein can be useful in carrying out embodiments of the present invention that involve a depth of sequencing reads and a reference value. It will be understood that in this embodiment of the present invention a significant improvement over such methods is provided, because in this embodiment of the present invention the depth of sequencing reference value is derived from a subset of the original set of samples processed in parallel, using the chromosome or chromosome segment of interest in samples initially determined to be diploid in the parallel analysis, for the analysis of other samples in the parallel analysis of the set of samples. A skilled artisan with this disclosure will understand how to modify methods provided in these cited threshold method patents to perform methods provided herein.

Methods for Determining the Number of Copies of a Chromosome or Chromosome Segment in which a Set of Patients that have a Relative Fraction of DNA from the Chromosome of Interest Close to the Median of the Relative Fraction of DNA from the Chromosome of Interest from a Larger Set of Patients

One embodiment of the invention is a method for determining the number of copies of a chromosome or chromosome segment of interest in the genome of a target cell, such as a fetal cell or tumor cell. Genetic data, e.g., DNA sequence data, can be obtained from a mixture of DNA comprising DNA derived from one or more target cells and DNA derived from one or more non-target cells. The target cells and non-target cells differ with respect to one another at the genomic level, as by virtue of other criteria. The term “derived” is used to indicate that the cells are the ultimate source of the DNA. Thus, for example, cell-free DNA obtained from the maternal blood of a pregnant woman is derived from fetal cells and the mother's cells. The method employs a set of patients. The genetic data is obtained from each member of the patient set. Each patient in the set of patients is analyzed in parallel using essentially the same method of nucleic acid sequence analysis, e.g., the same amplification and sequencing reagents. Genetic information is obtained. The quantitative data obtained can be from a combination of the loci from the target cell and the non-target cell genomes. The genetic data obtained from the combination of the target cell DNA and the non-target cell DNA is used to obtain genetic data of the relative fraction of DNA (depth of sequencing read) that corresponds to the chromosome or chromosome segments of interest.

A subset of patients is selected as a control subset, by choosing those patients where the relative fraction of DNA that corresponds to the chromosome or chromosome segments of interest in the obtained genetic data for that patient is closest to the median of the relative fractions for the set of patients. This median can be obtained on a per locus basis, or in other embodiments by grouping loci into subsets of loci, which are generally in close physical proximity to one another (e.g., a genetic linkage with one another) on the chromosome or chromosome segment of interest or by looking at a chromosome or chromosome segment as a whole. A reference value is determined for the relative fraction of DNA in the obtained genetic data that corresponds to the chromosome or chromosome segments of interest from the subset of patients. The reference value for the relative fraction of DNA that corresponds to the chromosome or chromosome segment of interest is compared to the obtained genetic data from a selected patient in the set of patients, wherein the comparison produces an experimental value indicative of the presence or absence of a genetic abnormality in chromosome copy number or chromosome segment copy number in the target cell.

In some embodiments, the subset is selected as the 25, 20, 15, 10, 5, or 2% of patients, or the 50, 40, 30, 25, 20, 15, 10, 5, or 2 patients, whose genetic data is closest to the mean, or preferably the median, for all samples.
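Selecting a control subset closest to the median and deriving a reference value from it can be sketched as follows. The sketch works on a single relative fraction per patient for the chromosome as a whole, takes the mean of the control subset as the reference value, and reports the ratio to that reference as the experimental value; these choices, the names, and the sample values are hypothetical.

```python
from statistics import mean, median


def control_subset(fractions, n_controls):
    """Indices of the patients whose relative fraction is closest to the median."""
    med = median(fractions)
    ranked = sorted(range(len(fractions)), key=lambda i: abs(fractions[i] - med))
    return sorted(ranked[:n_controls])


def experimental_values(fractions, n_controls):
    """Ratio of each non-control patient's fraction to the control reference."""
    controls = control_subset(fractions, n_controls)
    reference = mean([fractions[i] for i in controls])
    values = {i: fractions[i] / reference
              for i in range(len(fractions)) if i not in controls}
    return values, reference


# Hypothetical relative fractions of reads on the chromosome of interest.
fractions = [0.0130, 0.0131, 0.0129, 0.0130, 0.01365, 0.0132]
values, reference = experimental_values(fractions, n_controls=3)
print(control_subset(fractions, 3), reference, values)
```

Using the median rather than the mean to pick the controls keeps a small number of aneuploid samples in the run from pulling the control subset toward themselves.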

In some embodiments, the experimental value may exceed a specific diagnostic threshold value. In some embodiments the genetic data is obtained by DNA sequencing. In some embodiments the genetic data from the set of patients is obtained by simultaneously sequencing a mixture comprising DNA derived from one or more target cells and DNA derived from one or more non-target cells to give genetic data at the set of loci from each member of the set of patients.

In some embodiments, the genetic data is obtained by sequencing. The sequencing may be performed on a high throughput parallel DNA sequencer.

In some embodiments, genetic data is obtained by simultaneously sequencing a mixture comprising DNA derived from one or more target cells and DNA derived from one or more non-target cells to give genetic data at the set of loci from each member of the set of patients.

In some embodiments the target cells are fetal cells and non-target cells are from the mother of the fetus.

In some embodiments directed to non-invasive prenatal diagnosis, the target cells may be fetal cells and the non-target cells may be maternal cells.

In some embodiments of the invention, an example of a hypothesis that may be used to determine the subset of samples is the hypothesis that a specific chromosome or chromosome segment is diploid, i.e., present in two copies. Examples of chromosomes for analysis include chromosomes 13, 18, 21, X and Y, including segments thereof.

In some embodiments, the chromosome segment that is analyzed for copy number is selected from the group consisting of chromosome 22q11.2, chromosome 1p36, chromosome 15q11-q13, chromosome 4p16.3, chromosome 5p15.2, chromosome 17p13.3, chromosome 22q13.3, chromosome 2q37, chromosome 3q29, chromosome 9q34, chromosome 17q21.31, and the terminus of a chromosome.

In some embodiments, the set of loci are present on a selected region of a chromosome. In some embodiments, the method is performed independently for different chromosomes or chromosome segments. The only upper limit on the number of patients in the set of patients is imposed by the DNA sequence generating capacity of the specific DNA sequencing technology selected (including the patient multiplexing technology, e.g. barcoding, compatible with that sequencing technology). In general there will be at least 10 patients in a patient set. In some embodiments there will be at least 24 patients in the patient set, in other embodiments there will be at least 48 patients, and in other embodiments there will be at least 96 patients.

In some embodiments the target cell is a tumor and the non-target cell is a non-tumor cell. In some embodiments the first probability value is derived from the genetic data obtained from polymorphic loci that comprise alleles present in the target cells that are not present in the non-target cells. In some the cell free DNA comprises DNA that that has been released by apoptosis. In some embodiments the target cell is tumor cell, such tumor cells may be a malignant tumor cell.

In some embodiments, provided herein is a method for determining a presence or absence of a fetal aneuploidy in a fetus for each of a plurality of maternal blood samples obtained from a plurality of different pregnant women, said maternal blood samples comprising fetal and maternal cell-free genomic DNA, that includes the following steps:

determining a number of enumerated sequence reads corresponding to a chromosome or chromosome segment of interest for each of the plurality of samples;

determining a reference value of enumerated sequence reads from a diploid subset of between 1 and 50 samples of the plurality of samples or between 1-50% of samples of the plurality of samples having a number of enumerated sequence reads closest to the median number of enumerated sequence reads for the plurality of maternal blood samples; and

comparing the number of enumerated sequence reads from at least one of, or each of, the other samples of the plurality of samples that are not diploid samples, to the reference value, wherein a value above a cutoff is indicative of aneuploidy in the sample, thereby determining the presence or absence of a fetal aneuploidy in the chromosome or chromosome segment of interest.
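By way of illustration only, the reference value determination and comparison steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the use of a mean over the presumed-diploid subset as the reference value, the ratio-based comparison, and the example cutoff are all hypothetical, and the read counts are assumed to have already been normalized for total sequencing yield per sample.

```python
from statistics import mean, median


def diploid_reference(read_counts, subset_size):
    """Pick the samples closest to the median read count and average them.

    read_counts: dict mapping sample id -> enumerated reads for the
    chromosome or segment of interest (assumed pre-normalized).
    Returns (presumed-diploid sample ids, reference value).
    """
    med = median(read_counts.values())
    # Rank by distance from the median; the sample id breaks ties deterministically.
    ranked = sorted(read_counts, key=lambda s: (abs(read_counts[s] - med), s))
    subset = ranked[:subset_size]
    reference = mean(read_counts[s] for s in subset)
    return subset, reference


def call_samples(read_counts, subset_size, cutoff_ratio):
    """Compare every sample outside the diploid subset to the reference value."""
    subset, reference = diploid_reference(read_counts, subset_size)
    calls = {}
    for sample, count in read_counts.items():
        if sample in subset:
            continue  # presumed diploid, used only to set the reference
        # Hypothetical rule: flag when the count exceeds the reference by the cutoff ratio.
        calls[sample] = (count / reference) > cutoff_ratio
    return reference, calls


counts = {"s1": 1000, "s2": 1010, "s3": 990, "s4": 1005, "s5": 1100}
reference, calls = call_samples(counts, subset_size=3, cutoff_ratio=1.05)
```

In this toy run the three samples nearest the median set the reference value, and only the sample with a clear excess of reads is flagged.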

In certain embodiments the method further comprises before the determining the number of enumerated sequence reads:

- a. obtaining a fetal and maternal cell-free genomic DNA sample from each of the plurality of maternal blood samples;
- b. generating a library derived from each fetal and maternal cell-free genomic DNA sample,
- c. performing massively parallel sequencing of polynucleotide sequences of the library from the chromosome or chromosome segment of interest; and
- d. enumerating sequence reads corresponding to fetal and maternal polynucleotide sequences selected from the chromosome or chromosome segment of interest.

In certain embodiments, the reference value of enumerated sequence reads is determined from a diploid subset of between 10 and 40 samples closest to the median.

In certain embodiments, the reference value of enumerated sequence reads is determined from a diploid subset of between 15 and 40 samples closest to the median.

In other embodiments, the diploid subset can be determined by selecting a diploid subset of between 1 and 50, 2 and 40, or 10 and 40 of the samples or between 1-50%, 2-40%, 5-25%, or 5-10% of the samples having a number of enumerated sequence reads closest to the median number of enumerated sequence reads for the plurality of maternal blood samples.

In these embodiments, each library of polynucleotide sequences can include an indexing nucleotide sequence which identifies a maternal blood sample of the plurality of maternal blood samples. Such examples typically include pooling the libraries generated to produce a pool of enriched and indexed fetal and maternal non-random polynucleotide sequences.

In certain embodiments, the plurality of non-random polynucleotide sequences comprises at least 100 different non-random polynucleotide sequences selected from a first chromosome tested for being aneuploid (i.e., chromosome of interest), wherein each of said plurality of non-random polynucleotide sequences is from 10 to 1000 nucleotide bases in length.

In certain embodiments, the method further includes selectively enriching a plurality of non-random polynucleotide sequences of each fetal and maternal cell-free genomic DNA sample.

In methods of the immediately above embodiment, further background teaching can be found in U.S. Pat. No. 8,318,430, hereby incorporated by reference in its entirety.

Embodiments that Determine Aneuploidy with Improved Confidence by Utilizing a Non-Allelic Threshold Method and a Method that Determines Likelihoods

In some embodiments of the invention, improved confidence for an aneuploidy determination can be obtained by determining aneuploidy of a sample using a quantitative non-allelic threshold or cutoff method and, for the same sample, determining aneuploidy using a method that determines likelihoods. If the sample is identified as having aneuploidy in a chromosome or chromosome segment of interest by a threshold method and the sample is identified as having aneuploidy with high confidence using a likelihood determination for a set of hypotheses, then the sample is identified as a sample having aneuploidy at the chromosome or chromosome segment of interest for one or more target cells in a subject that is the source of the sample.

Accordingly, provided herein is a method for determining a presence or absence of aneuploidy of a chromosome or chromosome segment of interest in a test sample, comprising

- a. obtaining genetic data for the chromosome or chromosome segment of interest from a set of samples comprising the test sample, wherein the genetic data is obtained from a parallel analysis of the samples;
- b. determining whether aneuploidy is present in the test sample by a first method comprising
  - i. determining a depth of read or a proportion of reads that map to the chromosome or chromosome segment of interest;
  - ii. calculating a z-score for the depth of reads or the proportion of reads that map to the chromosome or chromosome segment of interest; and
  - iii. determining whether the z-score for the test sample is above a threshold value or whether the z-score is indicative of aneuploidy with a minimum level of confidence;
- c. determining whether aneuploidy is present in the test sample by a second method comprising
  - i. creating a plurality of ploidy hypotheses wherein each ploidy hypothesis is associated with a specific copy number for the chromosome or chromosome segment of interest,
  - ii. determining a ploidy probability value for each ploidy hypothesis, wherein the ploidy probability value indicates the likelihood that the test sample has the number of copies of the chromosome or chromosome segment of interest that is associated with the ploidy hypothesis, and
  - iii. determining which ploidy hypothesis is most likely to be correct by selecting the ploidy hypothesis with the maximum likelihood, wherein aneuploidy is determined for the chromosome or chromosome segment of interest in the test sample when both the maximum likelihood ploidy hypothesis is an aneuploidy and the z-score from step b.ii is above the threshold value of step b.iii.

In the above method, step B is carried out by a non-allelic threshold method and step C is carried out using a likelihood determining method. Methods are known in the art for carrying out a non-allelic threshold analysis, especially for NIPT. For example, U.S. Pat. Nos. 7,888,017 and 8,318,430, incorporated in their entirety herein by reference, provide methods for determining fetal aneuploidy by counting the number of reads that map to a suspect chromosome and comparing it to the number of reads that map to a reference chromosome, and using the assumption that an overabundance of reads on the suspect chromosome corresponds to a triploidy in the fetus at that chromosome. Teachings provided therein can be useful in carrying out embodiments of the present invention that involve a depth of sequencing reads and a reference value.
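By way of illustration only, the combination of steps b and c above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the threshold of 3.0, the use of a sample standard deviation over co-analyzed reference samples, and the example likelihood values are hypothetical, and the ploidy likelihoods are taken as given rather than computed. The sketch follows the text in testing only for a z-score above the threshold.

```python
from statistics import mean, stdev


def z_score(test_value, reference_values):
    """Standard score of the test sample against the reference samples."""
    return (test_value - mean(reference_values)) / stdev(reference_values)


def combined_call(test_proportion, reference_proportions,
                  hypothesis_likelihoods, z_threshold=3.0,
                  euploid_copy_number=2):
    """Call aneuploidy only when both methods agree.

    hypothesis_likelihoods: dict mapping copy number -> likelihood
    (assumed to come from a separate likelihood-determining method).
    """
    z = z_score(test_proportion, reference_proportions)
    # Second method: pick the maximum likelihood ploidy hypothesis.
    best_copy_number = max(hypothesis_likelihoods, key=hypothesis_likelihoods.get)
    aneuploid = z > z_threshold and best_copy_number != euploid_copy_number
    return z, best_copy_number, aneuploid


# Proportion of reads mapping to the chromosome of interest (toy values).
reference = [0.0130, 0.0131, 0.0129, 0.0130, 0.0130]
z, best, aneuploid = combined_call(
    0.0135, reference, {1: 1e-9, 2: 0.02, 3: 0.98})
```

Requiring agreement between the two methods trades some sensitivity for confidence: a high z-score alone, or a maximum likelihood aneuploid hypothesis alone, does not produce a call in this sketch.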

In certain examples of methods of this embodiment, using a non-allelic threshold value, the non-allelic information can be used to calculate a sequencing depth of read for one or more loci, or in some embodiments a depth of read for an entire chromosome or segment of a chromosome. This non-allelic depth of read information can then be compared to a threshold value (i.e., a cutoff value) relating the depth of sequencing reads from a specific chromosome or specific chromosome segment to a predicted chromosome copy number or chromosome segment copy number. This cutoff value can be determined in a variety of ways. For example, an average depth of read (normalized for the length of the specific chromosome) can be obtained from a chromosome or chromosome segment, i.e., a reference chromosome or chromosome segment, that is assumed or proven to have a specific copy number with a high degree of certainty (e.g., chromosome 2 in a developing fetus can safely be assumed to be diploid). In some embodiments of the invention, the cutoff value is based on a reference chromosome or chromosome segment that is different than the chromosome or chromosome segment having the copy number that is being measured, wherein the different chromosome is assumed to have a specific copy number. In some embodiments of the invention, the cutoff value is based on a reference chromosome or chromosome segment that is the same as the chromosome or chromosome segment having the copy number that is being measured, and in certain illustrative examples, without the use of a sample known, in advance of an assay, to be diploid.
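By way of illustration only, a length-normalized depth of read comparison against a reference chromosome can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the mixture formula for the expected ratio (non-target cells diploid, target cells carrying the hypothesized copy number) is a simplifying assumption not taken from the text, and placing the cutoff midway between the expected disomy and trisomy ratios is one hypothetical choice among many.

```python
def normalized_depth(read_count, region_length_bp):
    """Average depth of read per base for a chromosome or segment."""
    return read_count / region_length_bp


def depth_ratio(test_reads, test_length_bp, ref_reads, ref_length_bp):
    """Depth on the region of interest relative to the reference region."""
    return (normalized_depth(test_reads, test_length_bp)
            / normalized_depth(ref_reads, ref_length_bp))


def expected_ratio(copy_number, target_fraction, ref_copy_number=2):
    """Expected depth ratio for a mixture of target and non-target DNA.

    Assumes non-target cells are diploid for the region of interest and the
    reference region has ref_copy_number copies in all cells.
    """
    mixed_copies = (1 - target_fraction) * 2 + target_fraction * copy_number
    return mixed_copies / ref_copy_number


def midpoint_cutoff(target_fraction):
    """Hypothetical cutoff halfway between the disomy and trisomy ratios."""
    return (expected_ratio(2, target_fraction)
            + expected_ratio(3, target_fraction)) / 2


ratio = depth_ratio(test_reads=1050, test_length_bp=1000,
                    ref_reads=2000, ref_length_bp=2000)
above_cutoff = ratio > midpoint_cutoff(target_fraction=0.10)
```

Under these assumptions a 10% target fraction shifts the expected ratio for a trisomy by only 5%, which is why the depth of read and the choice of cutoff matter for a confident determination.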

In embodiments of the invention where the cutoff value is based on a reference chromosome or chromosome segment that is the same as the chromosome or chromosome segment having the copy number that is being measured, sets of patients (test subjects) can be co-analyzed in a run of a high throughput DNA sequencer, so as to produce a reference value (cutoff value). This reference value can be indicative of the number of copies of a given chromosome or chromosome segment in a patient. For example, if the amount of total DNA sequence information obtained from a specific chromosome exceeds the cutoff value, it may be possible to determine that the target cell contains a trisomy on a specific chromosome with a high degree of confidence of a correct determination. In these examples, the same data, or a subset thereof, that is used for the non-allelic threshold method, can be used for a non-allelic or allelic likelihood method. Thus, efficiencies are gained by using the same data or a subset thereof in a parallel experiment with the same set of samples using both the non-allelic threshold analysis and the likelihood determining method.

For the likelihood method in certain examples of methods of this embodiment, the genetic data includes quantitative allelic data from a plurality of polymorphic loci in the set of loci, wherein each of the ploidy hypotheses specifies an expected distribution of quantitative allelic data at a plurality of polymorphic loci, and wherein the ploidy probability values are determined by calculating, for each of the ploidy hypotheses, the fit between the expected genetic data and the obtained genetic data. In certain examples of methods of this embodiment, the genetic data includes quantitative non-allelic data from a plurality of polymorphic loci in the set of loci, wherein each of the ploidy hypotheses specifies an expected mean value of quantitative non-allelic data at the plurality of polymorphic loci, and wherein the ploidy probability values are determined by calculating, for each of the ploidy hypotheses, the fit between the expected genetic data and the obtained genetic data. Methods that provide likelihoods are provided throughout this application. These include both allelic and non-allelic methods. For example, a het-rate method provided herein or a QMM method can be used.

Non-Invasive Prenatal Diagnosis (NIPD)

Non-invasive prenatal diagnosis is an important technique that can be used to determine the genetic state of a fetus from genetic material that is obtained in a non-invasive manner, for example from a blood draw on the pregnant mother. The blood could be separated and the plasma isolated, and size selection can optionally be used to isolate the DNA of the appropriate length. This isolated DNA can then be measured by a number of means, such as by hybridizing to a genotyping array and measuring the fluorescence, or by sequencing on a high throughput sequencer.

In illustrative examples the methods and systems provided herein are used for NIPD, also referred to herein as non-invasive prenatal testing (NIPT). The process of non-invasive prenatal diagnosis in certain embodiments involves a number of steps. Some of the steps can include: (1) obtaining the genetic material from the fetus; (2) optionally enriching the genetic material of the fetus, ex vivo; (3) amplifying the genetic material, ex vivo; (4) optionally preferentially enriching specific loci in the genetic material, ex vivo; (5) genotyping the genetic material, ex vivo; and (6) analyzing the genotypic data, on a computer, and ex vivo. Methods to reduce to practice these and other relevant steps are disclosed herein. At least some of the method steps are not directly applied on the body. In an embodiment, the present disclosure relates to methods of treatment and diagnosis applied to tissue and other biological materials isolated and separated from the body. At least some of the method steps are executed on a computer.

Some embodiments of the present disclosure allow a clinician to determine the genetic state of a fetus that is gestating in a mother in a non-invasive manner such that the health of the baby is not put at risk by the collection of the genetic material of the fetus, and that the mother is not required to undergo an invasive procedure. Moreover, in certain aspects, the present disclosure allows the fetal genetic state to be determined with high accuracy, significantly greater accuracy than, for example, the non-invasive maternal serum analyte based screens, such as the triple test, that are in wide use in prenatal care.

The accuracy of the methods disclosed herein is a result of an informatics approach to analysis of the genotype data, as described herein. Modern technological advances have resulted in the ability to measure large amounts of genetic information from a genetic sample using such methods as high throughput sequencing and genotyping arrays. The methods disclosed herein allow a clinician to take greater advantage of the large amounts of data available, and make a more accurate diagnosis of the fetal genetic state. The details of a number of embodiments are given below. Different embodiments may involve different combinations of the aforementioned steps. Various combinations of the different embodiments of the different steps may be used interchangeably.

In one embodiment, a blood sample is taken from a pregnant mother, and the free floating DNA in the plasma of the mother's blood, which contains a mixture of both DNA of maternal origin, and DNA of fetal origin, is used to determine the ploidy status of the fetus. In one embodiment of the present disclosure, a key step of the method involves preferential enrichment of those DNA sequences in a mixture of DNA that correspond to polymorphic alleles in a way that the allele ratios and/or allele distributions remain mostly consistent upon enrichment. In one embodiment of the present disclosure, the method involves sequencing a mixture of DNA that contains both DNA of maternal origin, and DNA of fetal origin. In one embodiment of the present disclosure, a key step of the method involves using measured allele distributions to determine the ploidy state of a fetus that is gestating in a mother.

Screening Maternal Blood Containing Free Floating Fetal DNA

The methods described herein may be used to help determine the genotype of a child, fetus, or other target individual where the genetic material of the target is found in the presence of a quantity of other genetic material. In this disclosure, the discussion focuses on determining the genetic state of a fetus where the fetal DNA is found in maternal blood, but this example is not meant to limit the possible contexts to which this method may be applied. In addition, the method may be applicable in cases where the amount of target DNA is in any proportion with the non-target DNA; for example, the target DNA could make up anywhere between 0.000001 and 99.999999% of the DNA present. In addition, the non-target DNA does not necessarily need to be from one individual, or even from a related individual, as long as genetic data from non-target individual(s) is known. In one embodiment of the present disclosure, the method can be used to determine genotypic data of a fetus from maternal blood that contains fetal DNA. It may also be used in a case where there are multiple fetuses in the uterus of a pregnant woman, or where other contaminating DNA may be present in the sample, for example from other already born siblings.

This technique may make use of the phenomenon of fetal blood cells gaining access to maternal circulation through the placental villi. Ordinarily, only a very small number of fetal cells enter the maternal circulation in this fashion (not enough to produce a positive Kleihauer-Betke test for fetal-maternal hemorrhage). The fetal cells can be sorted out and analyzed by a variety of techniques to look for particular DNA sequences, but without the risks that these latter two invasive procedures inherently have. This technique may also make use of the phenomenon of free floating fetal DNA gaining access to maternal circulation by DNA release following apoptosis of placental tissue where the placental tissue in question contains DNA of the same genotype as the fetus. The free floating DNA found in maternal plasma has been shown to contain fetal DNA in proportions as high as 30-40% fetal DNA.

In one embodiment of the present disclosure, blood may be drawn from a pregnant woman. Research has shown that maternal blood may contain a small amount of free floating DNA from the fetus, in addition to free floating DNA of maternal origin. In addition, there also may be nucleated fetal blood cells containing DNA of fetal origin, in addition to many blood cells of maternal origin, which typically do not contain nuclear DNA. There are many methods known in the art to isolate fetal DNA, or create fractions enriched in fetal DNA. For example, chromatography has been shown to create certain fractions that are enriched in fetal DNA.

Once the sample of maternal blood, plasma, or other fluid, drawn in a relatively non-invasive manner, and that contains an amount of fetal DNA, either cellular or free floating, either enriched in its proportion to the maternal DNA, or in its original ratio, is in hand, one may genotype the DNA found in said sample. The method described herein can be used to determine genotypic data of the fetus. For example, it can be used to determine the ploidy state at one or more chromosomes, it can be used to determine the identity of one or a set of SNPs, including insertions, deletions, and translocations. It can be used to determine one or more haplotypes, including the parent of origin of one or more genotypic features.

Note that this method will work with any nucleic acids that can be used for any genotyping and/or sequencing methods, such as the ILLUMINA INFINIUM ARRAY platform, AFFYMETRIX GENECHIP, ILLUMINA GENOME ANALYZER, HiSEQ or MiSEQ, LIFE TECHNOLOGIES’ SOLiD SYSTEM, or Ion Torrent Personal Genome Machine or Proton. This includes extracted free-floating DNA from plasma or amplifications (e.g., whole genome amplification, PCR) of the same; genomic DNA from other cell types (e.g., human lymphocytes from whole blood) or amplifications of the same. For preparation of the DNA, any extraction or purification method that generates genomic DNA suitable for one of these platforms will work as well. In one embodiment, storage of the samples may be done in a way that will minimize degradation (e.g., at −20° C. or lower).

Methods for Determining the Number of Copies of a Chromosome or Chromosome Segment of Interest by Combining Allelic and Non-Allelic Genetic Data

Other embodiments of the invention include methods for determining the number of copies of a chromosome or chromosome segment of interest in the genome of a target cell, such as a fetal cell or tumor cell. Genetic data, e.g., DNA sequence data, can be obtained from a mixture of DNA comprising DNA derived from one or more target cells and DNA derived from one or more non-target cells. The method can employ a single patient or a set of patients. The genetic data is obtained from a patient. Genetic information is obtained at a plurality of loci. At least some, and possibly all, of the loci are polymorphic. The same loci are analyzed in both the target and non-target cells. A number of sequence reads is obtained for each locus. The number of sequence reads at each allele at a given locus is quantitated. The quantitative data obtained can be from a combination of the loci from the target cell and the non-target cell genomes. The collected data is then tested against a plurality of copy number hypotheses, i.e., the copy number of the chromosome or chromosome segment of interest. A first probability value is calculated for each hypothesis, i.e., the probability that the hypothesis is either true or false given the measured genetic data. Thus the likelihood that the genome of the target cell has the number of copies of the chromosome or chromosome segment of interest specified by the hypothesis is determined. This first probability value is obtained using the allelic data. A second probability value is calculated for each hypothesis, i.e., the probability that the hypothesis is either true or false given the measured genetic data. Thus the likelihood that the genome of the target cell has the number of copies of the chromosome or chromosome segment of interest specified by the hypothesis is determined. This second probability value is obtained using the non-allelic data.
For each hypothesis, the first probability value and the second probability value can be combined, e.g., through multiplication, to give a combined probability indicating the likelihood that the genome of the target cell has the number of copies of the chromosome or chromosome segment that is associated with the hypothesis. The number of copies of the chromosome or chromosome segment of interest in the genome of the target cell can be determined by selecting the number of copies of the chromosome or chromosome segment that is associated with the hypothesis with the greatest combined probability. In some embodiments wherein the genetic data is obtained from cell free DNA obtained from the blood of a pregnant woman, the hypothesis can include a condition that the mother is carrying multiple fetuses, e.g., twins.
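By way of illustration only, combining the first (allelic) and second (non-allelic) probability values through multiplication and selecting the hypothesis with the greatest combined probability can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the example probability values are hypothetical, and the renormalization step is an added convenience that does not change which hypothesis is selected.

```python
def combine_probabilities(allelic, non_allelic):
    """Multiply per-hypothesis probabilities and renormalize.

    Both arguments map a copy number hypothesis to a probability value.
    """
    if set(allelic) != set(non_allelic):
        raise ValueError("both methods must score the same hypotheses")
    combined = {h: allelic[h] * non_allelic[h] for h in allelic}
    total = sum(combined.values())
    if total > 0:
        # Renormalizing keeps the values comparable as probabilities.
        combined = {h: p / total for h, p in combined.items()}
    return combined


def most_likely_copy_number(allelic, non_allelic):
    """Return the copy number with the greatest combined probability."""
    combined = combine_probabilities(allelic, non_allelic)
    return max(combined, key=combined.get), combined


allelic = {1: 0.05, 2: 0.55, 3: 0.40}      # first probability values (toy)
non_allelic = {1: 0.01, 2: 0.30, 3: 0.69}  # second probability values (toy)
best, combined = most_likely_copy_number(allelic, non_allelic)
```

In this toy example the allelic data alone would favor two copies, while the combined probability favors three, which shows how the two data types can complement each other.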

Accordingly, in some embodiments, genetic data is obtained by simultaneously sequencing a mixture comprising DNA derived from one or more target cells and derived from one or more non-target cells to give genetic data at the set of loci from each member of the set of patients. In some embodiments the target cells are fetal cells and non-target cells are from the mother of the fetus. That is, in some embodiments directed to non-invasive prenatal diagnosis, the target cells may be fetal cells and the non-target cells may be maternal cells. In some embodiments of the invention, an example of a hypothesis that may be used to select the subset of patients is the hypothesis that a specific chromosome or chromosome segment is diploid, i.e., present in 2 copies. Examples of chromosomes for analysis include chromosomes 13, 18, 21, X and Y, including segments thereof. In some embodiments, the chromosome segment that is analyzed for copy number is selected from the group consisting of chromosome 22q11.2, chromosome 1p36, chromosome 15q11-q13, chromosome 4p16.3, chromosome 5p15.2, chromosome 17p13.3, chromosome 22q13.3, chromosome 2q37, chromosome 3q29, chromosome 9q34, chromosome 17q21.31, and the terminus of a chromosome.

In some embodiments, the set of loci are present on a selected region of a chromosome. In some embodiments, the method is performed independently for different chromosomes or chromosome segments. The only upper limit on the number of patients in the set of patients is imposed by the DNA sequence generating capacity of the specific DNA sequencing technology selected (including the patient multiplexing technology, e.g., barcoding, compatible with that sequencing technology). In illustrative embodiments there will be at least 10 patients in a patient set. In some embodiments there will be at least 24 patients in the patient set; in other embodiments, at least 48 patients; and in still other embodiments, at least 96 patients.

Methods of Determining the Number of Copies of a Chromosome or Chromosome Segment Employing Hypotheses that are Tested Using a Combination of the Allelic and Non-Allelic Data

Embodiments include methods for determining the number of copies of a chromosome or chromosome segment of interest in the genome of a target cell in which genetic data is obtained from DNA derived from target cells and DNA derived from non-target cells, wherein the genetic data comprises (i) quantitative allelic data from a plurality of polymorphic loci and (ii) quantitative non-allelic data from a plurality of polymorphic and/or non-polymorphic loci. The method includes the step of creating a plurality of hypotheses wherein each hypothesis is associated with a specific copy number for the chromosome or chromosome segment in the genome of the target cell. A probability value is calculated for each hypothesis, wherein the probability value indicates the likelihood that the genome of the target cell has the number of copies of the chromosome or chromosome segment that is associated with the hypothesis, and wherein the probability value is derived from the allelic data and the non-allelic data obtained from at least one first locus. For example, the hypothesis may be tested using a model that incorporates both allelic data and non-allelic data, thereby obtaining a probability value. Each calculated probability value can be combined to give a combined probability indicating the likelihood that the genome of the target cell has the number of copies of the chromosome or chromosome segment that is associated with the hypothesis. The number of copies of the chromosome or chromosome segment of interest in the genome of the target cell is determined by selecting the number of copies of the chromosome or chromosome segment that is associated with the hypothesis with the greatest probability. In some embodiments wherein the genetic data is obtained from cell free DNA obtained from the blood of a pregnant woman, the hypothesis can include a condition that the mother is carrying multiple fetuses, e.g., twins.

In some embodiments the probability value for each hypothesis is obtained from allelic and non-allelic data obtained from a single locus. In some embodiments the allelic data is tested on a model based on a distribution of possible allelic ratios associated with each hypothesis. In some embodiments the probability values for each hypothesis are separately determined for genetic data from at least 1000 polymorphic loci. In some embodiments the step of calculating a probability value for each hypothesis comprises the steps of (1) modeling, for each hypothesis, the expected genetic data from the DNA derived from the target cell based on the obtained genetic data comprising DNA derived from non-target cells, (2) comparing, for each hypothesis, the modeled genetic data from the DNA derived from the target cell and the obtained genetic data from DNA derived from the target cell, and (3) calculating a probability value, for each hypothesis, based on the difference between the modeled genetic data from the DNA derived from the target cell and the obtained genetic data from DNA derived from the target cell. In some embodiments the non-target cells originate from a parent of an individual from which the target cell originated, and the modeling of the expected genetic data further comprises determining the expected genetic data of the target cell using the rules of Mendelian inheritance and adjusting the expected genetic data of the target cell to correct for biases in the system as disclosed herein. Examples of such system biases include amplification bias, sequencing bias, processing bias, enrichment bias, and combinations thereof. The nature of such biases may vary in accordance with the specific amplification technology, sequencing technology, processing, enrichment technology, etc. selected for implementation of the specific embodiment.
In some embodiments the target cell is from a fetus, and the expected genetic data comprises genetic data from the parent of the fetus and genetic data from the fetus. In some embodiments the modeling of the genetic data comprises the steps of predicting, for each locus, an expected distribution of allelic measurements at that locus, and predicting, for each locus, an expected relative quantity of DNA (depth of read) at that locus. In some embodiments the prediction of an expected distribution of allelic measurements can take into account the linkage and cross-overs between different loci on the genome. In some embodiments, the expected distribution is a binomial distribution.
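By way of illustration only, scoring hypotheses against allelic read counts with a binomial expected distribution can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the hypothesis labels, and the expected allele fractions are hypothetical inputs, each locus is treated as independent (linkage and cross-overs are ignored), and no bias correction is applied.

```python
from math import lgamma, log


def binomial_log_pmf(k, n, p):
    """Log probability of k reads of one allele out of n reads, 0 < p < 1."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


def hypothesis_log_likelihood(observations, expected_fractions):
    """Sum per-locus log likelihoods, treating loci as independent.

    observations: list of (allele_reads, total_reads), one pair per locus.
    expected_fractions: expected allele fraction per locus under a hypothesis.
    """
    return sum(binomial_log_pmf(k, n, p)
               for (k, n), p in zip(observations, expected_fractions))


def best_hypothesis(observations, expected_by_hypothesis):
    """Return the hypothesis whose expected distribution fits the data best."""
    scores = {name: hypothesis_log_likelihood(observations, fractions)
              for name, fractions in expected_by_hypothesis.items()}
    return max(scores, key=scores.get), scores


# Toy data: three loci, each with 1000 reads.
observations = [(52, 1000), (48, 1000), (55, 1000)]
expected = {
    "hypothesis_a": [0.05, 0.05, 0.05],        # hypothetical expected fractions
    "hypothesis_b": [0.0667, 0.0667, 0.0667],  # hypothetical expected fractions
}
best, scores = best_hypothesis(observations, expected)
```

Working in log space avoids numerical underflow when likelihoods from many loci are multiplied, which matters when probability values are determined over a thousand or more polymorphic loci.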

Different Implementations of the Presently Disclosed Embodiments

FIG. 2 shows an example system architecture 200 useful for performing embodiments of the present invention. System architecture 200 includes an analysis platform 208 connected to one or more laboratory information systems (“LISs”) 204. Analysis platform 208 may alternatively or additionally be connected directly to LIS 206. As shown in FIG. 2, analysis platform 208 may be connected to LIS 206 over a network 202. Network 202 may include one or more networks of one or more network types, including any combination of LAN, WAN, the Internet, etc. Network 202 may encompass connections between any or all components in system architecture 200. In an embodiment, analysis platform 208 analyzes genetic data provided by LIS 206 in a software-as-a-service model, where LIS 206 is a third-party LIS, while analysis platform 208 analyzes genetic data provided by LIS 204 in a full-service or in-house model, where LIS 204 and analysis platform 208 are controlled by the same party. In an embodiment where analysis platform 208 is providing information over network 202, analysis platform 208 may be a server.

In an example embodiment, laboratory information system 206 includes one or more public or private institutions that collect, manage, and/or store genetic data. A person having skill in the relevant art(s) would understand that methods and standards for securing genetic data are known and can be implemented using various information security techniques and policies, e.g., username/password, Transport Layer Security (TLS), Secure Sockets Layer (SSL), and/or other cryptographic protocols providing communication security.

In an example embodiment, system architecture 200 operates as a service-oriented architecture and uses a client-server model that would be understood by one of skill in the relevant art(s) to enable various forms of interaction and communication between LIS 206 and analysis platform 208. System architecture 200 may be distributed over various types of networks 202 and/or may operate as cloud computing architecture. Cloud computing architecture may include any type of distributed network architecture. By way of example and not of limitation, cloud computing architecture is useful for providing software as a service (SaaS), infrastructure as a service (IaaS), platform as a service (PaaS), network as a service (NaaS), data as a service (DaaS), database as a service (DBaaS), backend as a service (BaaS), test environment as a service (TEaaS), API as a service (APIaaS), integration platform as a service (IPaaS) etc.

In an example embodiment, LISs 204 and 206 each include a computer, device, interface, etc. or any sub-system thereof. In an embodiment, LISs 204 and 206 are high-throughput DNA sequencers that conduct genetic analysis and provide such genetic data to analysis platform 208. In an embodiment, the high-throughput DNA sequencers contain PCR amplifiers. LISs 204 and 206 may include an operating system (OS), applications installed to perform various functions such as, for example, access to and/or navigation of data made accessible locally, in memory, and/or over network 202. In an embodiment, LIS 204 accesses analysis platform 208 through an application programming interface (“API”). LIS 204 may also include one or more native applications that may operate independently of an API.

In an example embodiment, analysis platform 208 includes one or more of an input processor 212, a hypothesis manager 214, a modeler 216, a bias correction unit 218, a machine learning unit 220, and an output processor 222. Input processor 212 receives and processes inputs from LISs 204 and/or 206. Processing may include but is not limited to operations such as parsing, transcoding, translating, adapting, or otherwise handling any input received from LISs 204 and/or 206. Inputs may be received via one or more streams, feeds, databases, or other sources of data, such as may be made accessible by LISs 204 and 206.

In an example embodiment, hypothesis manager 214 is configured to receive the inputs passed from input processor 212 in a form ready to be processed in accordance with hypotheses for genetic analysis that are represented as models and/or algorithms. Such models and/or algorithms may be stored in hypothesis database 224. In an embodiment, hypothesis database 224 stores such information in table format. Data from hypothesis database 224 may be used by modeler 216 to generate probabilities, for example, using the methods disclosed herein such as, for example, the non-allelic quantitative method or the allelic het rate method, and the like. Data used to derive and populate such strategy models and/or algorithms are available to hypothesis manager 214 via, for example, genetic data source 210 via LIS 204 or 206. Genetic data source 210 may include, for example, assays of samples to be analyzed by LIS 204 or 206. Hypothesis manager 214 may be configured to formulate hypotheses based on, for example, the variables required to populate its models and/or algorithms. Alternatively, hypotheses may be provided from a user and stored in hypothesis database 224. Models and/or algorithms, once populated, may be used by modeler 216 to compare one or more hypotheses to observed genetic data as described above. Modeler 216 may also develop bias models as described in various embodiments above. Bias errors, such as amplification errors and the like, may be corrected by bias correction unit 218 through performance of the bias correction mechanisms described herein.

Hypothesis manager 214 may select a particular value, range of values, or estimate based on a most-likely hypothesis as an output as described above. Modeler 216 may operate in accordance with models and/or algorithms trained by machine learning unit 220. For example, machine learning unit 220 may develop such models and/or algorithms by applying a classification algorithm, such as a Bayes classification algorithm, as described above to genetic data to identify diploid samples to be used as a reference set. Modeler 216 can then use the identified reference set to estimate, for example, copy numbers for original or bias-corrected (adjusted or normalized) genetic data. Modeler 216 can compare expected data (based on each hypothesis) with observed data to generate a probability value for each hypothesis as compared to the observed data for a target sample.

Once hypothesis manager 214 receives probability values for each hypothesis for a given target, hypothesis manager 214 can select a most-likely hypothesis as an output result. Such output may be returned to the particular LIS 204 or 206 requesting the information by output processor 222. Such information can then be transmitted for individual patient samples to their respective representatives.
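By way of non-limiting illustration, the flow of data through analysis platform 208 described above may be sketched as follows. This is a minimal sketch only: the function names, the comma-separated input format, and the toy scoring model are assumptions introduced for illustration and do not appear in the disclosure.

```python
from typing import Callable, Dict, List

Model = Callable[[List[float]], float]


def input_processor(raw: str) -> List[float]:
    """Parse a comma-separated feed (assumed format) into numeric measurements."""
    return [float(tok) for tok in raw.split(",") if tok.strip()]


def toy_model(expected: float) -> Model:
    """Hypothetical model: scores higher when the mean measurement is near `expected`."""
    def score(data: List[float]) -> float:
        mean = sum(data) / len(data)
        return 1.0 / (1.0 + abs(mean - expected))
    return score


def modeler(data: List[float], models: Dict[str, Model]) -> Dict[str, float]:
    """Score the observed data under each hypothesis and normalize to probabilities."""
    scores = {name: model(data) for name, model in models.items()}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all hypotheses scored zero")
    return {name: s / total for name, s in scores.items()}


def hypothesis_manager(probabilities: Dict[str, float]) -> str:
    """Select the most-likely hypothesis."""
    return max(probabilities, key=probabilities.get)


def output_processor(sample_id: str, call: str, probabilities: Dict[str, float]) -> dict:
    """Package the result to be returned to the requesting LIS."""
    return {"sample": sample_id, "call": call, "probability": probabilities[call]}


if __name__ == "__main__":
    models = {"disomy": toy_model(1.00), "trisomy": toy_model(1.05)}
    data = input_processor("1.04, 1.06, 1.05")
    probs = modeler(data, models)
    print(output_processor("sample-1", hypothesis_manager(probs), probs))
```

In this sketch each component is a plain function so that the hand-off between input processor, modeler, hypothesis manager, and output processor is explicit; an actual implementation may distribute these components across services.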

Various aspects of the disclosure can be implemented on a computing device by software, firmware, hardware, or a combination thereof. FIG. 3 illustrates an example computer system 300 in which the contemplated embodiments, or portions thereof, can be implemented as computer-readable code. Various embodiments are described in terms of this example computer system 300. For example, analysis platform 208 and databases 210 and 224 described above may be implemented in system 300. In addition or alternatively, the various methods described herein, such as method 100 and the additional algorithms used therein, may be executed by a computer processing system such as system 300.

Processing tasks in the embodiment of FIG. 3 are carried out by one or more processors 302. However, it should be noted that various types of processing technology may be used here, including programmable logic arrays (PLAs), application-specific integrated circuits (ASICs), multi-core processors, multiple processors, or distributed processors. Additional specialized processing resources such as graphics, multimedia, or mathematical processing capabilities may also be used to aid in certain processing tasks. These processing resources may be hardware, software, or an appropriate combination thereof. For example, one or more of processors 302 may be a graphics-processing unit (GPU). In an embodiment, a GPU is a processor that is a specialized electronic circuit designed to rapidly process mathematically intensive applications on electronic devices. The GPU may have a highly parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data. Alternatively or in addition, one or more of processors 302 may be a special-purpose parallel processor without the graphics optimization, such parallel processors performing the mathematically intensive functions described herein. One or more of processors 302 may include a processing accelerator (e.g., DSP or other special-purpose processor).

Computer system 300 also includes a main memory 330, and may also include a secondary memory 340. Main memory 330 may be a volatile memory or non-volatile memory, and divided into channels. Secondary memory 340 may include, for example, non-volatile memory such as a hard disk drive 350, a removable storage drive 360, and/or a memory stick. Removable storage drive 360 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like. The removable storage drive 360 reads from and/or writes to a removable storage unit 370 in a well-known manner. Removable storage unit 370 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 360. As will be appreciated by persons skilled in the relevant art(s), removable storage unit 370 includes a computer usable storage medium having stored therein computer software and/or data.

In alternative implementations, secondary memory 340 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 300. Such means may include, for example, a removable storage unit 370 and an interface (not shown). Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 370 and interfaces which allow software and data to be transferred from the removable storage unit 370 to computer system 300.

Computer system 300 may also include a memory controller 375. Memory controller 375 controls data access to main memory 330 and secondary memory 340. In some embodiments, memory controller 375 may be external to processor 310, as shown in FIG. 3. In other embodiments, memory controller 375 may also be directly part of processor 310. For example, many AMD™ and Intel™ processors use integrated memory controllers that are part of the same chip as processor 310 (not shown in FIG. 3).

Computer system 300 may also include a communications and network interface 380. Communication and network interface 380 allows software and data to be transferred between computer system 300 and external devices. Communications and network interface 380 may include a modem, a communications port, a PCMCIA slot and card, or the like. Software and data transferred via communications and network interface 380 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communication and network interface 380. These signals are provided to communication and network interface 380 via a communication path 385. Communication path 385 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link or other communications channels.

The communication and network interface 380 allows the computer system 300 to communicate over communication networks or mediums such as LANs, WANs, the Internet, etc. The communication and network interface 380 may interface with remote sites or networks via wired or wireless connections.

In this document, the terms “computer program medium,” “computer-usable medium” and “non-transitory medium” are used to generally refer to tangible (i.e. non-signal) media such as removable storage unit 370, removable storage drive 360, and a hard disk installed in hard disk drive 350. Signals carried over communication path 385 can also embody the logic described herein. Computer program medium and computer usable medium can also refer to memories, such as main memory 330 and secondary memory 340, which can be memory semiconductors (e.g. DRAMs, etc.). These computer program products are means for providing software to computer system 300.

Computer programs (also called computer control logic) are stored in main memory 330 and/or secondary memory 340. Computer programs may also be received via communication and network interface 380. Such computer programs, when executed, enable computer system 300 to implement embodiments as discussed herein. In particular, the computer programs, when executed, enable processor 310 to implement the disclosed processes. Accordingly, such computer programs represent controllers of the computer system 300. Where the embodiments are implemented using software, the software may be stored in a computer program product and loaded into computer system 300 using removable storage drive 360, interfaces, hard drive 350 or communication and network interface 380, for example.

The computer system 300 may also include input/output/display devices 390, such as keyboards, monitors, pointing devices, touchscreens, etc.

It should be noted that the simulation, synthesis and/or manufacture of various embodiments may be accomplished, in part, through the use of computer readable code, including general programming languages (such as C or C++), hardware description languages (HDL) such as, for example, Verilog HDL, VHDL, Altera HDL (AHDL), or other available programming tools. This computer readable code can be disposed in any known computer-usable medium including a semiconductor, magnetic disk, optical disk (such as CD-ROM, DVD-ROM). As such, the code can be transmitted over communication networks including the Internet.

The presently disclosed embodiments can be implemented advantageously in one or more computer programs that are executable and/or interpretable on system 300. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language. A computer program may be deployed in any form, including as a stand-alone program, or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed or interpreted on one computer or on multiple computers at one site (that is, system 300 may be distributed locally) or distributed across multiple sites and interconnected by a communication network (that is, system 300 may be distributed across a network). The embodiments are also directed to computer program products comprising software stored on any computer-usable medium. Such software, when executed in one or more data processing devices, causes a data processing device(s) to operate as described herein. Embodiments employ any computer-usable or -readable medium, and any computer-usable or -readable storage medium known now or in the future. Examples of computer-usable or computer-readable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nano-technological storage devices, etc.), and communication mediums (e.g., wired and wireless communications networks, local area networks, wide area networks, intranets, etc.). Computer-usable or computer-readable mediums can include any form of transitory (which include signals) or non-transitory media (which exclude signals). Non-transitory media comprise, by way of non-limiting example, the aforementioned physical storage devices (e.g., primary and secondary storage devices).

Any of the methods described herein may include the output of data in a physical format, such as on a computer screen, or on a paper printout. In explanations of any embodiments elsewhere in this document, it should be understood that the described methods may be combined with the output of the actionable data in a format that can be acted upon by a physician. In addition, the described methods may be combined with the actual execution of a clinical decision that results in a clinical treatment, or the execution of a clinical decision to make no action. Some of the embodiments described in the document for determining genetic data pertaining to a target individual may be combined with the decision to select one or more embryos for transfer in the context of IVF, optionally combined with the process of transferring the embryo to the womb of the prospective mother. Some of the embodiments described in the document for determining genetic data pertaining to a target individual may be combined with the notification of a potential chromosomal abnormality, or lack thereof, with a medical professional, optionally combined with the decision to abort, or to not abort, a fetus in the context of prenatal diagnosis. Some of the embodiments described herein may be combined with the output of the actionable data, and the execution of a clinical decision that results in a clinical treatment, or the execution of a clinical decision to make no action.

Hypotheses

A hypothesis can refer to a possible genetic state. It can refer to a possible ploidy state. It can refer to a possible allelic state. A set of hypotheses refers to a set of possible genetic states. In some embodiments, a set of hypotheses are designed such that one hypothesis from the set will correspond to the actual genetic state of any given individual. In some embodiments, a set of hypotheses are designed such that every possible genetic state can be described by at least one hypothesis from the set. In some embodiments of the present disclosure, one aspect of the method is to determine which hypothesis corresponds to the actual genetic state of the individual in question.

A “copy number hypothesis,” also called a “ploidy hypothesis,” or a “ploidy state hypothesis,” may refer to a hypothesis concerning a possible ploidy state for a given chromosome, or chromosome segment, in the target individual. It may also refer to the ploidy state at more than one of the chromosomes in the individual. A set of copy number hypotheses may refer to a set of hypotheses where each hypothesis corresponds to a different possible ploidy state in an individual. A set of hypotheses in certain examples is a set of possible ploidy states, a set of possible parental haplotype contributions, a set of possible fetal DNA percentages in the mixed sample, or combinations thereof.

A normal individual contains one of each chromosome from each parent. However, due to errors in meiosis and mitosis, it is possible for an individual to have 0, 1, 2, or more of a given chromosome from each parent. In practice, it is rare to see more than two of a given chromosome from a parent. Certain embodiments of the invention, especially those involving NIPT, consider the possible hypotheses where 0, 1, or 2 copies of a given chromosome come from a parent. In some embodiments, for a given chromosome, there are nine possible hypotheses: the three possible hypotheses concerning 0, 1, or 2 chromosomes of maternal origin, multiplied by the three possible hypotheses concerning 0, 1, or 2 chromosomes of paternal origin. Let (m,f) refer to the hypothesis where m is the number of a given chromosome inherited from the mother, and f is the number of a given chromosome inherited from the father. Therefore, the nine hypotheses are (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), and (2,2). These may also be written as H₀₀, H₀₁, H₀₂, H₁₀, H₁₁, H₁₂, H₂₀, H₂₁, and H₂₂. The different hypotheses correspond to different ploidy states. For example, (1,1) refers to a normal disomic chromosome; (2,1) refers to a maternal trisomy, and (0,1) refers to a paternal monosomy. Especially in the context of NIPT, two of these hypotheses are not feasible: (0,0) and (2,2). In some embodiments, the case where two chromosomes are inherited from one parent and one chromosome is inherited from the other parent may be further differentiated into two cases: one where the two chromosomes are identical (matched copy error), and one where the two chromosomes are homologous but not identical (unmatched copy error). In these embodiments, there are sixteen possible hypotheses. It should be understood that it is possible to use other sets of hypotheses, and a different number of hypotheses.
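As a non-limiting illustration, the (m,f) hypothesis set described above may be enumerated as follows. The function names and the labeling of each hypothesis by total copy count are assumptions made for this sketch; the default exclusion of (0,0) and (2,2) follows the NIPT remark above.

```python
from itertools import product
from typing import Iterable, List, Tuple

Hypothesis = Tuple[int, int]  # (m, f): copies inherited from mother, from father


def enumerate_hypotheses(
    max_copies: int = 2,
    exclude: Iterable[Hypothesis] = ((0, 0), (2, 2)),
) -> List[Hypothesis]:
    """List every (m, f) pair with 0..max_copies per parent, minus excluded pairs."""
    excluded = set(exclude)
    return [
        (m, f)
        for m, f in product(range(max_copies + 1), repeat=2)
        if (m, f) not in excluded
    ]


def describe(m: int, f: int) -> str:
    """Label a hypothesis by its total copy count (illustrative naming only)."""
    names = {0: "nullisomy", 1: "monosomy", 2: "disomy", 3: "trisomy", 4: "tetrasomy"}
    total = m + f
    return names.get(total, f"{total} copies")


if __name__ == "__main__":
    for m, f in enumerate_hypotheses():
        print(f"H{m}{f}: {describe(m, f)}")
```

Passing `exclude=()` returns all nine hypotheses. Note that labeling by total count does not distinguish (1,1) from the uniparental cases (2,0) and (0,2), which is why the (m,f) pair itself is kept as the identifier.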

Ploidy hypotheses are created during exemplary methods of the invention that use methods, algorithms, techniques, or subroutines that provide likelihoods. For example, in certain illustrative examples of embodiments for determining the presence or absence of aneuploidy, a set of ploidy hypotheses is created for each sample in the set of samples, wherein each hypothesis is associated with a specific copy number for the chromosome or chromosome segment of interest in a genome of a sample. For example, in embodiments that use quantitative non-allelic data, such as the QMM disclosed herein, the hypothesis can provide estimates of sample parameters, such as the variability in the starting quantity of DNA in a sample due to pipetting variability or errors or other measurement errors, which can be used to normalize the measurements (i.e. measured genetic data) at some or all of the positions on some or all of the chromosomes or chromosome segments of interest in that sample, and then a test statistic can be computed as the variance-weighted mean of these normalized measurements. Thus, in certain embodiments, the hypothesis provides a variance-weighted mean test statistic for a given ploidy condition. The expectation and variance of the test statistic are calculated under each of the chromosome copy number hypotheses to form Gaussian models for the maximum likelihood estimate. For example, a set of hypotheses in an NIPT analysis for a non-allelic quantitative analysis can provide a variance-weighted mean test statistic for a disomy or a trisomy at one or more of chromosomes 13, 18, and 21. In exemplary embodiments of the present invention where the chromosome or chromosome segment of interest can be used to set sample parameters, the hypothesis can be a joint hypothesis on the copy numbers of some or all of the chromosomes, for example chromosomes 13, 18, and 21. This is further discussed below with regard to a quantitative method that does not use non-target reference chromosomes.
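The variance-weighted mean test statistic mentioned above may be illustrated with the following minimal sketch. It assumes that "variance-weighted" means inverse-variance weighting, that the per-position measurements have already been normalized, and that their errors are independent; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def variance_weighted_mean(
    values: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Return (weighted mean, variance of that mean) using weights 1/variance.

    Assumption: each normalized measurement is an independent estimate of the
    same quantity, so positions with lower variance receive more weight.
    """
    if len(values) != len(variances) or not values:
        raise ValueError("values and variances must be non-empty and equal length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total_weight
    return mean, 1.0 / total_weight


if __name__ == "__main__":
    stat, stat_var = variance_weighted_mean([1.02, 0.98, 1.05], [0.01, 0.02, 0.04])
    print(stat, stat_var)
```

Under these assumptions the returned variance is the reciprocal of the summed weights, which is the quantity a Gaussian model of the test statistic would use as its spread.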

In some embodiments of the present disclosure, the ploidy hypothesis may refer to a hypothesis concerning which chromosomes from other related individuals correspond to a chromosome found in the target individual's genome. Some embodiments utilize the fact that related individuals can be expected to share haplotype blocks, and using measured genetic data from related individuals, along with a knowledge of which haplotype blocks match between the target individual and the related individual, it is possible to infer the correct genetic data for a target individual with higher confidence than using the target individual's genetic measurements alone. As such, in some embodiments, the ploidy hypothesis may concern not only the number of chromosomes, but also which chromosomes in related individuals are identical, or nearly identical, with one or more chromosomes in the target individual.

An allelic hypothesis, or an “allelic state hypothesis” may refer to a hypothesis concerning a possible allelic state of a set of alleles. In some embodiments, the technique, algorithm, or method used utilizes the fact that, as described above, related individuals may share haplotype blocks, which may help the reconstruction of genetic data that was not perfectly measured. An allelic hypothesis can also refer to a hypothesis concerning which chromosomes, or chromosome segments, if any, from a related individual correspond genetically to a given chromosome from an individual. The theory of meiosis tells us that each chromosome in an individual is inherited from one of the two parents, and this is a nearly identical copy of a parental chromosome. Therefore, if the haplotypes of the parents are known, that is, the phased genotype of the parents, then the genotype of the child may be inferred as well. (The term child, here, is meant to include any individual formed from two gametes, one from the mother and one from the father.) In one embodiment of the present disclosure, the allelic hypothesis describes a possible allelic state, at a set of alleles, including the haplotypes, at a chromosome or chromosome segment of interest, as well as which chromosomes from related individuals may match the chromosome(s) which contain the set of alleles.

Once the set of hypotheses has been defined, the algorithms operate on the input genetic data and output a determined statistical probability for each of the hypotheses under consideration. For example, in an embodiment of the invention the method determines a probability value by comparing the genetic data to an expected result for each hypothesis, wherein the probability value indicates the likelihood that a sample has a certain number of copies of the chromosome or chromosome segment that is associated with the hypothesis.

The probabilities of the various hypotheses can be determined by mathematically calculating, for each of the various hypotheses, the value that the probability equals, as stated by one or more of the expert techniques, algorithms, and/or methods described elsewhere in this disclosure, using the relevant genetic data as input.

Once the probabilities of the different hypotheses are estimated, as determined by a plurality of techniques, they may be combined. This may entail, for each hypothesis, multiplying the probabilities as determined by each technique. The product of the probabilities of the hypotheses may be normalized. Note that one ploidy hypothesis refers to one possible ploidy state for a chromosome.

The process of “combining probabilities,” also called “combining hypotheses,” or combining the results of expert techniques, is a concept that should be familiar to one skilled in the art of linear algebra. In exemplary methods of the present invention, two methods are utilized for determining the presence or absence of aneuploidy or for determining the number of copies of a chromosome that each provide a probability. In certain illustrative embodiments, the confidence of the determination is increased by combining the confidences that are selected for each method. For example, a confidence for a first method that performs a quantitative allelic analysis, can be combined with a confidence from a second method that performs a quantitative non-allelic analysis.

In cases where the likelihoods are determined by a first method in a way that is orthogonal, or unrelated, to the way in which a likelihood is determined for a second method, combining the likelihoods is straightforward and can be done by multiplication and normalization, or by using a formula such as:

R_comb=R₁R₂/[R₁R₂+(1−R₁)(1−R₂)]

Where R_comb is the combined likelihood, and R₁ and R₂ are the individual likelihoods. In cases where the first and the second methods are not orthogonal, that is, where there is a correlation between the two methods, the likelihoods may still be combined, though the mathematics may be more complex.
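The combination formula above may be expressed in code as in the following minimal sketch, which assumes the two methods are orthogonal and that each likelihood lies between 0 and 1. The function name is hypothetical.

```python
def combine_two(r1: float, r2: float) -> float:
    """Combine two independent likelihoods for the same hypothesis.

    Implements R_comb = R1*R2 / [R1*R2 + (1 - R1)*(1 - R2)].
    """
    for r in (r1, r2):
        if not 0.0 <= r <= 1.0:
            raise ValueError("likelihoods must lie in [0, 1]")
    numerator = r1 * r2
    denominator = numerator + (1.0 - r1) * (1.0 - r2)
    if denominator == 0.0:
        # One method reports 0 and the other reports 1: no consistent combination.
        raise ValueError("the two likelihoods are in total contradiction")
    return numerator / denominator


if __name__ == "__main__":
    print(combine_two(0.9, 0.8))
```

For a pair of mutually exclusive hypotheses with likelihoods R and 1−R, this expression is the same as multiplying the likelihoods for each hypothesis and normalizing the products. A likelihood of 0.5 from one method leaves the other method's likelihood unchanged.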

In some embodiments, the first probability and the second probability are weighted differently prior to the step of combining the probabilities. In some embodiments the first probability and the second probability are considered independent events for the purposes of the step of combining the two probability values. In some embodiments the first probability and the second probability are considered dependent events for the purposes of the step of combining the two probability values. In some embodiments, the method further comprises obtaining a third probability value, wherein the third probability value indicates the likelihood that the genome of the target has the number of copies of the chromosome or chromosome segment associated with a specific hypothesis, and wherein the third probability value is derived from information from a non-genetic clinical assay. Many non-genetic clinical assays have a known probabilistic correlation with a specific chromosome copy number or chromosome segment copy number. For each hypothesis, the combined first and second probability values may be combined with the third probability value to give a combined probability value indicating the likelihood that the genome of the target cell has the number of copies of the chromosome or chromosome segment of interest, wherein that number is associated with the specific hypothesis. An example of such a non-genetic clinical assay is a nuchal translucency measurement. In some embodiments the non-genetic clinical assay is selected from the group consisting of measurements of: beta-human chorionic gonadotropin, pregnancy associated plasma protein A, estriol, inhibin-A, and alpha-fetoprotein.

Not to be limited by theory, the following disclosure further teaches how to combine probabilities. One possible way to combine probabilities is as follows: When an expert technique is used to evaluate a set of hypotheses given a set of genetic data, the output of the method is a set of probabilities that are associated, in a one-to-one fashion, with each hypothesis in the set of hypotheses. When a set of probabilities that were determined by a first expert technique, each of which are associated with one of the hypotheses in the set, are combined with a set of probabilities that were determined by a second expert technique, each of which are associated with the same set of hypotheses, then the two sets of probabilities are multiplied. This means that, for each hypothesis in the set, the two probabilities that are associated with that hypothesis, as determined by the two expert methods, are multiplied together, and the corresponding product is the output probability. This process may be expanded to any number of expert techniques. If only one expert technique is used, then the output probabilities are the same as the input probabilities. If more than two expert techniques are used, then the relevant probabilities may be multiplied at the same time. The products may be normalized so that the probabilities of the hypotheses in the set of hypotheses sum to 100%.
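The multiply-and-normalize procedure just described may be sketched as follows for any number of expert techniques. The function name and the dictionary representation (hypothesis label mapped to probability) are assumptions made for illustration.

```python
from math import prod
from typing import Dict


def combine_techniques(*probability_sets: Dict[str, float]) -> Dict[str, float]:
    """Multiply per-hypothesis probabilities across techniques, then normalize.

    Each argument maps every hypothesis in the shared set to the probability
    assigned by one expert technique.
    """
    if not probability_sets:
        raise ValueError("at least one set of probabilities is required")
    hypotheses = probability_sets[0].keys()
    for probs in probability_sets[1:]:
        if probs.keys() != hypotheses:
            raise ValueError("all techniques must cover the same hypotheses")
    products = {h: prod(p[h] for p in probability_sets) for h in hypotheses}
    total = sum(products.values())
    if total == 0:
        raise ValueError("every hypothesis has zero combined probability")
    return {h: value / total for h, value in products.items()}


if __name__ == "__main__":
    first = {"H11": 0.6, "H21": 0.3, "H10": 0.1}
    second = {"H11": 0.5, "H21": 0.4, "H10": 0.1}
    print(combine_techniques(first, second))
```

With a single technique whose probabilities already sum to one, the output equals the input, consistent with the description above.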

In some embodiments, if the combined probabilities for a given hypothesis are greater than the combined probabilities for any of the other hypotheses, then it may be considered that that hypothesis is determined to be the most likely. In some embodiments, a hypothesis may be determined to be the most likely, and the ploidy state, or other genetic state, may be called if the normalized probability is greater than a threshold. In one embodiment, this means that the number and identity of the chromosomes that are associated with that hypothesis may be called as the ploidy state. In one embodiment, this means that the identity of the alleles that are associated with that hypothesis are called as the allelic state. In some embodiments, the threshold is between about 50% and about 80%. In some embodiments the threshold is between about 80% and about 90%. In some embodiments the threshold is between about 90% and about 95%. In some embodiments the threshold is between about 95% and about 99%. In some embodiments the threshold is between about 99% and about 99.9%. In some embodiments the threshold is above 99.9%. In other embodiments, a set of rules are used for a final risk call for a sample wherein a combined probability threshold is set, but different scenarios can be considered and could override the results of the probability threshold, or used to enhance the calling ability of the combined probability. For example, if there is a wide disparity in probabilities for a given ploidy hypothesis, further analysis can be performed for example, to determine whether there was an error in one of the methods.
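A threshold-based call of the kind described above may be sketched as follows. The function name, the default threshold of 0.99, and the use of `None` to represent a no-call are assumptions for illustration; the additional override rules mentioned above are not modeled.

```python
from typing import Dict, Optional, Tuple


def call_ploidy(
    probabilities: Dict[str, float], threshold: float = 0.99
) -> Tuple[Optional[str], float]:
    """Return (called hypothesis or None, normalized probability of the best one).

    The most likely hypothesis is called only if its normalized probability
    exceeds `threshold`; otherwise no call is made.
    """
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    best = max(probabilities, key=probabilities.get)
    normalized = probabilities[best] / total
    return (best if normalized > threshold else None), normalized


if __name__ == "__main__":
    print(call_ploidy({"H11": 0.995, "H21": 0.005}))
    print(call_ploidy({"H11": 0.60, "H21": 0.40}))
```

Returning the normalized probability alongside the call lets a downstream rule set inspect borderline samples rather than discarding them.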

Some embodiments of the invention employ the step of producing a subset of patients from a larger set of patients. The original set of patients is used as the source of target cells and non-target cells for analysis. In some embodiments of the invention, the DNA samples obtained from the patients are modified using standard molecular biology techniques in order to be sequenced on the DNA sequencer. In some embodiments the technique will involve forming a genetic library containing priming sites for the DNA sequencing procedure. In some embodiments, a plurality of loci may be targeted for site specific amplification. In some embodiments the targeted loci are polymorphic loci, e.g., single nucleotide polymorphisms. In embodiments involving the formation of genetic libraries, libraries may be encoded using a DNA sequence that is specific for the patient, e.g. barcoding, thereby permitting multiple patients to be analyzed in a single flow cell (or flow cell equivalent) of a high throughput DNA sequencer. Although the samples are mixed together in the DNA sequencer flow cell, the determination of the sequence of the barcode permits identification of the patient source that contributed the DNA that had been sequenced.
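The use of barcodes to attribute pooled reads to their patient source may be illustrated with the following minimal sketch. It assumes, purely for illustration, that the barcode is a fixed-length prefix of each read and that barcodes are matched exactly; the function and variable names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def demultiplex(
    reads: Iterable[str], barcode_to_patient: Dict[str, str], barcode_len: int
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group reads by patient using an exact-match barcode prefix.

    Returns (reads per patient with the barcode removed, reads whose barcode
    matched no patient).
    """
    by_patient: Dict[str, List[str]] = defaultdict(list)
    unassigned: List[str] = []
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        patient = barcode_to_patient.get(barcode)
        if patient is None:
            unassigned.append(read)
        else:
            by_patient[patient].append(insert)
    return dict(by_patient), unassigned


if __name__ == "__main__":
    table = {"ACGT": "patient_1", "GGCC": "patient_2"}
    pooled = ["ACGTTTTT", "GGCCAAAA", "ACGTCCCC", "TTTTGGGG"]
    print(demultiplex(pooled, table, barcode_len=4))
```

Sequencing errors within the barcode are not handled here; a practical implementation would typically tolerate a small number of mismatches.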

It will be appreciated by those of ordinary skill in the art that in those embodiments of the invention in which the target DNA is not enriched for specific loci, the entire genome may be sequenced, although assembly of the sequence into a complete genome is not required for use of the subject methods. Information about specific loci may be readily determined from whole genome sequencing.

In one embodiment of the present disclosure, a confidence may be calculated on the accuracy of the determination of the ploidy state of the fetus. In one embodiment, the confidence of the hypothesis of greatest likelihood (H_major) may be calculated as (1−H_major)/Σ(all H). It is possible to determine the confidence of a hypothesis if the distributions of all of the hypotheses are known. It is possible to determine the distribution of all of the hypotheses if the parental genotype information is known. It is possible to calculate a confidence of the ploidy determination if the expected distribution of data for the euploid fetus and the expected distribution of data for the aneuploid fetus are known. It is possible to calculate these expected distributions if the parental genotype data are known. In one embodiment one may use the knowledge of the distribution of a test statistic around a normal hypothesis and around an abnormal hypothesis to determine both the reliability of the call as well as refine the threshold to make a more reliable call. This is particularly useful when the amount and/or percent of fetal DNA in the mixture is low. It will help to avoid the situation where a fetus that is actually aneuploid is found to be euploid because a test statistic, such as the Z statistic, does not exceed a threshold that is optimized for the case where there is a higher percent fetal DNA.

An Example of a Quantitative Non-Allelic Maximum Likelihood Method (“QMM”)

An example of a quantitative method that may be used to determine the number of copies of a chromosome of interest in a target individual is provided here. Note that this example involves normalization of the target chromosome data using a reference chromosome that is the same as the target chromosome (i.e. chromosome of interest), but found in other samples processed in a similar or identical manner. The instant method is described in the context of non-invasive prenatal aneuploidy testing, where the target individual is a fetus, and the DNA that is sequenced comprises fetal DNA, and in some cases, maternal DNA, for example as found in the maternal plasma. Non-invasive prenatal aneuploidy testing attempts to determine the chromosome copy number of a fetus based on the free-floating fetal DNA in maternal plasma. In the quantitative method, chromosome copy number classification is based on the number of sequence reads which map to each chromosome. Neither parental genotype nor allelic information is used, except possibly to estimate the fetal fraction in the plasma. In this targeted sequencing approach, the number of sequence reads at each targeted SNP (single nucleotide polymorphism) is informative, in contrast to untargeted sequencing approaches that tend to use a sliding window average depth of read, or similar averaged approach. Based on the estimated fetal fraction, a maximum likelihood estimate is calculated based on the set of copy number hypotheses including monosomy, disomy, and trisomy. In this example, chromosome segmental errors are not considered, meaning that all positions on the same chromosome are assumed to have the same copy number. It should be clear to one of ordinary skill in the art how to apply this method to chromosome segment copy number variants. One may also incorporate non-uniform fragmentation of the fetal or maternal genome; this is not done here.

Modeling an individual SNP: A fundamental assumption in this method is that the number of sequence reads generated at a genome position depends primarily on the number of genome copies of that position going into the sequencing process. The targeted sequencing approach is based on multiplexed PCR, which means that the number of genome copies going into sequencing is determined both by the chromosome copy number in the original sample, and the details of the PCR amplification process. Thus, this method requires simplified models of both multiplex PCR and high throughput sequencing.

One may assume that in the original sample, the amount of genome copies is the same at all positions, except due to variations in chromosome copy number. However, in the PCR process, each targeted position is amplified with a different efficiency. For each of k PCR cycles, a position i is amplified by a factor a_i. The number of observed reads at the position is x_i. This model can be written as in equation 1, where the sample factor c_s is constant per sample, and represents a sample parameter, for example the initial quantity of DNA and the total number of sequence reads. It can be thought of as the sample-specific amplification factor. The chromosome copy number n_i is the ploidy state or copy number of the chromosome where position i is located.

x_i=c_sn_ia_i^k (1)

However, slight variations in experimental conditions mean that the amplification efficiencies of the various PCR targets are not perfectly constant. This is represented by a multiplicative noise term ϵ_i, for the amplification efficiency of each target. The model is thus extended to equation 2.

x_i=c_sn_i(a_iϵ_i)^k (2)

Due to the multiplicative nature of the model, it is advantageous to work in log space, and then consider the expectation and the variance of log x_i. One may assume that the expectation of the log noise is zero. This is not quite the same as assuming zero-mean noise, but it makes the math feasible, shown in equation 3.

E log x_i=log c_s+log n_i+k log a_i

V log x_i=k²V log ϵ_i (3)

Sample normalization can be achieved by considering reads measured from positions located on chromosomes which are known, assumed, or hypothesized to have copy number equal to two. There are other methods of sample normalization such as using other reference chromosomes, for example chromosomes 1 and 2, which are known to be disomic. Let D be the set of positions i which are located on chromosomes assumed to be disomic. The sample normalizer T_s is defined as the average log count over positions i in D, detailed in equation 4. This can be measured directly from each sample, and so will be considered a known quantity for further calculations.

$\begin{matrix} \begin{matrix} T_{s} = E_{i \in D} \log x_{i} \\ = \log c_{s} + \log 2 + k E_{i \in D} \log a_{i} \end{matrix} & (4) \end{matrix}$
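
By way of non-limiting illustration, the sample normalizer of equation 4 may be sketched in Python as the mean log read count over the positions assumed to be disomic. The function name `sample_normalizer` and the example read counts are hypothetical and are not part of the described method.

```python
import math

def sample_normalizer(read_counts, disomic_positions):
    """Return T_s, the mean log read count over positions assumed disomic (equation 4)."""
    logs = [math.log(read_counts[i]) for i in disomic_positions]
    return sum(logs) / len(logs)

# Hypothetical read counts for four targeted positions in one sample;
# positions 0-2 are assumed to lie on disomic chromosomes.
counts = [100, 200, 400, 50]
t_s = sample_normalizer(counts, [0, 1, 2])
print(t_s)
```

Because T_s is an average in log space, exp(T_s) is the geometric mean of the read counts at the disomic positions.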

Constructing a model from training data: A model for the efficiency of individual SNPs can be constructed from a set of training data with known chromosome copy number and fetal fraction. In the ideal case, plasma is collected from (euploid) women who are not pregnant, and so the fetal fraction is zero and there are no aneuploidies. In this case, all samples contribute data for the model of all targets. In the more difficult case, pregnancy plasma with known chromosome copy number is used, and aneuploid samples are excluded from the data set. Thus, the model is still constructed from data where all chromosomes have the same copy number relative to disomy.

Let y_i be the logspace normalized depth of read at position i. One may define β_i as the average of y_i over the set of samples (equation 5). The term β_i is the logspace amplification model for position i, which measures how its amplification efficiency compares to the average amplification efficiency for positions on disomic chromosomes.

$\begin{matrix} \begin{matrix} y_{i} = \log x_{i} - T_{s} \\ = k \log a_{i} + k \log ϵ_{i} - k E_{i \in D} \log a_{i} \end{matrix} & (5) \\ \begin{matrix} β_{i} = E_{s} y_{i} \\ = k \log a_{i} - k E_{i \in D} \log a_{i} \end{matrix} \end{matrix}$

Similarly, σ_i is defined as the standard deviation across samples of y_i. Combined, the set of β_i and the set of σ_i form the amplification model and the variance model for the set of SNPs i.
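
As a minimal sketch of the model construction described above, the following Python code normalizes each training sample by its own T_s and then takes the per-position mean and standard deviation across samples. The function name, the input layout (one list of read counts per sample), the example counts, and the use of the sample (n−1) standard deviation are all assumptions made for illustration only.

```python
import math
import statistics

def train_amplification_model(training_counts, disomic_positions):
    """Return (beta, sigma) per position from euploid training samples.

    training_counts: one list of read counts per training sample (assumed layout).
    """
    normalized = []
    for counts in training_counts:
        logs = [math.log(x) for x in counts]
        # Sample normalizer T_s: mean log count over disomic positions.
        t_s = sum(logs[i] for i in disomic_positions) / len(disomic_positions)
        normalized.append([v - t_s for v in logs])  # y_i for this sample
    columns = list(zip(*normalized))  # one tuple of y_i values per position
    beta = [statistics.mean(col) for col in columns]
    sigma = [statistics.stdev(col) for col in columns]  # sample std (assumption)
    return beta, sigma

# Hypothetical training set: three samples, three positions, all assumed disomic.
training = [[100, 200, 400], [110, 190, 420], [95, 210, 380]]
beta, sigma = train_amplification_model(training, [0, 1, 2])
print(beta, sigma)
```

When every position is in D, the β_i average to zero by construction, which is consistent with β_i measuring efficiency relative to the disomic average.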

There are a number of subtleties associated with the model calculation. Most importantly, the model does not remain constant for a fixed set of targets subjected to a fixed protocol.

Although the models will be quite similar, attempts to use a fixed model across multiple sequencing runs have suffered from biases which are large enough to affect results at low fetal fraction, and which may be eliminated by training separately for separate experiments. As a result, in some embodiments, it is important to ensure that each sequencing run contains a sufficient number of samples for modeling.

Even within an experiment, there are typically a number of samples which do not fit the model. These are often but not always explained by locus dropout, which is discussed in more detail in a later section. Outlier samples are not well predicted by quality control metrics such as contamination level, spike ratio (a measure of DNA starting quantity), fetal fraction, or overall depth of read. A sample is tested for goodness of fit by calculating the residual z_i on each SNP with respect to the amplification and noise models.

z_i=(log x_i−T_s−β_i)/σ_i (6)

Under the further assumption that log ϵ_i is not just zero-mean, but Gaussian, z_i should be distributed according to the standard normal. The set of disomy-chromosome residuals Z={z_i:i∈D} is analyzed as an approximate metric for model fit. Regardless of fetal fraction or chromosome copy number, Z should be distributed according to the standard normal. A Kolmogorov-Smirnov (KS) test is used to measure goodness of fit of the residuals. The modeling process is implemented in an iterative fashion, where each iteration includes a recalculation of the model, followed by a KS test for the model fit of each sample. Outlier samples are removed from the training set at each iteration until the membership converges to a constant set.
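
The residual of equation 6 and the KS comparison to the standard normal may be sketched as follows. This illustrative Python code computes only the one-sample KS statistic (the largest gap between the empirical CDF of the residuals and the standard normal CDF); converting the statistic to a p-value, and the choice of rejection threshold, are left to a statistics library and are not specified here. The function names are hypothetical.

```python
import math

def residuals(counts, t_s, beta, sigma):
    """z_i = (log x_i - T_s - beta_i) / sigma_i for each position (equation 6)."""
    return [(math.log(x) - t_s - b) / s for x, b, s in zip(counts, beta, sigma)]

def ks_statistic_standard_normal(z):
    """One-sample KS statistic of z against the standard normal CDF."""
    ordered = sorted(z)
    n = len(ordered)
    d = 0.0
    for rank, value in enumerate(ordered, start=1):
        cdf = 0.5 * (1.0 + math.erf(value / math.sqrt(2.0)))
        # Compare the normal CDF with the empirical CDF just after and just before value.
        d = max(d, rank / n - cdf, cdf - (rank - 1) / n)
    return d

# Hypothetical residuals for the disomic positions of one sample.
z = [-1.2, -0.4, 0.1, 0.5, 1.3]
print(ks_statistic_standard_normal(z))
```

A small statistic indicates residuals that are close to standard normal; a sample with a large statistic would be a candidate for removal from the training set in the iterative procedure described above.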

Forming a test statistic and modeling SNP correlation: A test statistic for chromosome copy number classification can be formed by averaging the normalized measurements at all positions on a chromosome. A variance-weighted mean is selected in order to minimize the variance of the test statistic. Consider the normalized measurement y_i, defined above. For a position on a chromosome with unknown copy number n_i, y_i has the properties described in equation 7.

$\begin{matrix} E y_{i} = \log \frac{n_{i}}{2} + β_{i}, \quad V y_{i} = σ_{i}^{2} & (7) \end{matrix}$

Let S be the set of positions on the current chromosome. The chromosome test statistic t is defined as the variance-weighted mean of y_i, averaged across SNPs i in S.

$\begin{matrix} t = \frac{\sum_{i \in S} \frac{y_{i}}{σ_{i}^{2}}}{\sum_{i \in S} \frac{1}{σ_{i}^{2}}} & (8) \end{matrix}$
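
For illustration only, the variance-weighted mean of equation 8 can be computed as below; each measurement is weighted by the inverse of its model variance. The function name and the example values are hypothetical.

```python
def weighted_mean_statistic(y, sigma):
    """Variance-weighted mean t of the normalized measurements y_i (equation 8)."""
    weights = [1.0 / (s * s) for s in sigma]
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)

# Hypothetical normalized measurements and per-position standard deviations.
t = weighted_mean_statistic([0.0, 4.0], [1.0, 2.0])
print(t)
```

Positions with a smaller σ_i contribute more to t, which is what minimizes the variance of the statistic when the measurements are uncorrelated.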

The expectation of t will be calculated under each of the chromosome copy number hypotheses to form Gaussian models for the maximum likelihood estimate. The variance of the model for each hypothesis does not follow uniquely from the assumptions made previously, which have not considered correlation between measurements. The simplest assumption of uncorrelated measurements was discarded because the observed variances on t were much higher than that model would suggest. Without suggesting any physical explanation for correlation, a single-parameter correlation model is proposed in which the covariance of y_i with y_j is ρσ_iσ_j, corresponding to a constant correlation factor between all positions i and j on the same chromosome. This model uses a single parameter to represent the additional variance beyond what would be implied by the uncorrelated model. The variance of t using the constant correlation model is shown in equation 9, which follows directly from the formula for the variance of a sum of normal distributions with known correlation. (The assumption of Gaussian noise is continued throughout.)

$\begin{matrix} V t = \left(\sum_{i} \frac{1}{σ_{i}^{2}}\right)^{- 2} \left(ρ \sum_{i} \sum_{j} \frac{1}{σ_{i} σ_{j}} + (1 - ρ) \sum_{i} \frac{1}{σ_{i}^{2}}\right) & (9) \end{matrix}$
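
Equation 9 may be evaluated directly, as in the following sketch. The double sum over i and j of 1/(σ_iσ_j) equals the square of the sum of 1/σ_i, which is used here to avoid a nested loop. The function name is hypothetical.

```python
def variance_of_t(sigma, rho):
    """Variance of the weighted-mean statistic under constant correlation (equation 9)."""
    inverse = [1.0 / s for s in sigma]
    sum_inverse_sq = sum(v * v for v in inverse)  # sum_i 1 / sigma_i^2
    double_sum = sum(inverse) ** 2                # sum_i sum_j 1 / (sigma_i sigma_j)
    return (rho * double_sum + (1.0 - rho) * sum_inverse_sq) / sum_inverse_sq ** 2

# Two positions with unit standard deviation, uncorrelated and fully correlated.
print(variance_of_t([1.0, 1.0], 0.0), variance_of_t([1.0, 1.0], 1.0))
```

With ρ=0 the result reduces to the familiar 1/Σ(1/σ_i²) of independent measurements, and it grows as ρ increases, reflecting the extra variance that the correlation model is intended to capture.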

A maximum likelihood estimate of ρ for each chromosome is calculated from the same modeling data following the estimation of {β_i} and {σ_i}.

Chromosome copy number classification consists of the following steps which make use of the modeling developed in the sections above.

1. Confirm model fit. Using the disomy chromosomes (one and two) a set of residuals is calculated with respect to the provided model, and a KS test is used to compare them to the standard normal distribution. If the resulting p-value is too low, the sample is considered not to fit the model, and cannot be classified.

2. Copy number hypothesis generation. Using the supplied fetal fraction, the plasma copy number is calculated corresponding to each fetal copy number hypothesis. For fetal copy number hypotheses {h₁, h₂, h₃}={1, 2, 3}, the plasma copy number hypotheses are calculated using the fetal fraction according to equation 10. The plasma copy number is a mixture of the fetal copy number, which depends on the hypothesis, and the maternal copy number, which is two.

n_i=fh_i+2(1−f) (10)

3. Hypothesis modeling. An expected value for the test statistic is calculated for the value of n_i corresponding to the ploidy hypotheses. This is done according to equation 7 and the definition of the test statistic. The variance model for the test statistic does not depend on the hypothesis.

4. Calculate likelihoods. The value of the test statistic is observed for the current chromosome. The data likelihood of each hypothesis is the likelihood of the test statistic under each of the corresponding normal distributions. The maximum likelihood estimate can then be reported, or normalized using priors.
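
Steps 2 through 4 above can be sketched together as follows, assuming the model fit of step 1 has already been confirmed. The expected value of t under each hypothesis is taken as log(n/2) plus the variance-weighted mean of the β_i, which follows from equations 7 and 8; the variance is that of equation 9. The function name, argument layout, and example values are hypothetical.

```python
import math

def classify_copy_number(t, beta, sigma, rho, fetal_fraction):
    """Return (most likely fetal copy number, likelihood of each hypothesis)."""
    weights = [1.0 / (s * s) for s in sigma]
    beta_bar = sum(w * b for w, b in zip(weights, beta)) / sum(weights)
    inverse_sum = sum(1.0 / s for s in sigma)
    var_t = (rho * inverse_sum ** 2 + (1.0 - rho) * sum(weights)) / sum(weights) ** 2
    likelihoods = {}
    for h in (1, 2, 3):
        # Plasma copy number for fetal copy number h (equation 10).
        n = fetal_fraction * h + 2.0 * (1.0 - fetal_fraction)
        mean = math.log(n / 2.0) + beta_bar
        # Gaussian likelihood of the observed statistic under this hypothesis.
        likelihoods[h] = (math.exp(-(t - mean) ** 2 / (2.0 * var_t))
                          / math.sqrt(2.0 * math.pi * var_t))
    best = max(likelihoods, key=likelihoods.get)
    return best, likelihoods

# Hypothetical chromosome with two SNPs and a 10% fetal fraction.
call, lik = classify_copy_number(math.log(1.05), [0.0, 0.0], [0.1, 0.1], 0.0, 0.1)
print(call)
```

The returned likelihoods are unnormalized with respect to priors; they may be reported as is for a maximum likelihood call or multiplied by hypothesis priors and renormalized.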

Copy number classification without non-target reference chromosomes (also referred to as a “QMM” method)

As mentioned above, it is possible to identify copy number without using reference chromosomes or chromosome segments that are different than the target chromosome or chromosome segment, such that none of the chromosomes or chromosome segments can be assumed to have known copy numbers. This requires an alternate way of estimating the sample normalizer T_s and the linear shift parameter α_s, which are conditioned on the chromosome number hypotheses. Unlike the approach that uses copy number hypotheses for each individual chromosome, this hypothesis space contains joint hypotheses of all the training chromosomes.

In an embodiment, in order to connect the joint hypothesis to the individual hypothesis, the following technique may be used. For a training chromosome k∈{13, 18, 21}, let p(D|h_k), h_k∈{1, 2, 3} be the pdf of the data conditioned on the individual copy number hypothesis of that chromosome. So, for example, for chromosome 13 it would be:

$p (D | h_{13}) = \sum_{h_{18}} \sum_{h_{21}} p (D_{13} | h_{18}, h_{21}, h_{13}) p (D_{18} | h_{18}, h_{21}, h_{13}) p (D_{21} | h_{18}, h_{21}, h_{13}) P (h_{18}) P (h_{21})$

Assuming equal priors for the hypothesis probabilities, i.e., P(h_k=1)=P(h_k=2)=P(h_k=3)=⅓, the above pdf is computed. To compute p(D₁₃|h₁₈, h₂₁, h₁₃), the T_s and α_s estimates corresponding to the hypothesis (h₁₈, h₂₁, h₁₃) are used, and a variance weighted mean test statistic is computed. Similarly, the respective pdfs of the other training chromosomes, p(D|h₁₈), p(D|h₂₁) are computed. Since equal priors are assumed, the posterior probabilities are also computed:

$P (h_{k} | D) = \frac{p (D | h_{k})}{\sum_{h_{j} \in \{1, 2, 3\}} p (D | h_{j})}, \quad \forall k \in \{13, 18, 21\} .$

This represents a normalizing step which provides confidences for each of the training chromosomes.
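
The marginalization and normalization described above may be sketched as follows. The sketch assumes that the joint data likelihood p(D|h₁₃, h₁₈, h₂₁), i.e., the product of the three per-chromosome pdfs evaluated with the T_s and α_s estimates for that joint hypothesis, has already been computed and stored in a dictionary; how those values are obtained is outside this example. The function name and the toy values are hypothetical.

```python
from itertools import product

def marginal_posteriors(joint_likelihood, prior=1.0 / 3.0):
    """Return one posterior dict {1, 2, 3 -> P(h_k | D)} per training chromosome.

    joint_likelihood maps (h13, h18, h21) to p(D | h13, h18, h21) (assumed layout).
    """
    posteriors = []
    for axis in range(3):  # 0: chr 13, 1: chr 18, 2: chr 21
        marginal = {h: 0.0 for h in (1, 2, 3)}
        for hypothesis in product((1, 2, 3), repeat=3):
            # Sum over the other two chromosomes, each weighted by its equal prior.
            marginal[hypothesis[axis]] += joint_likelihood[hypothesis] * prior * prior
        total = sum(marginal.values())
        posteriors.append({h: v / total for h, v in marginal.items()})
    return posteriors

# Toy joint likelihood concentrated on (h13, h18, h21) = (2, 2, 3).
joint = {h: 0.0 for h in product((1, 2, 3), repeat=3)}
joint[(2, 2, 3)] = 1.0
post13, post18, post21 = marginal_posteriors(joint)
print(post21)
```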

Next, confidences of the rest of the chromosomes are computed. For this, an estimate of the joint hypothesis of the training chromosomes is obtained:

(ĥ₁₃, ĥ₁₈, ĥ₂₁)=arg max_{h₁₃, h₁₈, h₂₁} p(D|h₁₃, h₁₈, h₂₁)

The T_s and α_s estimates corresponding to this hypothesis (ĥ₁₃, ĥ₁₈, ĥ₂₁) can then be used to compute the variance weighted mean test statistic for each of the test chromosomes.

In this method, a constant correlation coefficient model can be used to model the inter-SNP correlations of a particular chromosome. For example, for a particular chromosome k, the covariance of y_i and y_j is ρ_kσ_iσ_j, as discussed above. If chromosome k has N_k loci, a covariance matrix is given by:

C(ρ_k)=(1−ρ_k)×diag(σ_k²)+ρ_k×σ_kσ_k^T

This represents a matrix with the values σ_i² on the main diagonal and off-diagonal elements ρ_kσ_iσ_j. This can also be used to determine the maximum likelihood estimates for each of T_s and α_s.
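
As an illustrative sketch, the constant-correlation covariance matrix can be built element by element; the diagonal terms (1−ρ_k)σ_i²+ρ_kσ_i² simplify to σ_i². The function name and example values are hypothetical.

```python
def constant_correlation_covariance(sigma, rho):
    """C(rho) = (1 - rho) * diag(sigma^2) + rho * sigma sigma^T as a nested list."""
    n = len(sigma)
    return [[sigma[i] * sigma[j] * (1.0 if i == j else rho) for j in range(n)]
            for i in range(n)]

# Two loci with standard deviations 1 and 2 and correlation 0.5.
cov = constant_correlation_covariance([1.0, 2.0], 0.5)
print(cov)
```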

An example of a quantitative allelic maximum likelihood method (“het rate”)

Provided herein are methods for determining the ploidy state using an allelic maximum likelihood method. The method will be illustrated in the context of NIPT, but a skilled artisan will appreciate that it can be utilized in detection of circulating free tumor cells. In addition to the discussion below, detailed examples of how to implement a het rate method can be found, among other places, in published US patent application US 2012/0270212 A1 and published US patent application US 2011/0288780 A1, all of which are herein incorporated in their entirety by reference. However, the het rate methods disclosed in these sources utilize data from separate reference chromosomes.

In the NIPT example, the ploidy state of a fetus is determined given sequence data that was measured on free floating DNA isolated from maternal blood, wherein the free floating DNA contains some DNA of maternal origin, and some DNA of fetal/placental origin. In this example the ploidy state of the fetus is determined using an allelic maximum likelihood method and a calculated fraction of fetal DNA in the mixture that has been analyzed. This section also describes an embodiment in which the fraction of fetal DNA or the percentage of fetal DNA in the mixture can be measured. In some embodiments the fraction can be calculated using only the genotyping measurements made on the maternal blood sample itself, which is a mixture of fetal and maternal DNA. In some embodiments the fraction may be calculated also using the measured or otherwise known genotype of the mother and/or the measured or otherwise known genotype of the father.

For a particular chromosome, suppose there are N SNPs, for which the following are available:

Parent genotypes from ILLUMINA data, assumed to be correct: mother m=(m₁, . . . ,m_N), father f=(f₁, . . . ,f_N), where m_i, f_i∈{AA, AB, BB}.

Set of NR sequence measurements S=(s₁, . . . ,s_NR).

Deriving most likely copy number from data

For each copy number hypothesis H considered, derive data log likelihood LIK(H) on a whole chromosome and choose the best hypothesis maximizing LIK, i.e.

$H^{*} = \underset{H}{argmax} LIK (H | D) = \underset{H}{argmax} LIK (D | H) P (H),$

where P(H) is a prior probability of the hypothesis, from prior knowledge or estimate.

Copy number hypotheses considered are:

Monosomy:

maternal H10(one copy from mother)

paternal H01(one copy from father)

Disomy: H11(one copy each mother and father)

Simple trisomy, no crossovers considered:

Maternal: H21_matched (two identical copies from mother, one copy from father), H21_unmatched (BOTH copies from mother, one copy from father)

Paternal: H12_matched (one copy from mother, two identical copies from father), H12_unmatched (one copy from mother, both copies from father)

Composite trisomy, allowing for crossovers (using a joint distribution model):

maternal H21 (two copies from mother, one from father),

paternal H12 (one copy from mother, two copies from father)

If there were no crossovers, each trisomy, whether the origin was mitosis, meiosis I, or meiosis II, would be one of the matched or unmatched trisomies. Due to crossovers, true trisomy is a combination of the two. First, a method to derive hypothesis likelihoods for simple hypotheses is described. Then a method to derive hypothesis likelihoods for composite hypotheses is described, combining individual SNP likelihood with crossovers. Initially, it is assumed that the true child fraction and other parameters such as beta noise parameter (N) and possible error rates are known. A method for deriving child fraction cf from data is also discussed below.

LIK(D|H) for Simple Hypotheses

For simple hypotheses H, LIK(D|H), the log likelihood of data given hypothesis H on a whole chromosome, is calculated as the sum of log likelihoods of individual SNPs, i.e.

$LIK (D | H) = \sum_{i} LIK (D | H, cf, i)$

This hypothesis does not assume any linkage between SNPs, and therefore does not utilize a joint distribution model.

Log Likelihood Per SNP

On a particular SNP i, define m_i=true mother genotype, f_i=true father genotype, and cf=known or derived child fraction. Let x_i=P(A|i,S) be the probability of having an A on SNP i, given the sequence measurements S. Assuming child hypothesis H, the log likelihood of observed data D on SNP i is defined as

LIK(i, H)=log lik(x_i|m_i, f_i, H, cf)=Σ_c p(c|m_i, f_i, H)*log lik(x_i|m_i, c, cf),

where p(c|m, f, H) is the probability of getting true child genotype=c, given parents m, f, and assuming hypothesis H, which can be easily calculated. For example, for H11, H21 matched and H21 unmatched, p(c|m,f,H) is given below.

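p(c|m, f, H)

| m | f | H11: AA | H11: AB | H11: BB | H21 matched: AAA | H21 matched: AAB | H21 matched: ABB | H21 matched: BBB | H21 unmatched: AAA | H21 unmatched: AAB | H21 unmatched: ABB | H21 unmatched: BBB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AA | AA | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| AB | AA | 0.5 | 0.5 | 0 | 0.5 | 0 | 0.5 | 0 | 0 | 1 | 0 | 0 |
| BB | AA | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| AA | AB | 0.5 | 0.5 | 0 | 0.5 | 0.5 | 0 | 0 | 0.5 | 0.5 | 0 | 0 |
| AB | AB | 0.25 | 0.5 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0.5 | 0.5 | 0 |
| BB | AB | 0 | 0.5 | 0.5 | 0 | 0 | 0.5 | 0.5 | 0 | 0 | 0.5 | 0.5 |
| AA | BB | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| AB | BB | 0 | 0.5 | 0.5 | 0 | 0.5 | 0 | 0.5 | 0 | 0 | 1 | 0 |
| BB | BB | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |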
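
The entries of p(c|m, f, H) for these three hypotheses can be derived by enumerating the equally likely parental contributions, as in the following sketch. Genotypes are written as strings such as "AB"; the hypothesis labels and the function name are hypothetical conventions chosen for this example.

```python
from collections import Counter

def child_genotype_probs(mother, father, hypothesis):
    """Return {child genotype: probability} for H11, H21_matched or H21_unmatched."""
    if hypothesis == "H11":
        maternal = [(a,) for a in mother]      # one maternal homolog, each equally likely
    elif hypothesis == "H21_matched":
        maternal = [(a, a) for a in mother]    # two identical copies of one homolog
    elif hypothesis == "H21_unmatched":
        maternal = [tuple(mother)]             # both maternal homologs
    else:
        raise ValueError("unsupported hypothesis: " + hypothesis)
    weight = 1.0 / (len(maternal) * len(father))
    probs = Counter()
    for maternal_part in maternal:
        for paternal_allele in father:         # one paternal homolog, each equally likely
            genotype = "".join(sorted(maternal_part + (paternal_allele,)))
            probs[genotype] += weight
    return dict(probs)

print(child_genotype_probs("AB", "AA", "H21_matched"))
```

Genotypes with zero probability are simply absent from the returned dictionary.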

P(D|m,f,c,H,i,cf) is the probability of the given data D on SNP i, given true mother genotype m, true father genotype f, true child genotype c, hypothesis H, and child fraction cf. It can be broken down into probability of mother, father, and child data as follows:

lik(x_i|m, c, cf) is the likelihood of getting derived probability x_i on SNP i, assuming true mother m, true child c, defined as pdfx(x_i) of the distribution that x_i should be following if hypothesis H were true. In particular lik(x_i|m,c,cf)=pdfx(x_i).

In a simple case where D_i of the NR sequences in S line up to SNP i, X∼(1/D_i)Bin(p,D_i), where p=p(A|m,c,cf)=probability of getting an A, for this mother/child mixture, calculated as:

${Hetrate}_{A} = p (A | m, c, cf) = \frac{\# A (m) * (1 - {cf}_{correct}) + \# A (c) * {cf}_{correct}}{n_{m} * (1 - {cf}_{correct}) + n_{c} * {cf}_{correct}}$

where #A(g)=number of A's in genotype g, n_m=2 is the somy of the mother and n_c is the somy of the child (1 for monosomy, 2 for disomy, 3 for trisomy). The initial cf may be determined using, for example, an allele ratio plot.

cf_correct is the corrected fraction of the child in the mixture:

${cf}_{correct} = cf * \frac{n_{c}}{n_{m} * (1 - cf) + n_{c} * cf}$

If the child is disomic, cf_correct=cf, but for a trisomy the fraction of the child in the mix for this chromosome is actually a bit higher:

${cf}_{correct} = cf * \frac{3}{2 + cf} .$
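
The two formulas above may be evaluated as in the following sketch, which implements them exactly as written. Genotypes are given as strings (for example "AA" for the mother and "AAB" for a trisomic child), so that the somy is the string length; this representation and the function names are assumptions made for the example.

```python
def corrected_child_fraction(cf, n_child, n_mother=2):
    """cf_correct = cf * n_c / (n_m * (1 - cf) + n_c * cf)."""
    return cf * n_child / (n_mother * (1.0 - cf) + n_child * cf)

def expected_a_rate(mother, child, cf):
    """p(A | m, c, cf) for a mother/child mixture, following the Hetrate_A formula."""
    n_m, n_c = len(mother), len(child)
    cf_correct = corrected_child_fraction(cf, n_c, n_m)
    numerator = mother.count("A") * (1.0 - cf_correct) + child.count("A") * cf_correct
    denominator = n_m * (1.0 - cf_correct) + n_c * cf_correct
    return numerator / denominator

# A 10% child fraction: disomic child leaves cf unchanged, trisomic child raises it.
print(corrected_child_fraction(0.1, 2), corrected_child_fraction(0.1, 3))
print(expected_a_rate("AA", "AB", 0.1))
```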

In a more complex case where there is not exact alignment, X is a combination of binomials integrated over possible D_ireads per SNP.

Using A Joint Distribution Model: LIK(H) for a Composite Hypothesis

In real life, trisomy is usually not purely matched or unmatched, due to crossovers, so in this section results for composite hypotheses H21 (maternal trisomy) and H12 (paternal trisomy) are derived, which combine matched and unmatched trisomy, accounting for possible crossovers.

In the case of trisomy, if there were no crossovers, trisomy would be simply matched or unmatched trisomy. Matched trisomy is where child inherits two copies of the identical chromosome segment from one parent. Unmatched trisomy is where child inherits one copy of each homologous chromosome segment from the parent. Due to crossovers, some segments of a chromosome may have matched trisomy, and other parts may have unmatched trisomy. Described in this section is how to build a joint distribution model for the heterozygosity rates for a set of alleles.

Suppose that on SNP i, LIK(i, Hm) is the fit for matched hypothesis H, and LIK(i, Hu) is the fit for unmatched hypothesis H, and pc(i)=probability of crossover between SNPs i−1 and i. One may then calculate the full likelihood as:

LIK(H)=Σ_{S,E} LIK(S, E, 1:N)

where LIK(S, E, 1:N) is the likelihood starting with hypothesis S, ending in hypothesis E, for SNPs 1:N. S=hypothesis of the first SNP, E=hypothesis of the last SNP, S,E∈{Hm, Hu}. Recursively one may calculate:

LIK(S, E, 1:i)=LIK(i, E)+log(exp(LIK(S, E, 1:i−1))*(1−pc(i))+exp(LIK(S, ˜E, 1:i−1))*pc(i))

where ˜E is the other hypothesis (not E). In particular, one may calculate the likelihood of 1:i SNPs, based on likelihood of 1:(i−1) SNPs with either the same hypothesis and no crossover or the opposite hypothesis and a crossover, times the likelihood of the SNP i.

For SNP i=1:

$LIK (S, E, 1 : 1) = \begin{cases} LIK (1, S) & \text{if } S = E \\ 0 & \text{if } S \neq E \end{cases}$

Then calculate:

LIK(S, E, 1:2)=LIK(2, E)+log(exp(LIK(S, E, 1))*(1−pc(2))+exp(LIK(S, ˜E, 1))*pc(2))

and so on until i=N.
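
The recursion above may be sketched in Python as follows. Two interpretive assumptions are made and should be read as such: the value 0 for S≠E at the first SNP is treated as zero probability (negative infinity in log space), and the final sum over S and E is carried out in probability space and returned as a log. The function names and example values are hypothetical.

```python
import math

def log_mix(a, b, pc):
    """log(exp(a) * (1 - pc) + exp(b) * pc), computed stably in log space."""
    m = max(a, b)
    if m == -math.inf:
        return -math.inf
    inner = math.exp(a - m) * (1.0 - pc) + math.exp(b - m) * pc
    return m + math.log(inner) if inner > 0.0 else -math.inf

def composite_log_likelihood(lik_matched, lik_unmatched, p_cross):
    """Combine per-SNP log likelihoods LIK(i, Hm) and LIK(i, Hu) allowing crossovers.

    p_cross[i] is the crossover probability between SNPs i-1 and i; p_cross[0] is unused.
    """
    per_snp = {"Hm": lik_matched, "Hu": lik_unmatched}
    other = {"Hm": "Hu", "Hu": "Hm"}
    terms = []
    for start in per_snp:
        # state[E] holds LIK(start, E, 1:i); a path cannot end in E != start at the first SNP.
        state = {e: (per_snp[e][0] if e == start else -math.inf) for e in per_snp}
        for i in range(1, len(lik_matched)):
            state = {e: per_snp[e][i] + log_mix(state[e], state[other[e]], p_cross[i])
                     for e in per_snp}
        terms.extend(state.values())
    peak = max(terms)
    if peak == -math.inf:
        return -math.inf
    # Assumption: the sum over (S, E) is taken in probability space.
    return peak + math.log(sum(math.exp(v - peak) for v in terms))

# Two SNPs with hypothetical per-SNP likelihoods 0.5 (matched) and 0.25 (unmatched).
matched = [math.log(0.5), math.log(0.5)]
unmatched = [math.log(0.25), math.log(0.25)]
print(math.exp(composite_log_likelihood(matched, unmatched, [0.0, 0.5])))
```

With a crossover probability of zero the result reduces to the sum of the purely matched and purely unmatched likelihoods, which is a convenient sanity check.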

Deriving Child Fraction

The above formulas assume a known child fraction, which is not always the case. In one embodiment, it is possible to find the most likely child fraction by maximizing the likelihood for disomy on selected chromosomes.

In particular, suppose that LIK(chr, H11, cf) is the log likelihood as described above for the disomy hypothesis and for child fraction cf on chromosome chr. For selected chromosomes in Cset (usually 1:16), the full likelihood is:

$LIK (cf) = \sum_{chr \in Cset} LIK (chr, H11, cf), \quad \text{and} \quad {cf}^{*} = \underset{cf}{argmax} LIK (cf) .$
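
One simple way to carry out this maximization is a search over a grid of candidate child fractions, sketched below. The per-chromosome disomy log likelihood is passed in as a caller-supplied function; the toy likelihood used in the example is purely hypothetical and stands in for LIK(chr, H11, cf).

```python
def estimate_child_fraction(disomy_log_lik, chromosomes, grid):
    """Return the grid value cf* that maximizes the summed disomy log likelihood."""
    def total(cf):
        return sum(disomy_log_lik(chrom, cf) for chrom in chromosomes)
    return max(grid, key=total)

# Toy stand-in for LIK(chr, H11, cf), peaked at a child fraction of 0.12.
def toy_log_lik(chrom, cf):
    return -(cf - 0.12) ** 2

grid = [i / 100 for i in range(1, 31)]  # candidate child fractions 0.01 .. 0.30
best_cf = estimate_child_fraction(toy_log_lik, range(1, 17), grid)
print(best_cf)
```

A grid search is only one option; any one-dimensional optimizer could be substituted, and the grid resolution bounds the precision of the estimate.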

It is possible to use any set of chromosomes. It is also possible to derive child fraction without paternal data, as follows.

Deriving Copy Number Without Paternal Data

Recall the formula of the simple hypothesis log likelihood on SNP i:

$LIK (i, H) = \log lik (x_{i} | m_{i}, f_{i}, H, cf) = \sum_{c} p (c | m_{i}, f_{i}, H) * \log lik (x_{i} | m_{i}, c, H, cf)$

Determining the probability of the true child given parents p(c|m_i, f_i, H) requires the knowledge of father genotype. If the father genotype is unknown, but pAi, the population frequency of A allele on this SNP, is known, it is possible to approximate the above likelihood with

$LIK (i, H) = \log lik (x_{i} | m_{i}, H, cf) = \sum_{c} p (c | m_{i}, H) * \log lik (x_{i} | m_{i}, c, H, cf)$

where

$p (c | m_{i}, H) = \sum_{f} p (c | m_{i}, f, H) * p (f | p A_{i})$

where p(f|pA_i) is the probability of particular father genotype, given the frequency of A on SNP i.

In particular:

p(AA|pA_i)=(pA_i)², p(AB|pA_i)=2(pA_i)*(1−pA_i), p(BB|pA_i)=(1−pA_i)²
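
As an illustration of the marginalization over the unknown father genotype, the following sketch computes the father genotype prior from the population frequency of A and then p(c|m, H) for the disomy hypothesis H11 only. The function names are hypothetical, and other hypotheses would require the corresponding p(c|m, f, H).

```python
def father_genotype_prior(p_a):
    """p(f | pA) for f in {AA, AB, BB}, given population frequency p_a of allele A."""
    return {"AA": p_a ** 2, "AB": 2 * p_a * (1 - p_a), "BB": (1 - p_a) ** 2}

def child_probs_without_father(mother, p_a):
    """p(c | m, H11) = sum over f of p(c | m, f, H11) * p(f | pA)."""
    probs = {"AA": 0.0, "AB": 0.0, "BB": 0.0}
    for father, p_father in father_genotype_prior(p_a).items():
        for maternal_allele in mother:          # each maternal homolog equally likely
            for paternal_allele in father:      # each paternal homolog equally likely
                child = "".join(sorted(maternal_allele + paternal_allele))
                probs[child] += p_father * 0.25
    return probs

print(child_probs_without_father("AA", 0.5))
```
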

Training Method without using a Control Chromosome or Chromosome Segment

Suppose we have 3 data segments D₁, D₂ and D₃. Suppose that P(H) is the current prior on segment D₁. Suppose that p is a parameter with distribution P(p) (e.g., child fraction cf or noise parameter np). Then the probability for a certain hypothesis H (with prior P(H)) to be true equals:

$P (H \mid D_{1}, D_{2}, D_{3}) = \frac{1}{P (D_{1}, D_{2}, D_{3})} \sum_{p} P (D_{1}, D_{2}, D_{3}, H, p)$

which results in

$P (H \mid D_{1}, D_{2}, D_{3}) = \frac{P (D_{2}, D_{3})}{P (D_{1}, D_{2}, D_{3})} \sum_{p} P (D_{1} \mid H, p) P (H) P (p \mid D_{2}, D_{3})$

or, to approximate,

$P (H \mid D_{1}, D_{2}, D_{3}) \sim \sum_{p} P (D_{1} \mid H, p) P (H) P (p \mid D_{2}, D_{3})$

where the term P(D₁|H, p) can be re-written as

$P (D_{1} \mid H, p) = P (D_{1}) \frac{P (H \mid D_{1}, p)}{P (H)} \frac{P (p \mid D_{1})}{P (p)}$

Thus,

$P (H \mid D_{1}, D_{2}, D_{3}) \sim \sum_{p} P (H \mid D_{1}, p) \frac{P (p \mid D_{1})}{P (p)} P (p \mid D_{2}, D_{3}),$

where the term P(p|D₂, D₃) is a parameter distribution obtained from “training” on segments D₂ and D₃. P(p|D₁)/P(p) depends on what the actual hypothesis for segment 1 is, and may be dropped if unknown. The approximation loses some information, but it can be more stable and intuitive, since each piece is on a probability scale, and fits call per grid point, scaled by grid point probability.

Significant processing advantages can be obtained if a control chromosome or chromosome segment is not required, as the tests can be run on only the chromosome(s) or chromosome segment(s) of interest. In an embodiment, the chromosomes or chromosome segments of interest themselves provide a baseline that can then be used to evaluate the accuracy of the given hypotheses. For example, by using the formula

$P (p \mid D_{1}, D_{2}, D_{3}) = \frac{P (p \mid D_{1})}{P (p)} \cdot \frac{P (p \mid D_{2})}{P (p)} \cdot \frac{P (p \mid D_{3})}{P (p)} \cdot P (p),$

the above probability equation can also be written as:

$P (H \mid D_{1}, D_{2}, D_{3}) \sim \sum_{p} P (H \mid D_{1}, p) \frac{P (p \mid D_{1})}{P (p)} P (p \mid D_{2}, D_{3}) = \sum_{p} P (H \mid D_{1}, p) P (p \mid D_{1}, D_{2}, D_{3})$

In this equation, the probability P(H|D₁, p) is obtained per grid point, and is then scaled by the best parameter distribution estimate given P(p, D₁, D₂, D₃). Once the grid points are fixed, P(H|D₁, p) does not change. However, when no fixed hypothesis exists (i.e., no control chromosome or chromosome segment is used) for P(p, D₁, D₂, D₃), the final answer for P(H|D₁, D₂, D₃) can vary greatly depending on the prior put on each segment hypothesis.

In other words, since the parameter distribution given all the data is a composite of parameter distributions for each segment,

$P (p \mid D_{i}) \sim \sum_{G} P (D_{i} \mid p, G) P (G) P (p)$

where P(G) is the hypothesis prior used on this segment for purposes of parameter estimation.

To account for the lack of a control, a uniform hypothesis prior f_prior(H) for hypothesis H is obtained. For example, this may be obtained by estimating child fraction using an allele ratio plot as discussed above. Then, for each grid point p, calculate a probability of the hypothesis (“per-grid call”):

P(H|D₁,p)˜P(D₁|H, p)P(H)

where P(H) is the hypothesis prior used for segment calling. In an embodiment, this is done only once to provide an idea of the calls for the entire grid space.

For the first pass, f_prior(H) is set to be P(H). The parameter distribution for each segment is then obtained using:

$P (p \mid D_{i}) \sim \sum_{H} P (D_{i} \mid p, H) f_{prior} (H) P (p)$

The composite parameter distribution is then obtained:

$P (p \mid D_{1}, D_{2}, D_{3}) = \frac{P (p \mid D_{1})}{P (p)} \frac{P (p \mid D_{2})}{P (p)} \frac{P (p \mid D_{3})}{P (p)} P (p)$

The (posterior) probability of each hypothesis is then obtained by combining parameter scaling to the per grid call:

$P (H \mid D_{1}, D_{2}, D_{3}) = \sum_{p} P (H \mid D_{1}, p) P (p \mid D_{1}, D_{2}, D_{3}) .$

This provides a new estimate of the distribution of the hypothesis for each segment. f_prior(H) can be updated with the newly derived P(H|D₁, D₂, D₃), and the process (starting with calculating the probability of the hypothesis for each grid point p) is repeated until convergence.

Convergence is reached when the total likelihood no longer changes to any appreciable extent. In an embodiment, this can be treated as an annealing problem, with the function to be optimized being the likelihood of the data P(H|D₁, D₂, D₃) maximized by the best derived posterior P(H) and P(p) distributions. That is, the function to maximize is:

L(D)=P(D₁, D₂, D₃)˜Σ_H Σ_p P(D|H, p)P(H)P(p).

The hypotheses with final probabilities (i.e., calls), child fraction, and noise parameters can then be output.
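
The iterative procedure described above may be sketched as follows for a discrete parameter grid. The input layout `lik[k][h][g]`, standing for P(D_k|H=h, p=grid point g), the function name, and the toy values are hypothetical. As a simplifying assumption, this sketch stops when the segment posteriors change by less than a tolerance, rather than monitoring the total likelihood L(D).

```python
def train_without_control(lik, hyp_prior, param_prior, tol=1e-9, max_iter=100):
    """Return (posterior[k][h], composite[g]) after iterating to a fixed point."""
    n_seg, n_hyp, n_grid = len(lik), len(hyp_prior), len(param_prior)

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Per-grid call P(H | D_k, p), computed once: call[k][g][h].
    call = [[normalize([lik[k][h][g] * hyp_prior[h] for h in range(n_hyp)])
             for g in range(n_grid)] for k in range(n_seg)]
    f_prior = [list(hyp_prior) for _ in range(n_seg)]  # first pass: f_prior = P(H)
    for _ in range(max_iter):
        # Parameter distribution per segment, P(p | D_k).
        seg_post = [normalize([param_prior[g] * sum(lik[k][h][g] * f_prior[k][h]
                                                    for h in range(n_hyp))
                               for g in range(n_grid)]) for k in range(n_seg)]
        # Composite parameter distribution P(p | D_1, ..., D_K).
        composite = []
        for g in range(n_grid):
            value = param_prior[g]
            for k in range(n_seg):
                value *= seg_post[k][g] / param_prior[g]
            composite.append(value)
        composite = normalize(composite)
        # Posterior per segment: per-grid call scaled by the composite distribution.
        posterior = [[sum(call[k][g][h] * composite[g] for g in range(n_grid))
                      for h in range(n_hyp)] for k in range(n_seg)]
        change = max(abs(posterior[k][h] - f_prior[k][h])
                     for k in range(n_seg) for h in range(n_hyp))
        f_prior = posterior
        if change < tol:
            break
    return posterior, composite

# Toy example: three identical segments, two hypotheses, two grid points.
segment = [[0.9, 0.1], [0.1, 0.1]]  # segment[h][g]
posterior, composite = train_without_control([segment] * 3, [0.5, 0.5], [0.5, 0.5])
print(posterior[0], composite)
```

In this toy example the three segments reinforce one another, so the composite parameter distribution concentrates on the first grid point and each segment posterior approaches the per-grid call at that point.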

In certain embodiments of the present disclosure, a method of the invention for determining aneuploidy can include a quantitative allelic method, technique, or algorithm that can be used to determine the relative ratios of two or more different haplotypes that contain the same set of loci in a sample of DNA. The different haplotypes could represent two different homologous chromosomes from one individual, three different homologous chromosomes from a trisomic individual, three different homologous haplotypes from a mother and a fetus where one of the haplotypes is shared between the mother and the fetus, three or four haplotypes from a mother and fetus where one or two of the haplotypes are shared between the mother and the fetus, or other combinations. If one or more of the haplotypes are known, or the diploid genotypes of one or more of the individuals are known, then a set of alleles that are polymorphic between the haplotypes can be chosen, and average allele ratios can be determined based on the set of alleles that uniquely originate from each of the haplotypes.

Direct sequencing of such a sample, however, is extremely inefficient as it results in many sequences for regions that are not polymorphic between the different haplotypes in the sample and therefore reveal no information about the proportion of the two haplotypes. Described herein is a method that specifically targets and enriches segments of DNA in the sample that are more likely to be polymorphic in the genome to increase the yield of allelic information obtained by sequencing. Note that for the allele ratios measured in an enriched sample to be truly representative of the actual haplotype ratios it is critical that there is little or no preferential enrichment of one allele as compared to the other allele at a given loci in the targeted segments. Current methods known in the art to target polymorphic alleles are designed to ensure that at least some of any alleles present are detected. However, these methods were not designed for the purpose of measuring the allele ratio of polymorphic alleles present in the original mixture. It is non-obvious that any particular method of target enrichment would be able to produce an enriched sample wherein the proportion of various alleles in the enriched sample is about the same as to the ratios of the alleles in the original unamplified sample. While enrichment methods may be designed, in theory, to accomplish such an aim, an ordinary person skilled in the art is aware that there is a great deal of stochastic or deterministic bias in current methods. On embodiment of the method described herein allows a plurality of alleles found in a mixture of DNA that correspond to a given locus in the genome to be amplified, or preferentially enriched in a way that the degree of enrichment of each of the alleles is nearly the same. 
Another way to say this is that the method allows the relative quantity of the alleles present in the mixture as a whole to be increased, while the ratio between the alleles that correspond to each locus remains essentially the same as it was in the original mixture of DNA. For the purposes of this disclosure, for the ratio to remain essentially the same, it is meant that the ratio of the alleles in the original mixture divided by the ratio of the alleles in the resulting mixture is between 0.5 and 1.5, between 0.8 and 1.2, between 0.9 and 1.1, between 0.95 and 1.05, between 0.98 and 1.02, between 0.99 and 1.01, between 0.995 and 1.005, between 0.998 and 1.002, between 0.999 and 1.001, or between 0.9999 and 1.0001.

Allele Distributions

In certain embodiments, the goal of the method is to detect fetal copy number based on a maternal blood sample which contains some free-floating fetal DNA. In some embodiments, the fraction of fetal DNA compared to the mother's DNA is unknown. The combination of a targeting method, such as LIPS, followed by sequencing results in a platform response that consists of the count of observed sequences associated with each allele at each SNP. The set of possible alleles, either A/T or C/G, is known at each SNP. Without loss of generality, the first allele will be labeled A and the second allele will be labeled B. Thus, the measurement at each SNP consists of the number of A sequences (N_A) and the number of B sequences (N_B). These will be transformed for the purpose of future calculations into the total sequence count (n) and the ratio of A alleles to total (r). The sequence count for a single SNP will be referred to as the depth of read. The fundamental principle which allows copy number identification from this data is that the ratio of A and B sequences will reflect the ratio of A and B alleles present in the DNA being measured.

n=N_A+N_B

r=N_A/(N_A+N_B)
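The transformation from per-allele counts to depth of read and allele ratio can be sketched as follows. This is a minimal illustration only; the function name `depth_and_ratio` and the example counts are hypothetical and do not come from the disclosure.

```python
def depth_and_ratio(n_a, n_b):
    """Return (n, r) for one SNP: depth of read and fraction of A sequences.

    n_a and n_b are the observed counts of A and B sequences (N_A and N_B).
    """
    n = n_a + n_b
    if n == 0:
        # No reads at this SNP, so the ratio is undefined.
        raise ValueError("no sequences observed at this SNP")
    return n, n_a / n


# Hypothetical counts for a single SNP: 30 A sequences and 10 B sequences.
n, r = depth_and_ratio(30, 10)
```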

Measurements will be initially aggregated over SNPs from the same parent context based on unordered parent genotypes. Each context is defined by the mother genotype and the father genotype, for a total of 9 contexts. For example, all SNPs where the mother's genotype is AA and the father's genotype is BB are members of the AA|BB context. The A allele is defined as present at ratio r_m in the mother genotype and ratio r_f in the father genotype. For example, the allele A is present at ratio r_m=1 where the mother is AA and ratio r_f=0.5 where the father is AB. Thus, each context defines values for r_m and r_f. Although the child genotypes cannot always be predicted from the parent genotypes, the allele ratio averaged over a large number of SNPs can be predicted based on the assumption that a parent AB genotype will contribute A and B at equal rates.

Consider a copy number hypothesis for the child of the form (n_m,n_f) where n_m is the number of mother copies and n_f is the number of father copies of the chromosome. The expected allele ratio r_c in the child (averaged over SNPs in a particular parent context) depends on the allele ratios of the parent contexts and the parent copy numbers.

$\begin{matrix} r_{c} = \frac{n_{m} r_{m} + n_{f} r_{f}}{n_{m} + n_{f}} & (1) \end{matrix}$

In a mixture of maternal and fetal blood, allele copies will be contributed from both the mother directly and from the child. Assume that the fraction of child DNA present in the mixture is δ. Then in the mixture, the ratio r of the A allele in a given context is a linear combination of the mother ratio r_m and the child ratio r_c, which can be reduced to a linear combination of the mother ratio and father ratio using equation 1.

$\begin{matrix} r = (1 - δ) r_{m} + δ r_{c} = (1 - \frac{δ n_{f}}{n_{m} + n_{f}}) r_{m} + \frac{δ n_{f}}{n_{m} + n_{f}} r_{f} & (2) \end{matrix}$
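As a hedged illustration of equations 1 and 2, the sketch below computes the expected child ratio and the expected mixture ratio for one parent context. The denominator of equation 1 is taken as n_m + n_f, which is the form that agrees with equation 2. The function names and the `GENOTYPE_RATIO` lookup are illustrative assumptions, not part of the disclosure.

```python
# Fraction of A alleles implied by each parent genotype (r_m or r_f in the text).
GENOTYPE_RATIO = {"AA": 1.0, "AB": 0.5, "BB": 0.0}


def child_ratio(r_m, r_f, n_m, n_f):
    """Equation 1: expected A-allele ratio in the child for hypothesis (n_m, n_f)."""
    return (n_m * r_m + n_f * r_f) / (n_m + n_f)


def mixture_ratio(r_m, r_f, n_m, n_f, delta):
    """Equation 2: expected A-allele ratio in the maternal/fetal mixture."""
    return (1.0 - delta) * r_m + delta * child_ratio(r_m, r_f, n_m, n_f)


# AA|BB context at a hypothetical child fraction of 0.2.
r_m, r_f = GENOTYPE_RATIO["AA"], GENOTYPE_RATIO["BB"]
r_disomy = mixture_ratio(r_m, r_f, 1, 1, 0.2)    # hypothesis (1, 1)
r_trisomy = mixture_ratio(r_m, r_f, 2, 1, 0.2)   # hypothesis (2, 1)
```

Under these inputs the two hypotheses predict different mixture ratios, which is the difference the likelihood model below relies on.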

Equation 2 predicts the expected ratio of A alleles for SNPs in a given context as a function of the copy number hypothesis (n_m,n_f). Note that the allele ratio on individual SNPs is not predicted by this equation because these depend on random assignment where at least one parent is heterozygous. Therefore, the set of sequences from all SNPs in a particular context will be combined. Assuming that the context contains m SNPs, and recalling that n sequences will be produced from each SNP, the data from that context consists of N=mn sequences. Each of the N sequences is considered an independent random trial where the theoretical rate of A sequences is the allele ratio r. The measured rate of A sequences r̂ is therefore approximately Gaussian distributed with mean r and variance σ²=r(1−r)/N.

Recall that the theoretical allele ratio is a function of the parent copy numbers (n_m,n_f). Thus, each hypothesis h results in a predicted allele ratio r_i^h for the SNPs in parent context i. The data likelihood is defined as the probability of a given hypothesis producing the observed data. Thus, the likelihood of measurement r̂_i from context i under hypothesis h is a binomial distribution, which can be approximated for large N as a Gaussian distribution with the following mean and standard deviation. The mean is determined by the context and the hypothesis as described in equation 2.

$\begin{matrix} p(\hat{r}_{i} | h) = N(\hat{r}_{i}; μ, σ), \quad μ = r_{i}^{h}, \quad σ = \sqrt{\frac{r_{i}^{h} (1 - r_{i}^{h})}{N_{i}}} & (3) \end{matrix}$

The measurements on each of the nine contexts are assumed independent given the parent copy numbers, due to the common assumption of independent noise on each SNP. Thus, the data from a particular chromosome consists of the sequence measurements from contexts i ranging from 1 to 9. The likelihood of the observed allele ratios {r̂₁, …, r̂₉} from the whole chromosome is therefore the product of the individual context likelihoods:

$p(\hat{r}_{1}, \dots, \hat{r}_{9} | h) = \prod_{i = 1}^{9} p(\hat{r}_{i} | h) = \prod_{i = 1}^{9} N(\hat{r}_{i}; r_{i}^{h}, \sqrt{\frac{r_{i}^{h} (1 - r_{i}^{h})}{N_{i}}})$
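The product of context likelihoods is usually easier to handle as a sum of log-likelihoods. The following sketch evaluates that sum for one chromosome under one hypothesis. It assumes, as a simplification not stated in the disclosure, that contexts whose predicted ratio is exactly 0 or 1 (zero predicted variance, for example AA|AA) are skipped; all names are illustrative.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log of the Gaussian density N(x; mu, sigma)."""
    return (-0.5 * math.log(2.0 * math.pi * sigma * sigma)
            - (x - mu) ** 2 / (2.0 * sigma * sigma))


def chromosome_log_likelihood(observed, predicted, counts):
    """Sum over contexts of log p(r_hat_i | h).

    observed  -- measured allele ratios r_hat_i, one per context
    predicted -- predicted ratios r_i^h from equation 2, one per context
    counts    -- number of sequences N_i in each context
    """
    total = 0.0
    for r_hat, r_h, n_i in zip(observed, predicted, counts):
        if r_h <= 0.0 or r_h >= 1.0:
            # Assumption: degenerate contexts carry no usable variance model.
            continue
        sigma = math.sqrt(r_h * (1.0 - r_h) / n_i)
        total += gaussian_logpdf(r_hat, r_h, sigma)
    return total


# Hypothetical predicted ratios for three contexts and two candidate data sets.
predicted = [0.95, 0.05, 0.5]
counts = [10000, 10000, 10000]
ll_match = chromosome_log_likelihood([0.95, 0.05, 0.5], predicted, counts)
ll_off = chromosome_log_likelihood([0.93, 0.07, 0.5], predicted, counts)
```

Data that agree with the predicted ratios score a higher log-likelihood than data that deviate from them.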

Parameter Estimation

Equation 2 predicts the allele ratio as a function of parent copy number hypothesis, but also includes the fraction of child DNA δ. Therefore, the data likelihood for each chromosome is a function of δ through its effect on r_i^h. This effect is highlighted through the notation p(r̂₁, …, r̂₉|h; δ). This parameter cannot be predicted with high accuracy, and therefore must be estimated from the data. A number of different approaches may be used for parameter estimation. One method involves the measurement of chromosomes for which copy number errors are not viable at the stage of development where testing will be performed. The other method measures only chromosomes on which errors are expected to occur.

Measure Some Chromosomes Known to be Disomy

In this method, certain chromosomes will be measured which cannot have copy number errors at the stage of development when testing is performed. These chromosomes will be referred to as the training set T. The copy number hypothesis on these chromosomes is (1,1). Assuming that each chromosome is independent, the data likelihood of the measurements from all chromosomes t in T is the product of the individual chromosome likelihoods. The child fraction δ can be selected to maximize the data likelihood across the chromosomes in T conditioned on the disomy hypothesis. Let R_t represent the set of measurements r̂_i from all contexts i on chromosome t. Then, the maximum likelihood estimate δ* solves the following:

$δ^{*} = \underset{δ}{argmax} \prod_{t \in T} p(R_{t} | h = (1, 1); δ)$

This optimization has only one degree of freedom constrained between zero and one, and therefore can easily be solved using a variety of numerical methods. The solution δ* can then be substituted into equation 2 in order to calculate the likelihoods of each hypothesis on each chromosome.
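Because the optimization has a single bounded parameter, even a plain grid search is adequate. The sketch below estimates the child fraction from a disomic training set by maximizing the summed Gaussian log-likelihood. The grid step, the simulated noise-free data, and all function names are assumptions chosen for illustration; a production method could equally use a bracketing or gradient-based optimizer.

```python
import math

# (r_m, r_f) for the parent contexts whose predicted ratio depends on delta.
CONTEXTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.5), (0.0, 0.5), (0.5, 1.0), (0.5, 0.0)]


def mixture_ratio(r_m, r_f, n_m, n_f, delta):
    """Equation 2 in its reduced form."""
    k = delta * n_f / (n_m + n_f)
    return (1.0 - k) * r_m + k * r_f


def log_likelihood(measurements, n_m, n_f, delta):
    """measurements: list of (r_m, r_f, r_hat, N) tuples for one chromosome."""
    total = 0.0
    for r_m, r_f, r_hat, n in measurements:
        r = mixture_ratio(r_m, r_f, n_m, n_f, delta)
        if not 0.0 < r < 1.0:
            continue  # assumption: skip contexts with zero predicted variance
        var = r * (1.0 - r) / n
        total += -0.5 * math.log(2.0 * math.pi * var) - (r_hat - r) ** 2 / (2.0 * var)
    return total


def estimate_child_fraction(training_set, step=0.001):
    """Grid search for delta* under the disomy hypothesis (1, 1)."""
    best_delta, best_ll = None, -math.inf
    for i in range(1, int(round(1.0 / step))):
        delta = i * step
        ll = sum(log_likelihood(chrom, 1, 1, delta) for chrom in training_set)
        if ll > best_ll:
            best_delta, best_ll = delta, ll
    return best_delta


def simulate(delta, n_m=1, n_f=1, n=100000):
    """Noise-free measurements for one chromosome (illustration only)."""
    return [(r_m, r_f, mixture_ratio(r_m, r_f, n_m, n_f, delta), n)
            for r_m, r_f in CONTEXTS]


training = [simulate(0.1), simulate(0.1)]   # two disomic training chromosomes
delta_star = estimate_child_fraction(training)
```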

Measure Only Chromosomes Which May Have Copy Number Errors

If copy number errors are possible on all of the chromosomes being measured, the accuracy of the ploidy determination increases greatly if fetal fraction is estimated in parallel with the copy number hypotheses. Note that the same copy number error present on all measured chromosomes will be very difficult to detect. For example, maternal trisomy on all chromosomes at a given child concentration will result in the same theoretical allele ratios as disomy on all chromosomes at lower child concentration, because in both cases the contribution of mother alleles compared to father alleles increases uniformly across all chromosomes and contexts.

A straightforward approach for classification of a limited set of chromosomes t is to consider the joint chromosome hypothesis H, which consists of the joint set of hypotheses for all chromosomes being tested. If the chromosome hypotheses consist of disomy, maternal trisomy and paternal trisomy, the number of possible joint hypotheses is 3^T, where T is the number of tested chromosomes. A maximum likelihood estimate δ*(H) can be calculated conditioned on each joint hypothesis. The likelihood of the joint hypothesis is thus calculated as follows:

$\begin{matrix} δ^{*} (H) = \underset{δ}{argmax} Π_{t = 1}^{T} p (R_{t} | H; δ) \\ p (all data | H) = Π_{t = 1}^{T} p (R_{t} | H; δ^{*} (H)) \end{matrix}$

The joint hypothesis likelihoods p(all data|H) can be calculated for each joint hypothesis H, and the maximum likelihood hypothesis is selected, with its corresponding estimate δ*(H) of the child fraction.
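The joint search can be sketched by enumerating all 3^T combinations and fitting the child fraction for each. The code below is an illustrative sketch under stated assumptions: it uses a grid search for δ*(H), noise-free simulated measurements, and hypothetical names throughout. Exhaustive enumeration is only practical for a small number of tested chromosomes.

```python
import itertools
import math

HYPOTHESES = {"disomy": (1, 1), "maternal trisomy": (2, 1), "paternal trisomy": (1, 2)}
CONTEXTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.5), (0.0, 0.5), (0.5, 1.0), (0.5, 0.0)]


def mixture_ratio(r_m, r_f, n_m, n_f, delta):
    k = delta * n_f / (n_m + n_f)          # reduced form of equation 2
    return (1.0 - k) * r_m + k * r_f


def log_likelihood(measurements, n_m, n_f, delta):
    total = 0.0
    for r_m, r_f, r_hat, n in measurements:
        r = mixture_ratio(r_m, r_f, n_m, n_f, delta)
        if not 0.0 < r < 1.0:
            continue  # assumption: skip contexts with zero predicted variance
        var = r * (1.0 - r) / n
        total += -0.5 * math.log(2.0 * math.pi * var) - (r_hat - r) ** 2 / (2.0 * var)
    return total


def best_joint_hypothesis(chromosomes, step=0.005):
    """Return (joint hypothesis, delta*(H), log-likelihood) with the highest likelihood."""
    deltas = [i * step for i in range(1, int(round(1.0 / step)))]
    best = (None, None, -math.inf)
    for joint in itertools.product(HYPOTHESES, repeat=len(chromosomes)):
        for delta in deltas:                # inner loop finds delta*(H)
            ll = sum(log_likelihood(chrom, *HYPOTHESES[name], delta)
                     for chrom, name in zip(chromosomes, joint))
            if ll > best[2]:
                best = (joint, delta, ll)
    return best


def simulate(delta, n_m, n_f, n=100000):
    """Noise-free measurements for one chromosome (illustration only)."""
    return [(r_m, r_f, mixture_ratio(r_m, r_f, n_m, n_f, delta), n)
            for r_m, r_f in CONTEXTS]


# One disomic chromosome and one chromosome with maternal trisomy, delta = 0.2.
data = [simulate(0.2, 1, 1), simulate(0.2, 2, 1)]
joint, delta_h, _ = best_joint_hypothesis(data)
```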

Performance Specifications

The ability to distinguish between parent copy number hypotheses is determined by models discussed in the previous section. At the most general level, the difference in expected allele ratios under the different hypotheses must be large compared to the standard deviations of the measurements. Consider the example of distinguishing between disomy and maternal trisomy, or hypotheses h₁=(1,1) and h₂=(2,1). Hypothesis 1 predicts allele ratio r¹ and hypothesis 2 predicts allele ratio r², as a function of the mother allele ratio r_m and father allele ratio r_f for the context under consideration.

$r^{1} = (1 - \frac{δ}{2}) r_{m} + \frac{δ}{2} r_{f}$ $r^{2} = (1 - \frac{δ}{3}) r_{m} + \frac{δ}{3} r_{f}$

The measured allele ratio r̂ is predicted to be Gaussian distributed, either with mean r¹ or mean r², depending on whether hypothesis 1 or 2 is true. The standard deviation of the measured allele ratio depends similarly on the hypothesis, according to equation 3. In a scenario where one can expect to identify either hypothesis 1 or 2 as truth based on the measurement r̂, the means r¹, r² and standard deviations σ¹, σ² must satisfy a relationship such as the following, which guarantees that the means are far apart compared to the standard deviations. This criterion represents a 2 percent error rate, meaning a 2 percent chance of either false negative or false positive.

|r¹−r²|>2 σ¹+2 σ²

Substituting the copy numbers for disomy (1, 1) and maternal trisomy (2, 1) for hypotheses 1 and 2 results in the following condition:

$| \frac{δ}{6} (r_{f} - r_{m}) | > 2 σ^{1} + 2 σ^{2}$ $σ^{1} = \sqrt{\frac{r^{1} (1 - r^{1})}{N}}$ $σ^{2} = \sqrt{\frac{r^{2} (1 - r^{2})}{N}}$
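The separation criterion can be checked numerically for a given context, child fraction, and sequence count. The sketch below is illustrative only; the function names and the example values (AA|BB context, child fraction 0.1) are assumptions.

```python
import math


def predicted_ratio(r_m, r_f, n_m, n_f, delta):
    """Equation 2 in its reduced form."""
    k = delta * n_f / (n_m + n_f)
    return (1.0 - k) * r_m + k * r_f


def is_separable(r_m, r_f, delta, n):
    """True if |r1 - r2| > 2*sigma1 + 2*sigma2 for disomy vs. maternal trisomy."""
    r1 = predicted_ratio(r_m, r_f, 1, 1, delta)   # hypothesis (1, 1)
    r2 = predicted_ratio(r_m, r_f, 2, 1, delta)   # hypothesis (2, 1)
    sigma1 = math.sqrt(r1 * (1.0 - r1) / n)
    sigma2 = math.sqrt(r2 * (1.0 - r2) / n)
    return abs(r1 - r2) > 2.0 * sigma1 + 2.0 * sigma2


# AA|BB context (r_m = 1, r_f = 0) at a hypothetical child fraction of 0.1.
enough_reads = is_separable(1.0, 0.0, 0.1, 5000)
too_few_reads = is_separable(1.0, 0.0, 0.1, 1000)
```

For these inputs the criterion is met with 5000 sequences in the context but not with 1000, which shows how the required depth grows as the child fraction shrinks.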

Overview of an Analysis Method Utilized in Methods Provided Herein

In certain examples of embodiments of the present disclosure, using the parent contexts, and chromosomes known to be euploid, it is possible to estimate, by a set of simultaneous equations, the proportion of DNA in the maternal blood from the mother and the proportion of DNA in the maternal blood from the fetus. These simultaneous equations are made possible by the knowledge of the alleles present on the father. In particular, alleles present on the father and not present on the mother provide a direct measurement of fetal DNA. One may then look at the particular chromosomes of interest, such as chromosome 21, and see whether the measurements on this chromosome under each parental context are consistent with a particular hypothesis, such as H_mp, where m represents the number of maternal chromosomes and p represents the number of paternal chromosomes, e.g. H₁₁ representing euploid, and H₂₁ and H₁₂ representing maternal and paternal trisomy respectively.

This method, unlike certain other methods for detecting chromosome ploidy, does not use a reference chromosome as a basis by which to compare observed allelic ratios on the chromosome of interest to make a determination of aneuploidy.

This disclosure presents methods by which one may determine the ploidy state of a gestating fetus, at one or more chromosomes, in a non-invasive manner, using genetic information determined from fetal DNA found in maternal blood. The fetal DNA may be purified, partially purified, or not purified; genetic measurements may be made on DNA that originated from more than one individual. Informatics type methods can infer genetic information of the target individual, such as the ploidy state, from the bulk genotypic measurements at a set of alleles. The set of alleles may contain various subsets of alleles, wherein one or more subsets may correspond to alleles that are found on the target individual but not found on the non-target individuals, and one or more other subsets may correspond to alleles that are found on the non-target individual and are not found on the target individual. The method may involve comparing ratios of measured output intensities for various subsets of alleles to expected ratios given various potential ploidy states. The platform response may be determined, and a correction for the bias of the system may be incorporated into the method.

Key Assumptions of the Method:

The expected amount of genetic material in the maternal blood from the mother is constant across all loci.

The expected amount of genetic material present in the maternal blood from the fetus is constant across all loci assuming the chromosomes are euploid.

The chromosomes that are non-viable (all excluding 13,18,21,X,Y) are all euploid in the fetus. In one embodiment, only some of the non-viable chromosomes need be euploid on the fetus.

General Problem Formulation:

One may write y_ijk=g_ijk(x_ijk)+v_ijk, where x_ijk is the quantity of DNA on the allele k=1 or 2 (1 represents allele A and 2 represents allele B), j=1 . . . 23 denotes chromosome number and i=1 . . . N denotes the locus number on the chromosome, g_ijk is the platform response for particular locus and allele ijk, and v_ijk is independent noise on the measurement for that locus and allele. The amount of genetic material is given by x_ijk=am_ijk+Δc_ijk, where a is the amplification factor (or net effect of leakage, diffusion, amplification etc.) of the genetic material present on each of the maternal chromosomes, m_ijk (either 0,1,2) is the copy number of the particular allele on the maternal chromosomes, Δ is the amplification factor of the genetic material present on each of the child chromosomes, and c_ijk is the copy number (either 0,1,2,3) of the particular allele on the child chromosomes. Note that for the first simplified explanation, a and Δ are assumed to be independent of locus and allele, i.e. independent of i, j, and k. This gives:

y_ijk=g_ijk(am_ijk+Δc_ijk)+v_ijk

Approach Using an Affine Model that is Uniform Across All Loci:

One may model g with an affine model, and for simplicity assume that the model is the same for each locus and allele, although it will be understood after reading this disclosure how to modify the approach when the affine model is dependent on i,j,k. Assume the platform response model is

g_ijk(x_ijk)=b+am_ijk+Δc_ijk

where amplification factors a and Δ have been used without loss of generality, and a y-axis intercept b has been added which defines the noise level when there is no genetic material. The goal is to estimate a and Δ. It is also possible to estimate b independently, but assume for now that the noise level is roughly constant across loci, and only use the set of equations based on parent contexts to estimate a and Δ. The measurement at each locus is given by

y_ijk=b+am_ijk+Δc_ijk+v_ijk

Assuming that the noise v_ijk is i.i.d. for each of the measurements within a particular parent context, T, one can sum the signals within that parent context. The parent contexts are represented in terms of alleles A and B, where the first two alleles represent the mother and the second two alleles represent the father: T ϵ {AA|BB, BB|AA, AB|AB, AA|AA, BB|BB, AA|AB, AB|AA, AB|BB, BB|AB}. For each context T, there is a set of loci i,j where the parent DNA conforms to that context, represented i,j ϵ T. Hence:

$y_{T, k} = \frac{1}{N_{T}} \sum_{i, j \in T} y_{i, j, k} = b + a \overline{m_{k, T}} + Δ \overline{c_{k, T}} + v_{k, T}$

where m_k,T, c_k,T, and v_k,T represent the means of the respective values over all the loci conforming to the parent context T, or over all i, j ϵ T. The mean or expected values c_k,T will depend on the ploidy status of the child. The table below describes the mean or expected values m_k,T and c_k,T for k=1 (allele A) or 2 (allele B) and all the parent contexts T. One may calculate the expected values assuming different hypotheses on the child, namely euploidy and maternal trisomy. The hypotheses are denoted by the notation H_mf, where m refers to the number of chromosomes from the mother and f refers to the number of chromosomes from the father, e.g. H₁₁ is euploid, H₂₁ is maternal trisomy. Note that there is symmetry between some of the states by switching A and B, but all states are included for clarity:

Context      AA|BB  BB|AA  AB|AB  AA|AA  BB|BB  AA|AB  AB|AA  AB|BB  BB|AB
m_A,T        2      0      1      2      0      2      1      1      0
m_B,T        0      2      1      0      2      0      1      1      2
c_A,T|H₁₁    1      1      1      2      0      1.5    1.5    0.5    0.5
c_B,T|H₁₁    1      1      1      0      2      0.5    0.5    1.5    1.5
c_A,T|H₂₁    2      1      1.5    3      0      2.5    2      1      0.5
c_B,T|H₂₁    1      2      1.5    0      3      0.5    1      2      2.5

It is now possible to write a set of equations describing all the expected values y_T,k, which can be cast in matrix form, as follows:

$Y = B + A_{H} P + v$

where

$Y = \begin{matrix} [y_{AA | BB, 1} y_{BB | AA, 1} y_{AB | AB, 1} y_{AA | AA, 1} y_{BB | BB, 1} y_{AA | AB, 1} \\ y_{AB | AA, 1} y_{AB | BB, 1} y_{BB | AB, 1} y_{AA | BB, 2} y_{BB | AA, 2} y_{AB | AB, 2} \\ {y_{AA | AA, 2} y_{BB | BB, 2} y_{AA | AB, 2} y_{AB | AA, 2} y_{AB | BB, 2} y_{BB | AB, 2}]}^{T} \end{matrix}$

$P = [\begin{matrix} a \\ Δ \end{matrix}]$ is the matrix of parameters to estimate,

$B = b \vec{1}$, where $\vec{1}$ is the 18×1 matrix of ones,
v=[v_A,AA|BB . . . v_B,BB|AB]^T is the 18×1 matrix of noise terms,
and A_H is the matrix encapsulating the data in the table, where the values are different for each hypothesis H on the ploidy state of the child. Below are examples of the matrix A_H for the ploidy hypotheses H₁₁ and H₂₁:

$A_{H_{11}} = [\begin{matrix} 2.0 & 1.0 \\ 0 & 1.0 \\ 1.0 & 1.0 \\ 2.0 & 2.0 \\ 0 & 0 \\ 2.0 & 1.5 \\ 1.0 & 1.5 \\ 1.0 & 0.5 \\ 0 & 0.5 \\ 0 & 1.0 \\ 2.0 & 1.0 \\ 1.0 & 1.0 \\ 0 & 0 \\ 2.0 & 2.0 \\ 0 & 0.5 \\ 1.0 & 0.5 \\ 1.0 & 1.5 \\ 2.0 & 1.5 \end{matrix}]$ $A_{H_{21}} = [\begin{matrix} 2.0 & 2.0 \\ 0 & 1.0 \\ 1.0 & 1.5 \\ 2.0 & 3.0 \\ 0 & 0 \\ 2.0 & 2.5 \\ 1.0 & 2.0 \\ 1.0 & 1.0 \\ 0 & 0.5 \\ 0 & 1.0 \\ 2.0 & 2.0 \\ 1.0 & 1.5 \\ 0 & 0 \\ 2.0 & 3.0 \\ 0 & 0.5 \\ 1.0 & 1.0 \\ 1.0 & 2.0 \\ 2.0 & 2.5 \end{matrix}]$

In order to estimate a and Δ, or matrix P, aggregate the data across a set of chromosomes that one may assume are euploid on the child sample. This could include all chromosomes j=1 . . . 23 except those that are under test, namely j=13, 18, 21, X and Y. (Note: one could also apply a concordance test for the results on the individual chromosomes in order to detect mosaic aneuploidy on the non-viable chromosomes.) In order to clarify notation, define Y′ as Y measured over all the euploid chromosomes, and Y″ as Y measured over a particular chromosome under test, such as chromosome 21, which may be aneuploid. Apply the matrix A_H₁₁ to the euploid data in order to estimate the parameters:

$\hat{P} = {argmin}_{P} || Y^{'} - B - A_{H_{11}} P {||}_{2} = {(A_{H_{11}}^{T} A_{H_{11}})}^{- 1} A_{H_{11}}^{T} \tilde{Y}$

where Ỹ=Y′−B, i.e., the measured data with the bias removed. The least-squares solution above is only the maximum-likelihood solution if each of the terms in the noise matrix v has a similar variance. This is not the case, most simply because the number of loci N′_T used to compute the mean measurement for each context T is different for each context. As above, use N′_T to refer to the number of loci used on the chromosomes known to be euploid, and use C′ to denote the covariance matrix for mean measurements on the chromosomes known to be euploid. There are many approaches to estimating the covariance C′ of the noise matrix v, which one may assume is distributed as v˜N(0, C′). Given the covariance matrix, the maximum-likelihood estimate of P is

$\hat{P} = {argmin}_{P} || C^{'^{- 1 / 2}} (Y^{'} - B - A_{H_{11}} P) {||}_{2} = {(A_{H_{11}}^{T} C^{'^{- 1}} A_{H_{11}})}^{- 1} A_{H_{11}}^{T} C^{'^{- 1}} \tilde{Y}$

One simple approach to estimating the covariance matrix is to assume that all the terms of v are independent (i.e. no off-diagonal terms) and invoke the Central Limit Theorem so that the variance of each term of v scales as 1/N′_T, so that one may find the 18×18 matrix

$C^{'} = [\begin{matrix} 1 / N_{AA | BB}^{'} & \dots & 0 \\ ⋮ & ⋱ & ⋮ \\ 0 & \dots & 1 / N_{BB | AB}^{'} \end{matrix}]$
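With a diagonal covariance, the weighted least-squares estimate reduces to a 2×2 system of normal equations in which each row is weighted by its locus count N′_T. The sketch below solves that system directly without external libraries. The rows of `A_H11` are copied from the matrix given above; the bias, locus counts, and true parameter values are hypothetical and used only to generate a noise-free check.

```python
# Rows of A_H11: (mother copy number m, expected child copy number c),
# allele A contexts first, then allele B contexts, in the order used for Y.
A_H11 = [(2.0, 1.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0), (0.0, 0.0), (2.0, 1.5),
         (1.0, 1.5), (1.0, 0.5), (0.0, 0.5), (0.0, 1.0), (2.0, 1.0), (1.0, 1.0),
         (0.0, 0.0), (2.0, 2.0), (0.0, 0.5), (1.0, 0.5), (1.0, 1.5), (2.0, 1.5)]


def weighted_least_squares(A, y, b, n_loci):
    """Estimate (a, Delta) with weights 1/variance = N'_T for each row."""
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (m, c), y_i, w in zip(A, y, n_loci):
        resid = y_i - b                      # Y-tilde: measurement with bias removed
        s11 += w * m * m
        s12 += w * m * c
        s22 += w * c * c
        t1 += w * m * resid
        t2 += w * c * resid
    det = s11 * s22 - s12 * s12
    a_hat = (s22 * t1 - s12 * t2) / det      # closed-form 2x2 solve
    delta_hat = (s11 * t2 - s12 * t1) / det
    return a_hat, delta_hat


# Hypothetical noise-free data: a = 10, Delta = 1.5, bias b = 3.
true_a, true_delta, bias = 10.0, 1.5, 3.0
y_prime = [bias + true_a * m + true_delta * c for m, c in A_H11]
n_loci = [500, 480, 900, 1500, 1400, 700, 720, 690, 710] * 2   # N'_T per row
a_hat, delta_hat = weighted_least_squares(A_H11, y_prime, bias, n_loci)
```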

Once P̂ has been estimated, use these parameters to determine the most likely hypothesis on the chromosome under study, such as chromosome 21. In other words, choose the hypothesis:

H*=arg min_H ∥C″^(−1/2)(Y″−B−A_H P̂)∥₂

Having found H*, one may then estimate the degree of confidence that one may have in the determination of H*. Assume, for example, that there are two hypotheses under consideration: H₁₁ (euploid) and H₂₁ (maternal trisomy). Assume that H*=H₁₁. Compute the distance measures corresponding to each of the hypotheses:

d₁₁=∥C″^(−1/2)(Y″−B−A_H₁₁ P̂)∥₂

d₂₁=∥C″^(−1/2)(Y″−B−A_H₂₁ P̂)∥₂

It can be shown that the squares of these distance measures are roughly distributed as a Chi-Squared random variable with 18 degrees of freedom. Let χ₁₈ represent the corresponding probability density function for such a variable. One may then find the ratio in the probabilities P_H of each of the hypotheses according to:

$\frac{P_{H_{11}}}{P_{H_{21}}} = \frac{χ_{18} (d_{11}^{2})}{χ_{18} (d_{21}^{2})}$

One may then compute the probabilities of each hypothesis by adding the equation P_H₁₁+P_H₂₁=1. The confidence that the chromosome is in fact euploid is given by P_H₁₁.
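The confidence calculation can be sketched with the closed-form Chi-Squared density. The code below works in the log domain to avoid numerical underflow when a distance is large; that choice, the function names, and the example distances are assumptions made for illustration.

```python
import math


def chi2_logpdf(x, k=18):
    """Log of the Chi-Squared density with k degrees of freedom, for x > 0."""
    half_k = k / 2.0
    return ((half_k - 1.0) * math.log(x) - x / 2.0
            - half_k * math.log(2.0) - math.lgamma(half_k))


def euploid_confidence(d11, d21, k=18):
    """P_H11 from the density ratio and the constraint P_H11 + P_H21 = 1."""
    diff = chi2_logpdf(d21 ** 2, k) - chi2_logpdf(d11 ** 2, k)
    # P_H11 = 1 / (1 + exp(diff)), evaluated so that exp never overflows.
    if diff > 0.0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical squared distances: 16 under H11 and 60 under H21.
confidence = euploid_confidence(math.sqrt(16.0), math.sqrt(60.0))
```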

Variations on the Method

(1) One may modify the above approach for different biases b on each of the channels representing alleles A and B. The bias matrix B is redefined as follows:

$B = [\begin{matrix} b_{A} \vec{1} \\ b_{B} \vec{1} \end{matrix}]$

where $\vec{1}$ is a 9×1 matrix of ones. As discussed above, the parameters b_A and b_B can either be assumed based on a-priori measurements, or can be included in the matrix P and actively estimated (i.e. there is sufficient rank in the equations over all the contexts to do so).

(2) In the general formulation, where y_ijk=g_ijk(am_ijk+Δc_ijk)+v_ijk, one may directly measure or calibrate the function g_ijk for every locus and allele, so that the function (which one may assume is monotonic for the vast majority of genotyping platforms) can be inverted. One may then use the function inverse to recast the measurements in terms of the quantity of genetic material so that the system of equations is linear, i.e. y′_ijk=g_ijk⁻¹(y_ijk)=am_ijk+Δc_ijk+v′_ijk. This approach is particularly good when g_ijk is an affine function so that the inversion does not produce amplification or biasing of the noise in v′_ijk.

(3) The method above may not be optimal from a noise perspective since the modified noise term v′_ijk=g_ijk⁻¹(v_ijk) may be amplified or biased by the function inversion. Another approach is to linearize the measurements around an operating point, i.e. y_ijk=g_ijk(am_ijk+Δc_ijk)+v_ijk may be recast as: y_ijk≈g_ijk(am_ijk)+g_ijk′(am_ijk)Δc_ijk+v_ijk. Since one may expect no more than 30% of the free-floating DNA in the maternal blood to be from the child, Δ<<a, and the expansion is a reasonable approximation. Alternatively, for a platform response such as that of the ILLUMINA BEAD ARRAY, which is monotonically increasing and for which the second derivative is always negative, one could improve the linearization estimate according to y_ijk≈g_ijk(am_ijk)+0.5 (g_ijk′(am_ijk)+g_ijk′(am_ijk+Δc_ijk)) Δc_ijk+v_ijk. The resulting set of equations may be solved iteratively for a and Δ using a method such as Newton-Raphson optimization.

(4) Another general approach is to measure the total amount of DNA on the test chromosome (mother plus fetus) and compare with the amount of DNA on all other chromosomes, based on the assumption that the amount of DNA should be constant across all chromosomes. Although this is simpler, one disadvantage is that it is not known how much is contributed by the child, so it is not possible to estimate confidence bounds meaningfully. However, one could look at the standard deviation across other chromosome signals that should be euploid to estimate the signal variance and generate a confidence bound. This method involves including measurements of maternal DNA which are not on the child DNA, so these measurements contribute nothing to the signal but do contribute directly to noise. In addition, it is not possible to calibrate out the amplification biases amongst different chromosomes. To address this last point, it is possible to find a regression function linking each chromosome's mean signal level to every other chromosome's mean signal level, combine the signal from all chromosomes by weighting based on variance of the regression fit, and look to see whether the test chromosome of interest is within the acceptable range as defined by the other chromosomes.

Incorporating Data Dropouts

Elsewhere in this disclosure it has been assumed that the probability of getting an A is a direct function of the true mother genotype, the true child genotype, the fraction of the child in the mix, and the child copy number. It is also possible that mother or child alleles can drop out, for example instead of having true child AB in the mix, there is only A, in which case the chance of getting a nexus sequence measurement of A is much higher. Assume that the mother dropout rate is MDO, and the child dropout rate is CDO. In some embodiments, the mother dropout rate can be assumed to be zero, and child dropout rates are relatively low, so the results in practice are not severely affected by dropouts. Nonetheless, they have been incorporated into the algorithm here. Elsewhere, lik(x_i|m_i, c, cf)=pdf_x(x_i) has been defined as the likelihood of getting x_i probability of A on SNP i, given sequence measurements S, assuming true mother m_i and true child c. If there is a dropout in the mother or child, the input data is not the true mother (m_i) or child (c), but the mother after a possible dropout (m_d) and the child after a possible dropout (c_d). One can then rewrite the above formula as

$lik (x_{i} | m_{i}, c, cf) = \sum_{m_{d}, c_{d}} p (m_{d} | m_{i}) * p (c_{d} | c) * lik (x_{i} | m_{d}, c_{d}, cf)$

where p(m_d|m_i) is the probability of new mother genotype m_d, given true mother genotype m_i, assuming dropout rate MDO, and p(c_d|c) is the probability of new child genotype c_d, given true child genotype c, assuming dropout rate CDO. If nA_T=number of A alleles in true genotype c, nA_D=number of A alleles in 'drop' genotype c_d, where nA_T≥nA_D, and similarly nB_T=number of B alleles in true genotype c, nB_D=number of B alleles in 'drop' genotype c_d, where nB_T≥nB_D, and d=dropout rate, then

$p (c_{d} | c) = (\begin{matrix} n A_{T} \\ n A_{D} \end{matrix}) * d^{n A_{T} - n A_{D}} * {(1 - d)}^{n A_{D}} * (\begin{matrix} {nB}_{T} \\ {nB}_{D} \end{matrix}) * d^{{nB}_{T} - {nB}_{D}} * {(1 - d)}^{{nB}_{D}}$
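The dropout probability is a product of two binomial terms, one for the A alleles and one for the B alleles. A minimal sketch follows; representing genotypes as strings of 'A' and 'B' characters (with an empty string for complete dropout) is an assumption made for illustration, as are the function names.

```python
from math import comb


def allele_term(n_true, n_drop, d):
    """Binomial term for one allele type: n_drop of n_true copies survive dropout."""
    if not 0 <= n_drop <= n_true:
        return 0.0                      # a dropout cannot create new copies
    return comb(n_true, n_drop) * d ** (n_true - n_drop) * (1.0 - d) ** n_drop


def genotype_dropout_prob(true_gt, drop_gt, d):
    """p(c_d | c) for genotypes written as strings such as 'AAB' or 'AB'."""
    return (allele_term(true_gt.count("A"), drop_gt.count("A"), d)
            * allele_term(true_gt.count("B"), drop_gt.count("B"), d))


# True child AB observed as A only, with a hypothetical dropout rate of 0.1.
p_ab_to_a = genotype_dropout_prob("AB", "A", 0.1)

# The probabilities over every possible 'drop' genotype of AAB sum to one.
total = sum(genotype_dropout_prob("AAB", "A" * a + "B" * b, 0.1)
            for a in range(3) for b in range(2))
```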

For one set of experimental data, the parent genotypes have been measured, as well as the true child genotype, where the child has maternal trisomy on chromosomes 14 and 21. Sequencing measurements have been simulated for varying values of child fraction, N distinct SNPs, and total number of reads NR. From this data it is possible to derive the most likely child fraction, and derive copy number assuming known or derived child fraction.

In one embodiment, the method disclosed herein can be used to determine a fetal aneuploidy by determining the number of copies of maternal and fetal target chromosomes, having target sequences in a mixture of maternal and fetal genetic material. This method may entail obtaining maternal tissue containing both maternal and fetal genetic material; in some embodiments this maternal tissue may be maternal plasma or a tissue isolated from maternal blood. This method may also entail obtaining a mixture of maternal and fetal genetic material from said maternal tissue by processing the aforementioned maternal tissue. This method may entail distributing the genetic material obtained into a plurality of reaction samples, to randomly provide individual reaction samples that contain a target sequence from a target chromosome and individual reaction samples that do not contain a target sequence from a target chromosome, for example, performing high throughput sequencing on the sample. This method may entail analyzing the target sequences of genetic material present or absent in said individual reaction samples to provide a first number of binary results representing presence or absence of a presumably euploid fetal chromosome in the reaction samples and a second number of binary results representing presence or absence of a possibly aneuploid fetal chromosome in the reaction samples. Either of the number of binary results may be calculated, for example, by way of an informatics technique that counts sequence reads that map to a particular chromosome, to a particular region of a chromosome, to a particular locus or set of loci. This method may involve normalizing the number of binary events based on the chromosome length, the length of the region of the chromosome, or the number of loci in the set. This method may entail calculating an expected distribution of the number of binary results for a presumably euploid fetal chromosome in the reaction samples using the first number. 
This method may entail calculating an expected distribution of the number of binary results for a presumably aneuploid fetal chromosome in the reaction samples using the first number and an estimated fraction of fetal DNA found in the mixture, for example, by multiplying the expected read count distribution of the number of binary results for a presumably euploid fetal chromosome by (1+n/2) where n is the estimated fetal fraction. The fetal fraction may be estimated by a plurality of methods, some of which are described elsewhere in this disclosure. This method may involve using a maximum likelihood approach to determine whether the second number corresponds to the possibly aneuploid fetal chromosome being euploid or being aneuploid. This method may involve calling the ploidy status of the fetus to be the ploidy state that corresponds to the hypothesis with the maximum likelihood of being correct given the measured data.
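The count-based comparison can be sketched as follows. The disclosure specifies the (1+n/2) scaling of the expected euploid count but does not fix a distribution for the counts here; the Poisson model, the function names, and the example numbers below are assumptions made only to give the maximum likelihood step a concrete form.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing k counts under a Poisson mean lam (assumed model)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def call_ploidy(observed, expected_euploid, fetal_fraction):
    """Pick the hypothesis with the larger likelihood for the observed count."""
    # Trisomy adds one extra fetal copy, scaling the expected count by (1 + n/2).
    expected_trisomy = expected_euploid * (1.0 + fetal_fraction / 2.0)
    scores = {
        "euploid": poisson_logpmf(observed, expected_euploid),
        "trisomy": poisson_logpmf(observed, expected_trisomy),
    }
    return max(scores, key=scores.get)


# Hypothetical normalized counts with an estimated fetal fraction of 0.1.
call_high = call_ploidy(10480, 10000.0, 0.1)
call_low = call_ploidy(10010, 10000.0, 0.1)
```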

Simplified Explanation for Allele Ratio Method for Ploidy Calling in NPD

In one embodiment the ploidy state of a gestating fetus may be determined using a method that looks at allele ratios. Some methods determine fetal ploidy state by comparing numerical sequencing output DNA counts from a suspect chromosome to a reference euploid chromosome. In contrast to that concept, the allele ratio method determines fetal ploidy state by looking at allele ratios for different parental contexts on one chromosome. This method has no need to use a reference chromosome. For example, imagine the following possible ploidy states, and the allele ratios for various parental contexts:

(note: ratio ‘r’ is defined as follows: 1/r=fraction mother DNA/fraction fetal DNA)

Parent    A:B         Child      A:B         Child      A:B         Child
context   Euploidy    genotype   P-U tri*    genotype   P-M tri*    genotype
AA|BB     2 + r:r     AB         2 + r:2r    ABB        2 + r:2r    ABB
BB|AA     r:2 + r     AB         2 + 2r:r    AAB        2 + 2r:r    AAB
AA|AB     1:0         AA         2 + 2r:r    AAB        1:0         AAA
AA|AB     2 + r:r     AB         -           -          2 + 2r:r    AAB
AA|AB     4 + 2r:r    average    -           -          4 + 4r:r    average
*P-U tri = paternal unmatched trisomy; P-M tri = paternal matching trisomy

Note that this table represents only a subset of the parental contexts and a subset of the possible ploidy states that this method is designed to differentiate. In this case, one can determine the A:B ratios for a plurality of alleles from a set of parental contexts in a set of sequencing data. One can then state a number of hypothesis for each ploidy state, and for each value of r; each hypothesis will have an expected pattern of A:B ratios for the different parental contexts. One can then determine which hypothesis best fits the experimental data.

For example, using the above set of parental contexts, and the value of r=0.2, one can rewrite the chart as follows: (For example, one can calculate [# reads of allele A/# reads of allele B]; thus 2+r:r becomes 2+0.2:0.2→2.2:0.2=11)

Now, one can look at the ratios between the A:B ratios for different parental contexts. In this case, one may expect the ratio A:B_AA|BB / A:B_AA|AB to be 11/21=0.524 on average for euploidy, 5.5/12=0.458 on average for a paternal unmatched trisomy, and 5.5/44=0.125 on average for a paternal matching trisomy. The profile of A:B ratios among different contexts will be different for different ploidy states, and the profiles should be distinctive enough that it will be possible to determine the ploidy state for a chromosome with high accuracy. Note that the value of r may be determined using a different method, or it can be determined using a maximum likelihood approach to this method. In one embodiment, the method requires maternal genotypic knowledge. In one embodiment the method requires paternal genotypic knowledge. In one embodiment the method does not require paternal genotypic knowledge. In an embodiment, the percent fetal fraction and the ratio of maternal to fetal DNA are essentially equivalent, and can be used interchangeably after applying the appropriate algebraic transformation. In some embodiments, r=[percent fetal fraction]/[1−percent fetal fraction].
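As a small worked illustration of the quantities above, the following sketch converts a fetal fraction to r and evaluates the expected A:B ratio (2+r):r for the AA|BB context with a euploid child. The function names are hypothetical, and only this one table entry is computed.

```python
def fetal_to_maternal_ratio(fetal_fraction):
    """r = [fetal fraction] / [1 - fetal fraction], as stated in the text."""
    return fetal_fraction / (1.0 - fetal_fraction)

def ab_ratio_aa_bb_euploid(r):
    """Expected A:B read ratio (2 + r):r for mother AA, father BB, euploid AB child."""
    return (2.0 + r) / r

if __name__ == "__main__":
    # A fetal fraction of 1/6 corresponds to r = 0.2
    r = fetal_to_maternal_ratio(1.0 / 6.0)
    print(round(r, 3), round(ab_ratio_aa_bb_euploid(0.2), 3))
```

For r=0.2 this reproduces the value 2.2:0.2=11 used in the example above.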

SNP Classification Using Phred Scores

The phred score, q, is defined as follows: P(wrong base call)=10^(−q/10). Let x=reference ratio of true genotype=number of reference alleles/number of total alleles. For disomy, x in {0, 0.5, 1} corresponds to {MM, RM, RR}. Let z be the allele observed in a sequence, z in {R, M}. Here the likelihood of observing z=R is shown, conditioned on the true ratio of reference alleles in the genotype (i.e., what is P(z=R|x)):

P(z=R|x)=P(z=R|gc, x)P(gc)+P(z=R|bc,x)P(bc)
where gc is the event of a correct call and bc is the event of a bad call.

P(gc) and P(bc) are calculated from the phred score. P(z=R|gc,x)=x and P(z=R|bc,x)=1−x, assuming that probes are unbiased.

Result, where b=P(wrong base call): P(z=R|x)=x(1−b)+(1−x)*b

Note that the probability of a reference allele measurement converges to the reference allele ratio as the phred score improves, as expected.

Assuming that each sequence is generated independently, conditioned on the true genotype, the likelihood of a set of measurements at the same SNP is simply the product of the individual likelihoods. This method accounts for varying phred scores. In another embodiment, it is possible to account for varying confidence in the sequence mapping. Given the set of n sequences for a single SNP, the combination of likelihoods results in a polynomial of order n that can be evaluated at the candidate allele ratios that represent the various hypotheses.
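The per-read likelihood and its product over reads can be sketched as below. The function names and the representation of reads as (allele, phred) pairs are assumptions for illustration; log-likelihoods are summed instead of multiplying raw probabilities, which is equivalent to the product described above and avoids numerical underflow.

```python
import math

def phred_to_error(q):
    """b = P(wrong base call) = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)

def prob_reference(x, q):
    """P(z=R | x) = x(1-b) + (1-x)b for true reference-allele ratio x."""
    b = phred_to_error(q)
    return x * (1.0 - b) + (1.0 - x) * b

def log_likelihood(reads, x):
    """Sum of per-read log-likelihoods; `reads` is a list of (allele, phred) pairs."""
    total = 0.0
    for allele, q in reads:
        p_ref = prob_reference(x, q)
        total += math.log(p_ref if allele == "R" else 1.0 - p_ref)
    return total

def classify(reads, candidates=(0.0, 0.5, 1.0)):
    """Return the candidate allele ratio (MM, RM, RR) with the highest likelihood."""
    return max(candidates, key=lambda x: log_likelihood(reads, x))

if __name__ == "__main__":
    reads = [("R", 30), ("M", 30), ("R", 20), ("M", 35)]
    print(classify(reads))
```

Because each read carries its own phred score, reads of differing quality contribute differently to the total, which is the property the text highlights.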

SNP Classification Using Phred Threshold

When a large number of sequences are available for a single SNP, the polynomial likelihood function on the allele ratio becomes intractable. An alternative is to consider only the base calls which have high phred score, and then assume that they are accurate. Each base read is now an IID Bernoulli trial according to the true allele ratio, and the likelihood function is approximately Gaussian. If r is the ratio of reference reads in the data and n is the number of reads, the likelihood function on x (the true reference allele ratio) has mean=r and standard deviation=sqrt(r*(1−r)/n).
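A minimal sketch of the thresholded, Gaussian form of the likelihood follows. The threshold value of 30 and the function names are hypothetical choices for illustration; the disclosure does not specify a threshold. The sketch also assumes the observed ratio is strictly between 0 and 1, since the stated standard deviation is zero otherwise.

```python
import math

def filter_reads(reads, min_phred=30):
    """Keep alleles whose base call meets the (hypothetical) phred threshold."""
    return [allele for allele, q in reads if q >= min_phred]

def gaussian_likelihood(x, ref_reads, total_reads):
    """Gaussian likelihood on x with mean r and sd sqrt(r(1-r)/n)."""
    r = ref_reads / total_reads
    if r <= 0.0 or r >= 1.0:
        raise ValueError("observed ratio must be strictly between 0 and 1")
    sd = math.sqrt(r * (1.0 - r) / total_reads)
    z = (x - r) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0):
        print(x, gaussian_likelihood(x, 480, 1000))
```

Evaluating the function at the candidate ratios {0, 0.5, 1} and taking the largest value gives a classification analogous to the polynomial case, at constant cost per SNP.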

SNP Bias Correlation Across Samples

Using the two likelihood functions discussed above (polynomial, Gaussian) a SNP can be classified as RR, RM, or MM by considering the allele ratios {1, 0.5, 0}, or a maximum likelihood estimate of the allele ratio can be calculated. When the same SNP is classified as RM in two different samples, it is possible to compare the MLE estimates of the allele ratio to look for consistent “probe bias.”

Using Sequence Length as a Prior to Determine the Origin of DNA

It has been reported that the distributions of sequence length differ for maternal and fetal DNA, with fetal DNA generally being shorter. In one embodiment of the present disclosure, it is possible to use previous knowledge in the form of empirical data to construct prior distributions for the expected length of both maternal DNA (P(x|maternal)) and fetal DNA (P(x|fetal)). Given a new unidentified DNA sequence of length x, it is possible to assign a probability that the sequence is either maternal or fetal DNA, based on the prior likelihood of x given either maternal or fetal origin. In particular, if P(x|maternal)>P(x|fetal), then the DNA sequence can be classified as maternal, with P(maternal|x)=P(x|maternal)/[P(x|maternal)+P(x|fetal)], and if P(x|maternal)<P(x|fetal), then the DNA sequence can be classified as fetal, with P(fetal|x)=P(x|fetal)/[P(x|maternal)+P(x|fetal)]. In one embodiment of the present disclosure, distributions of maternal and fetal sequence lengths that are specific to a sample can be determined by considering the sequences that can be assigned as maternal or fetal with high probability, and those sample-specific distributions can then be used as the expected size distributions for that sample.
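The length-based assignment can be sketched as follows. The Gaussian form of the length distributions and all numeric parameters below are placeholders chosen for illustration, not empirical values from the disclosure; in practice the distributions would come from empirical or sample-specific data as described above.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as a stand-in length model."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical (mean, sd) fragment lengths in bp; illustrative placeholders only
MATERNAL = (175.0, 20.0)
FETAL = (160.0, 15.0)

def classify_fragment(length):
    """Return (label, probability) using P(x|maternal) / [P(x|maternal) + P(x|fetal)]."""
    p_m = normal_pdf(length, *MATERNAL)
    p_f = normal_pdf(length, *FETAL)
    post_m = p_m / (p_m + p_f)
    if post_m > 0.5:
        return "maternal", post_m
    return "fetal", 1.0 - post_m

if __name__ == "__main__":
    for length in (145, 200):
        print(length, classify_fragment(length))
```

Fragments assigned with high probability could then be pooled to re-estimate the two distributions for that sample, in line with the sample-specific embodiment.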

Methods for Determining the Average Copy Number in a Set of Target Cells

The methods described above assume that the DNA from the target cell is from one target cell, or else from target cells which are essentially genetically identical. There are circumstances where this assumption may not hold, for example, in the case of placental mosaicism, where the target is a fetus, and the DNA from the fetus originates from a plurality of cells where some of the placental cells are genetically distinct from other placental cells. For example, in some cases where the fetus is 47,XX+18 or 47,XY+18, the placenta is mosaic, that is, a mixture of 46,XX and 47,XX+18 or 46,XY and 47,XY+18 respectively.

Another example involves detection of cancer through copy number variants, where the target cells are from a tumor, and where the non-target cells are non-cancerous cells from the host. The hallmark of cancer is the instability of the genome, and in many if not all cases, tumors are genetically heterogeneous. Even small biopsies of tumor tissue show heterogeneity. The ways in which the genome of the cancerous cells differ from the native host DNA are considered mutations; some but not necessarily all of these mutations may drive the oncogenic properties of the cancer. In the case of a liquid biopsy, i.e. detection of tumor DNA from cell free DNA (cfDNA) in the blood stream, the cell-free tumor DNA (ctDNA) is believed to originate from apoptotic or necrotic cancer cells, which are often heterogeneous, and are representative of some or all of the cells of the tumor. There are a number of types of mutations that are seen in cancers, including but not limited to point mutations, also called single nucleotide variants (SNVs), copy number variants (CNVs), hypomethylation, hypermethylation, deletions, and duplications.

If one considers the normal disomic genome of the host to be the baseline, then analysis of a mixture of normal and cancer cells will yield the average difference between the baseline and the DNA from the cells of origin of the ctDNA in the mixture. For example, imagine a case where 10% of the DNA in the sample originated from cells with a deletion over a region of a chromosome that is targeted by the assay. A quantitative approach should show that the quantity of reads corresponding to that region would be expected to be 95% of what would be expected for a normal sample. This is because one of the two target chromosomal regions in each of the tumor cells with a deletion of the targeted region is missing, and thus the total amount of DNA mapping to that region would be 90% (for the normal cells) plus ½×10% (for the tumor cells)=95%. Alternatively, an allelic approach should show that the ratio of alleles at heterozygous loci averaged 19:20. Now imagine a case where 10% of the DNA in the sample originated from cells with a five-fold focal amplification of a region of a chromosome that is targeted by the assay. A quantitative approach should show that the quantity of reads corresponding to that region would be expected to be 125% of what would be expected for a normal sample. This is because one of the two target chromosomal regions in each of the tumor cells with a five-fold focal amplification is copied an extra five times over the targeted region, and thus the total amount of DNA mapping to that region would be 90% (for the normal cells) plus (2+5)×10%/2 (for the tumor cells)=125%. Alternatively, an allelic approach should show that the ratio of alleles at heterozygous loci averaged 25:20. 
Note that when using an allelic approach alone, a focal amplification of five-fold over a chromosomal region in a sample with 10% ctDNA may appear the same as a deletion over the same region in a sample with 40% ctDNA; in these two cases, the haplotype that is under-represented in the case of the deletion would appear to be the haplotype without a CNV in the case with the focal duplication, and the haplotype without a CNV in the case of the deletion would appear to be the over-represented haplotype in the case with the focal duplication. Combining the likelihoods produced by this allelic approach with likelihoods produced by a quantitative approach would differentiate between the two possibilities.
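The quantitative arithmetic in the two examples above can be reproduced with a short sketch. The function and parameter names are hypothetical; the formula is simply the weighted average of the normal contribution and the tumor contribution relative to two normal copies.

```python
def expected_read_fraction(tumor_dna_fraction, tumor_copies, normal_copies=2):
    """Expected reads over a region, relative to a fully normal (disomic) sample.

    tumor_dna_fraction: fraction of the sample DNA that originates from tumor cells
    tumor_copies: number of copies of the region in each tumor cell
    """
    normal_part = 1.0 - tumor_dna_fraction
    tumor_part = tumor_dna_fraction * tumor_copies / normal_copies
    return normal_part + tumor_part

if __name__ == "__main__":
    # 10% tumor DNA, one of two copies deleted: 90% + (1/2) x 10%
    print(expected_read_fraction(0.10, 1))
    # 10% tumor DNA, five extra copies on one homolog: 90% + (2+5) x 10% / 2
    print(expected_read_fraction(0.10, 7))
```

The first call corresponds to the 95% figure and the second to the 125% figure given in the text.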

In certain embodiments, provided herein are kits for performing any of the methods for detecting aneuploidy provided herein, that include at least one tube of at least one reagent for performing such method and a computer readable medium or an access code to an online computer program, to perform one or more of the analytical techniques used in the method. For example, a kit in certain embodiments, includes a tube of oligonucleotides for amplifying a chromosome region of interest that includes a locus, and an access code for unlocking online software for making an initial copy number determination or for making a confirmatory copy number determination. The kit can further include a tube with one or more reagents for amplifying the locus. The components of the kit can be contained in the same physical container (e.g. box) or they can be arranged together on an Internet page.

Sample Preparation

Exemplary Sample Preparation Methods

In some embodiments, methods of the invention include isolating or purifying the DNA and/or RNA. There are a number of standard procedures known in the art to accomplish such an end. In some embodiments, the sample may be centrifuged to separate various layers. In some embodiments, the DNA or RNA may be isolated using filtration. In some embodiments, the preparation of the DNA or RNA may involve amplification, separation, purification by chromatography, liquid-liquid separation, isolation, preferential enrichment, preferential amplification, targeted amplification, or any of a number of other techniques either known in the art or described herein. In some embodiments for the isolation of DNA, RNase is used to degrade RNA. In some embodiments for the isolation of RNA, DNase (such as DNase I from Invitrogen, Carlsbad, Calif., USA) is used to degrade DNA. In some embodiments, an RNeasy mini kit (Qiagen) is used to isolate RNA according to the manufacturer's protocol. In some embodiments, small RNA molecules are isolated using the mirVana PARIS kit (Ambion, Austin, Tex., USA) according to the manufacturer's protocol (Gu et al., J. Neurochem. 122:641-649, 2012, which is hereby incorporated by reference in its entirety). The concentration and purity of RNA may optionally be determined using Nanovue (GE Healthcare, Piscataway, N.J., USA), and RNA integrity may optionally be measured by use of the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, Calif., USA) (Gu et al., J. Neurochem. 122:641-649, 2012, which is hereby incorporated by reference in its entirety). In some embodiments, TRIZOL or RNAlater (Ambion) is used to stabilize RNA during storage.

In some embodiments, universal tagged adaptors are added to make a library from isolated nucleic acids. Prior to ligation, sample DNA may be blunt ended, and then a single adenosine base is added to the 3-prime end. Prior to ligation the DNA may be cleaved using a restriction enzyme or some other cleavage method. During ligation the 3-prime adenosine of the sample fragments and the complementary 3-prime thymine overhang of the adaptor can enhance ligation efficiency. In some embodiments, adaptor ligation is performed using the ligation kit found in the AGILENT SURESELECT kit. In some embodiments, the library is amplified using universal primers. In an embodiment, the amplified library is fractionated by size separation or by using products such as AGENCOURT AMPURE beads or other similar methods. In some embodiments, PCR amplification is used to amplify target loci. In some embodiments, the amplified DNA is sequenced (such as sequencing using an ILLUMINA GAIIx or HiSeq sequencer). In some embodiments, the amplified DNA is sequenced from each end of the amplified DNA to reduce sequencing errors. If there is a sequence error in a particular base when sequencing from one end of the amplified DNA, there is less likely to be a sequence error in the complementary base when sequencing from the other side of the amplified DNA (compared to sequencing multiple times from the same end of the amplified DNA).

In some embodiments, whole genome amplification (WGA) is used to amplify a nucleic acid sample. There are a number of methods available for WGA: ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), and multiple displacement amplification (MDA). In LM-PCR, short DNA sequences called adapters are ligated to blunt ends of DNA. These adapters contain universal amplification sequences, which are used to amplify the DNA by PCR. In DOP-PCR, random primers that also contain universal amplification sequences are used in a first round of annealing and PCR. Then, a second round of PCR is used to amplify the sequences further with the universal primer sequences. MDA uses the phi-29 polymerase, which is a highly processive and non-specific enzyme that replicates DNA and has been used for single-cell analysis. In some embodiments, WGA is not performed.

In some embodiments, selective amplification or enrichment is used to amplify or enrich target loci. In some embodiments, the amplification and/or selective enrichment technique may involve PCR such as ligation mediated PCR, fragment capture by hybridization, Molecular Inversion Probes, or other circularizing probes. In some embodiments, real-time quantitative PCR (RT-qPCR), digital PCR, emulsion PCR, or single allele base extension reaction followed by mass spectrometry is used (Hung et al., J Clin Pathol 62:308-313, 2009, which is hereby incorporated by reference in its entirety). In some embodiments, capture by hybridization with hybrid capture probes is used to preferentially enrich the DNA. In some embodiments, methods for amplification or selective enrichment may involve using probes where, upon correct hybridization to the target sequence, the 3-prime end or 5-prime end of a nucleotide probe is separated from the polymorphic site of a polymorphic allele by a small number of nucleotides. This separation reduces preferential amplification of one allele, termed allele bias. This is an improvement over methods that involve using probes where the 3-prime end or 5-prime end of a correctly hybridized probe is directly adjacent to or very near to the polymorphic site of an allele. In an embodiment, probes in which the hybridizing region may contain or certainly contains a polymorphic site are excluded. Polymorphic sites at the site of hybridization can cause unequal hybridization or inhibit hybridization altogether in some alleles, resulting in preferential amplification of certain alleles. These embodiments are improvements over other methods that involve targeted amplification and/or selective enrichment in that they better preserve the original allele frequencies of the sample at each polymorphic locus, whether the sample is a pure genomic sample from a single individual or a mixture of individuals.

In some embodiments, PCR (referred to as mini-PCR) is used to generate very short amplicons (U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012, U.S. Publication No. 2013/0123120, U.S. application Ser. No. 13/300,235, filed Nov. 18, 2011, U.S. Publication No 2012/0270212, filed Nov. 18, 2011, and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are each hereby incorporated by reference in their entirety). cfDNA (such as fetal cfDNA in maternal serum or necroptically- or apoptotically-released cancer cfDNA) is highly fragmented. For fetal cfDNA, the fragment sizes are distributed in approximately a Gaussian fashion with a mean of 160 bp, a standard deviation of 15 bp, a minimum size of about 100 bp, and a maximum size of about 220 bp. The polymorphic site of one particular target locus may occupy any position from the start to the end among the various fragments originating from that locus. Because cfDNA fragments are short, the likelihood of a fragment of length L comprising both the forward and reverse primer sites is approximately one minus the ratio of the length of the amplicon to the length of the fragment. Under ideal conditions, assays in which the amplicon is 45, 50, 55, 60, 65, or 70 bp will successfully amplify from 72%, 69%, 66%, 63%, 59%, or 56%, respectively, of available template fragment molecules. In certain embodiments that relate most preferably to cfDNA from samples of individuals suspected of having cancer, the cfDNA is amplified using primers that yield a maximum amplicon length of 85, 80, 75 or 70 bp, and in certain preferred embodiments 75 bp, and that have a melting temperature between 50 and 65° C., and in certain preferred embodiments, between 54-60.5° C. The amplicon length is the distance between the 5-prime ends of the forward and reverse priming sites. 
Amplicon length that is shorter than typically used by those known in the art may result in more efficient measurements of the desired polymorphic loci by only requiring short sequence reads. In an embodiment, a substantial fraction of the amplicons are less than 100 bp, less than 90 bp, less than 80 bp, less than 70 bp, less than 65 bp, less than 60 bp, less than 55 bp, less than 50 bp, or less than 45 bp.
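The percentages quoted above for amplicons of 45 to 70 bp are consistent with a simple model in which a fragment of length L contains the whole amplicon with probability of about (L − amplicon length)/L, evaluated at the 160 bp mean fragment length. The sketch below makes that assumption explicit; the function name is hypothetical, and the use of a single mean length in place of the full fragment size distribution is a simplification.

```python
def amplifiable_fraction(amplicon_bp, fragment_bp=160):
    """Approximate fraction of fragments that contain both priming sites.

    Assumes the amplicon start is uniformly placed relative to the fragment,
    so the fraction is (fragment - amplicon) / fragment.
    """
    if amplicon_bp >= fragment_bp:
        return 0.0
    return 1.0 - amplicon_bp / fragment_bp

if __name__ == "__main__":
    for amplicon in (45, 50, 55, 60, 65, 70):
        print(amplicon, f"{100 * amplifiable_fraction(amplicon):.1f}%")
```

Under this model, shortening the amplicon from 70 bp to 45 bp raises the usable share of template molecules from roughly 56% to roughly 72%.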

In some embodiments, amplification is performed using direct multiplexed PCR, sequential PCR, nested PCR, doubly nested PCR, one-and-a-half sided nested PCR, fully nested PCR, one sided fully nested PCR, one-sided nested PCR, hemi-nested PCR, hemi-nested PCR, triply hemi-nested PCR, semi-nested PCR, one sided semi-nested PCR, reverse semi-nested PCR method, or one-sided PCR, which are described in U.S. application Ser. No. 13/683,604, filed Nov. 21, 2012, U.S. Publication No. 2013/0123120, U.S. Application Ser. No. 13/300,235, filed Nov. 18, 2011, U.S. Publication No 2012/0270212, and U.S. Ser. No. 61/994,791, filed May 16, 2014, which are hereby incorporated by reference in their entirety. If desired, any of these methods can be used for mini-PCR.

If desired, the extension step of the PCR amplification may be limited from a time standpoint to reduce amplification from fragments longer than 200 nucleotides, 300 nucleotides, 400 nucleotides, 500 nucleotides or 1,000 nucleotides. This may result in the enrichment of fragmented or shorter DNA (such as fetal DNA or DNA from cancer cells that have undergone apoptosis or necrosis) and improvement of test performance.

In some embodiments, multiplex PCR is used. In some embodiments, the method of amplifying target loci in a nucleic acid sample involves (i) contacting the nucleic acid sample with a library of primers that simultaneously hybridize to at least 100; 200; 500; 750; 1,000; 2,000; 5,000; 7,500; 10,000; 20,000; 25,000; 30,000; 40,000; 50,000; 75,000; or 100,000 different target loci to produce a reaction mixture; and (ii) subjecting the reaction mixture to primer extension reaction conditions (such as PCR conditions) to produce amplified products that include target amplicons. In some embodiments, at least 50, 60, 70, 80, 90, 95, 96, 97, 98, 99, or 99.5% of the targeted loci are amplified. In various embodiments, less than 60, 50, 40, 30, 20, 10, 5, 4, 3, 2, 1, 0.5, 0.25, 0.1, or 0.05% of the amplified products are primer dimers. In some embodiments, the primers are in solution (such as being dissolved in the liquid phase rather than in a solid phase). In some embodiments, the primers are in solution and are not immobilized on a solid support. In some embodiments, the primers are not part of a microarray. In some embodiments, the primers do not include molecular inversion probes (MIPs).

In some embodiments, two or more (such as 3 or 4) target amplicons (such as amplicons from the miniPCR method disclosed herein) are ligated together and then the ligated products are sequenced. Combining multiple amplicons into a single ligation product increases the efficiency of the subsequent sequencing step. In some embodiments, the target amplicons are less than 150, 100, 90, 75, or 50 base pairs in length before they are ligated. The selective enrichment and/or amplification may involve tagging each individual molecule with different tags, molecular barcodes, tags for amplification, and/or tags for sequencing. In some embodiments, the amplified products are analyzed by sequencing (such as by high throughput sequencing) or by hybridization to an array, such as a SNP array, the ILLUMINA INFINIUM array, or the AFFYMETRIX gene chip. In some embodiments, nanopore sequencing is used, such as the nanopore sequencing technology developed by Genia (see, for example, the world wide web at geniachip.com/technology, which is hereby incorporated by reference in its entirety). In some embodiments, duplex sequencing is used (Schmitt et al., “Detection of ultra-rare mutations by next-generation sequencing,” Proc Natl Acad Sci USA. 109(36): 14508-14513, 2012, which is hereby incorporated by reference in its entirety). This approach greatly reduces errors by independently tagging and sequencing each of the two strands of a DNA duplex. As the two strands are complementary, true mutations are found at the same position in both strands. In contrast, PCR or sequencing errors result in mutations in only one strand and can thus be discounted as technical error. In some embodiments, the method entails tagging both strands of duplex DNA with a random, yet complementary double-stranded nucleotide sequence, referred to as a Duplex Tag. 
Double-stranded tag sequences are incorporated into standard sequencing adapters by first introducing a single-stranded randomized nucleotide sequence into one adapter strand and then extending the opposite strand with a DNA polymerase to yield a complementary, double-stranded tag. Following ligation of tagged adapters to sheared DNA, the individually labeled strands are PCR amplified from asymmetric primer sites on the adapter tails and subjected to paired-end sequencing. In some embodiments, a sample (such as a DNA or RNA sample) is divided into multiple fractions, such as different wells (e.g., wells of a WaferGen SmartChip). Dividing the sample into different fractions (such as at least 5, 10, 20, 50, 75, 100, 150, 200, or 300 fractions) can increase the sensitivity of the analysis since the percent of molecules with a mutation are higher in some of the wells than in the overall sample. In some embodiments, each fraction has less than 500, 400, 200, 100, 50, 20, 10, 5, 2, or 1 DNA or RNA molecules. In some embodiments, the molecules in each fraction are sequenced separately. In some embodiments, the same barcode (such as a random or non-human sequence) is added to all the molecules in the same fraction (such as by amplification with a primer containing the barcode or by ligation of a barcode), and different barcodes are added to molecules in different fractions. The barcoded molecules can be pooled and sequenced together. In some embodiments, the molecules are amplified before they are pooled and sequenced, such as by using nested PCR. In some embodiments, one forward and two reverse primers, or two forward and one reverse primers are used.

The use of a method to target certain alleles followed by sequencing as part of a method for allele calling or ploidy calling may confer a number of unexpected advantages. Some methods by which DNA may be targeted, or selectively enriched, include using circularizing probes, linked inverted probes (LIPs), capture by hybridization methods such as SURE SELECT, and targeted PCR amplification strategies.

Some embodiments of the present disclosure involve the use of “Linked Inverted Probes” (LIPs), which have been previously described in the literature. LIPs is a generic term meant to encompass technologies that involve the creation of a circular molecule of DNA, where the probes are designed to hybridize to a targeted region of DNA on either side of a targeted allele, such that addition of appropriate polymerases and/or ligases, and the appropriate conditions, buffers and other reagents, will complete the complementary, inverted region of DNA across the targeted allele to create a circular loop of DNA that captures the information found in the targeted allele. LIPs may also be called pre-circularized probes, pre-circularizing probes, or circularizing probes. The LIPs probe may be a linear DNA molecule between 50 and 500 nucleotides in length, and in a preferred embodiment between 70 and 100 nucleotides in length; in some embodiments, it may be longer or shorter than described herein. Other embodiments of the present disclosure involve different incarnations of the LIPs technology, such as Padlock Probes and Molecular Inversion Probes (MIPs).

There are many methods that may be used to measure the genetic data of the individual and/or the related individuals in the aforementioned contexts. The different methods comprise a number of steps, those steps often involving amplification of genetic material, addition of oligonucleotide probes, ligation of specified DNA strands, isolation of sets of desired DNA, removal of unwanted components of a reaction, detection of certain sequences of DNA by hybridization, and detection of the sequence of one or a plurality of strands of DNA by DNA sequencing methods. In some cases, the DNA strands may refer to target genetic material, in some cases they may refer to primers, in some cases they may refer to synthesized sequences, or combinations thereof. These steps may be carried out in a number of different orders. Given the highly variable nature of molecular biology, it is generally not obvious which methods, and which combinations of steps, will perform poorly, well, or best in various situations.

Note that in theory it is possible to target any number of loci in the genome, anywhere from one locus to well over one million loci. If a sample of DNA is subjected to targeting, and then sequenced, the percentage of the alleles that are read by the sequencer will be enriched with respect to their natural abundance in the sample. The degree of enrichment can be anywhere from one percent (or even less) to ten-fold, hundred-fold, thousand-fold or even many million-fold. In the human genome there are roughly 3 billion base pairs (nucleotides), containing approximately 75 million polymorphic loci. The more loci that are targeted, the smaller the degree of enrichment that is possible. The fewer the loci that are targeted, the greater the degree of enrichment that is possible, and the greater the depth of read that may be achieved at those loci for a given number of sequence reads.
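The trade-off between the number of targeted loci and the depth of read can be illustrated with a back-of-the-envelope calculation. The function name, the `on_target_fraction` parameter, and the assumption that reads are spread evenly across loci are all simplifications introduced here for illustration.

```python
def mean_depth_per_locus(total_reads, num_loci, on_target_fraction=1.0):
    """Average reads per targeted locus, assuming reads are spread evenly.

    on_target_fraction: hypothetical share of reads that map to targeted loci.
    """
    return total_reads * on_target_fraction / num_loci

if __name__ == "__main__":
    reads = 10_000_000
    for loci in (100, 10_000, 1_000_000):
        print(loci, mean_depth_per_locus(reads, loci))
```

For a fixed number of sequence reads, each hundred-fold increase in the number of targeted loci reduces the average depth of read per locus by the same factor.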

In one embodiment of the present disclosure, the targeting may focus entirely on SNPs. A number of commercial targeting products are available to enrich exons. Targeting exclusively loci that include SNPs is particularly advantageous when using a method for NPD that relies on allele distributions. In one embodiment of the present disclosure, it is possible to use a targeting method that focuses on SNPs to enrich a genetic sample in polymorphic regions of the genome. In one embodiment, it is possible to focus on a small number of SNPs, for example between 1 and 100 SNPs, or a larger number, for example, between 100 and 1,000, between 1,000 and 10,000, between 10,000 and 100,000 or more than 100,000 SNPs. In one embodiment, it is possible to focus on one or a small number of chromosomes that are correlated with live trisomic births, for example chromosomes 13, 18, 21, X and Y, or some combination thereof. In one embodiment, it is possible to enrich the targeted SNPs by a small factor, for example between 1.01 fold and 100 fold, or by a larger factor, for example between 100 fold and 1,000,000 fold. In one embodiment of the present disclosure, it is possible to use a targeting method to create a sample of DNA that is preferentially enriched in polymorphic regions of the genome. In one embodiment, it is possible to use the method to create a sample of DNA that is preferentially enriched in a small number of SNPs, for example between 1 and 100 SNPs, or a larger number of SNPs, for example, between 100 and 50,000 SNPs. In one embodiment, it is possible to use the method to create a DNA sample that is enriched in SNPs located on one or a small number of chromosomes that are correlated with live trisomic births, for example chromosomes 13, 18, 21, X and Y, or some combination thereof. 
In one embodiment, it is possible to use the method to create a sample of DNA that is enriched in targeted SNPs by a small factor, for example between 1.01 fold and 100 fold, or by a larger factor, for example between 100 fold and 1,000,000 fold. In one embodiment, it is possible to use this method to create a mixture of DNA with any of these characteristics where the mixture of DNA contains maternal DNA and also free floating fetal DNA. In one embodiment, it is possible to use this method to create a mixture of DNA that has any combination of these factors. For example, it is possible to create a mixture of DNA that contains maternal DNA and fetal DNA, and that is preferentially enriched in 200 SNPs, all of which are located on either chromosome 18 or 21, and which are enriched an average of 1000 fold. In another example, it is possible to use the method to create a mixture of DNA that is preferentially enriched in 50,000 SNPs that are all located on chromosomes 13, 18, 21, X and Y, and the average enrichment per locus is 200 fold. Any of the targeting methods described herein can be used to create mixtures of DNA that are preferentially enriched in certain loci.

In some embodiments, the method may further comprise measuring the DNA contained in the mixed fraction using a DNA sequencer, and the DNA contained in the mixed fraction contains a disproportionate number of sequences from one or more chromosomes, wherein the one or more chromosomes are selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, chromosome X, chromosome Y and combinations thereof.

In one embodiment, once a mixture has been preferentially enriched at the set of target loci, it may be sequenced using any one of the previous, current, or next generation of sequencing instruments that sequences a clonal sample (a sample generated from a single molecule; examples include ILLUMINA GAIIx, ILLUMINA HISEQ or MiSEQ, LIFE TECHNOLOGIES SOLiD, 5500XL, or Ion Torrent PGM or Proton). The ratios can be evaluated by sequencing through the specific alleles within the targeted region. These sequencing reads can be analyzed and counted according to the allele type and the ratios of different alleles determined accordingly. For variations that are one to a few bases in length, detection of the alleles will be performed by sequencing, and it is essential that the sequencing read span the allele in question in order to evaluate the allelic composition of that captured molecule. The total number of captured molecules assayed for the genotype can be increased by increasing the length of the sequencing read. Full sequencing of all molecules would guarantee collection of the maximum amount of data available in the enriched pool. However, sequencing is currently expensive, and a method that can measure a certain number of allele ratios using a lower number of sequence reads will have great value. In addition, there are technical limitations to the maximum possible length of read as well as accuracy limitations as read lengths increase. The alleles of greatest utility will be of one to a few bases in length, but theoretically any allele shorter than the length of the sequencing read can be used. While allele variations come in all types, the examples provided herein focus on SNPs or variants comprised of just a few neighboring base pairs. Larger variants such as segmental copy number variants can be detected by aggregations of these smaller variations in many cases as whole collections of SNPs internal to the segment are duplicated. Variants larger than a few bases, such as STRs, require special consideration; some targeting approaches work while others will not. The evaluation of the allelic ratios is herein determined

There are multiple targeting approaches that can be used to specifically isolate and enrich one or a plurality of variant positions in the genome. Typically, these rely on taking advantage of invariant sequence flanking the variant sequence. There is prior art related to targeting in the context of sequencing where the substrate is maternal plasma (see, e.g., Liao et al., Clin. Chem.; 57(1): pp. 92-101). However, these approaches all use targeting probes that target exons, and do not focus on targeting polymorphic regions of the genome. In one embodiment of the present disclosure, the method involves using targeting probes that focus exclusively or almost exclusively on polymorphic regions. In one embodiment of the present disclosure, the method involves using targeting probes that focus exclusively or almost exclusively on SNPs. When polymorphic targeted DNA mixtures are sequenced and analyzed using an algorithm that determines ploidy using allele ratios, this targeting method is able to provide far more accurate ploidy determinations for a given number of sequence reads. In some embodiments of the present disclosure, the targeted polymorphic regions consist of at least 10% SNPs, at least 20% SNPs, at least 30% SNPs, at least 40% SNPs, at least 50% SNPs, at least 60% SNPs, at least 70% SNPs, at least 80% SNPs, at least 90% SNPs, at least 95% SNPs, at least 98% SNPs, at least 99% SNPs, at least 99.9% SNPs, or exclusively SNPs.

Targeted Sequencing Using PCR Approaches

In some embodiments, PCR can be used to target specific locations of the genome. In plasma samples, the original DNA is highly fragmented (~100-200 bp, with a peak near 150 bp). In PCR, both forward and reverse primers must anneal to the same fragment to enable amplification. Therefore, if the fragments are short, the PCR assays must amplify relatively short regions as well. As with MIPS, if the polymorphic positions are too close to the polymerase binding site, this could result in biases in the amplification from different alleles. Currently, PCR primers that target polymorphic regions, such as SNPs, are typically designed such that the 3′ end of the primer will hybridize to the base immediately adjacent to the polymorphic base or bases. In one embodiment of the present disclosure, the 3′ ends of both the forward and reverse PCR primers are designed to hybridize to bases that are one or a few positions away from the variant positions (polymorphic regions) of the targeted allele. The number of bases between the polymorphic region (SNP or otherwise) and the base to which the 3′ end of the primer is designed to hybridize may be one base, it may be two bases, it may be three bases, it may be four bases, it may be five bases, it may be six bases, it may be seven to ten bases, it may be eleven to fifteen bases, or it may be sixteen to twenty bases. The forward and reverse primers may be designed to hybridize a different number of bases away from the polymorphic region.
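
By way of illustration only, the primer geometry described above can be sketched in Python. The function names, the 20-base primer length, the 150 bp fragment length, and the 0-based coordinates are assumptions made for this sketch and are not taken from the disclosure; a single-base SNP is assumed, and only amplicon length is checked, not where a given fragment actually begins and ends.

```python
def amplicon_span(snp_pos, fwd_gap, rev_gap, primer_len=20):
    """Return the inclusive (start, end) coordinates of the amplicon.

    fwd_gap / rev_gap: number of bases lying between the SNP and the base
    to which the 3' end of the forward / reverse primer hybridizes
    (0 means the 3' end is immediately adjacent to the SNP).
    primer_len is an assumed, illustrative primer length.
    """
    if fwd_gap < 0 or rev_gap < 0:
        raise ValueError("gaps must be non-negative")
    start = snp_pos - fwd_gap - primer_len   # 5' end of the forward primer
    end = snp_pos + rev_gap + primer_len     # 5' end of the reverse primer site
    return start, end


def fits_fragment(snp_pos, fwd_gap, rev_gap, fragment_len=150, primer_len=20):
    """True if the amplicon is no longer than an assumed fragment length."""
    start, end = amplicon_span(snp_pos, fwd_gap, rev_gap, primer_len)
    return (end - start + 1) <= fragment_len


# Forward primer one base away from the SNP, reverse primer three bases away.
print(amplicon_span(1000, fwd_gap=1, rev_gap=3))
print(fits_fragment(1000, fwd_gap=1, rev_gap=3))
```

In this sketch the amplicon length is the two primer lengths plus the two gaps plus the SNP itself, so allowing the forward and reverse gaps to differ changes the amplicon length by only a few bases.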

PCR assays can be generated in large numbers; however, the interactions between different PCR assays make it difficult to multiplex them beyond about one hundred assays. Various complex molecular approaches can be used to increase the level of multiplexing, but it may still be limited to fewer than 1000 assays per reaction. Samples with large quantities of DNA can be split among multiple sub-reactions and then recombined before sequencing. For samples where either the overall sample or some subpopulation of DNA molecules is limited, splitting the sample would introduce statistical noise. In one embodiment, a small or limited quantity of DNA may refer to an amount below 10 pg, between 10 and 100 pg, between 100 pg and 1 ng, between 1 and 10 ng, or between 10 and 100 ng. Note that while this method is particularly useful on small amounts of DNA where other methods that involve splitting into multiple pools can cause significant problems related to introduced stochastic noise, this method still provides the benefit of minimizing bias when it is run on samples of any quantity of DNA. In these situations, a pre-amplification step may be used to increase the overall sample quantity. However, this pre-amplification step should not appreciably alter the allelic ratios.

In one embodiment, the method can generate hundreds to thousands of PCR products (can be 10,000 and more), e.g. for genotyping by sequencing or some other genotyping method, from limited samples such as single cells or DNA from body fluids. Currently, performing multiplex PCR reactions of more than 5 to 10 targets presents a major challenge and is often hindered by primer side products, such as primer dimers, and other artifacts. In next generation sequencing the vast majority of the sequencing reads would sequence such artifacts and not the desired target sequences in a sample. In general, to perform targeted sequencing of multiple (n) targets of a sample (greater than 10, 50, or thousands), one can split the sample into n parallel reactions that each amplify one individual target, which is problematic for samples with a limited amount of DNA. This has been performed in PCR multiwell plates or can be done in commercial platforms such as the Fluidigm Access Array (48 reactions per sample in microfluidic chips) or droplet PCR by RainDance Technologies (hundreds to a few thousand targets). Described here is a method to effectively perform many PCR amplifications that is applicable to cases where only a limited amount of DNA is available. In one embodiment, the method may be applied for analysis of single cells, body fluids, biopsies, environmental and/or forensic samples. Solution:

A) Generate and amplify a library with adaptor sequences on both ends of DNA fragments. Divide into multiple reactions after library amplification.

B) Generate (and possibly amplify) a library with adaptor sequences on both ends of DNA fragments. Perform 1000-plex amplification of selected targets using one target specific “Forward” primer per target and one tag specific primer. One can perform a second amplification from this product using “Reverse” target specific primers and one (or more) primer specific to a universal tag that was introduced as part of the target specific forward primers in the first round.

C) Perform a 1000-plex preamplification of selected targets for a limited number of cycles. Divide the product into multiple aliquots and amplify subpools of targets in individual reactions (for example, 50 to 500-plex, though this can be used all the way down to singleplex). Pool products of parallel subpool reactions.

D) During these amplifications primers may carry sequencing compatible tags (partial or full length) such that the products can easily be sequenced.
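
By way of illustration only, the division of a preamplified target set into subpools described in option C) can be sketched in Python. The function name `split_into_subpools` and the target labels are hypothetical; how targets are actually assigned to subpools (for example, to minimize primer interactions) is not specified here, so the sketch simply takes consecutive groups.

```python
def split_into_subpools(targets, plex):
    """Partition targets into consecutive subpools of at most `plex` targets."""
    if plex < 1:
        raise ValueError("plex must be at least 1")
    return [targets[i:i + plex] for i in range(0, len(targets), plex)]


# Hypothetical labels for a 1000-plex preamplification.
targets = [f"target_{i:04d}" for i in range(1000)]

subpools = split_into_subpools(targets, plex=50)
print(len(subpools), "subpools of", len(subpools[0]), "targets each")

# Singleplex is the limiting case: one target per reaction.
singleplex = split_into_subpools(targets, plex=1)
print(len(singleplex), "singleplex reactions")
```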

There is significant diagnostic value in accurately determining the relative proportion of alleles present in a sample. The interpretation of the result depends on the source of the material. In some embodiments of the present disclosure, the allelic ratio information can be used to determine the genetic state of an individual. In some embodiments of the present disclosure, this information can be used to determine the genetic state of a plurality of individuals from one DNA sample, wherein the DNA sample contains DNA from each of the plurality of individuals. In one embodiment, the allelic ratio information can be used to determine copy number of whole chromosomes from individual cells, or bulk samples. In one embodiment, the allelic ratio information can be used to determine copy number of parts, regions, or segments of chromosomes from individual cells, or bulk samples. In one embodiment, the allelic ratio information can be used to determine the relative contribution of different cell types in mosaic samples. In one embodiment, the allelic ratio information can be used to determine the fraction of fetal DNA in maternal plasma samples as well as the chromosome copy number of the fetal chromosomes.
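
By way of illustration only, the relative proportion of alleles at each locus can be computed from per-allele read counts as sketched below in Python. The function name, the locus labels, and the two-allele (A/B) representation are assumptions made for this sketch; how the resulting fractions are interpreted depends on the source of the material, as noted above.

```python
def allele_fractions(counts):
    """Map each locus to the fraction of reads carrying allele A.

    counts: dict of locus -> (reads with allele A, reads with allele B).
    Loci with no reads map to None rather than raising a division error.
    """
    fractions = {}
    for locus, (a_reads, b_reads) in counts.items():
        total = a_reads + b_reads
        fractions[locus] = a_reads / total if total else None
    return fractions


# Hypothetical read counts at three SNP loci.
example = {"snp_1": (30, 70), "snp_2": (480, 520), "snp_3": (0, 0)}
for locus, frac in allele_fractions(example).items():
    print(locus, frac)
```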

Generation of Targeted Sequencing Libraries by PCR of Greater Than 100 Targets

Described herein is a method for amplifying a region of a chromosome of interest that includes a locus of interest by first globally amplifying the plasma DNA of a sample and then dividing the sample up into multiple multiplexed target enrichment reactions with multiple target sequences per reaction. In one embodiment, the method can be used for preferentially enriching a DNA mixture at a plurality of loci, the method comprising generating and amplifying a library from a mixture of DNA where the molecules in the library have adaptor sequences ligated on both ends of the DNA fragments, dividing the amplified library into multiple reactions, and performing a first round of multiplex amplification of selected targets using one target specific “forward” primer per target and one or a plurality of adaptor specific universal “reverse” primers. In one embodiment, the method may further comprise performing a second amplification using “reverse” target specific primers and one or a plurality of primers specific to a universal tag that was introduced as part of the target specific forward primers in the first round. In one embodiment, the method may be used for preferentially enriching a DNA mixture at a plurality of loci, the method comprising performing a multiplex preamplification of selected targets for a limited number of cycles, dividing the product into multiple aliquots and amplifying subpools of targets in individual reactions, and pooling products of parallel subpool reactions. In one embodiment, the primers carry partial or full length sequencing compatible tags.

Workflow:

1. Extract plasma DNA

2. Prepare fragment library with universal adaptors on both ends of fragments.

3. Amplify library using universal primers specific to the adaptors.

4. Divide the amplified sample “library” into multiple aliquots. Perform multiplex (e.g. 100-plex, or 1000-plex with one target specific primer per target and a tag-specific primer) amplifications on aliquots.

5. Pool aliquots of one sample.

6. Barcode sample if not already done.

7. Mix samples, adjust concentration.

8. Perform sequencing.

The workflow may contain multiple sub-steps that comprise one of the listed steps (e.g. step 2. Library preparation may comprise 3 enzymatic steps (blunt ending, dA tailing and adaptor ligation) and 3 purification steps).

Steps of the workflow may be combined, divided up or performed in different order (e.g. bar coding and pooling of samples).

It is important to note that the amplification of a library can be performed in such a way that it is biased to amplify short fragments more efficiently. In this manner it is possible to preferentially amplify shorter sequences, e.g. mono-nucleosomal DNA fragments such as the cell free fetal DNA (of placental origin) found in the circulation of pregnant women.

PCR assays:

Can have the tags for sequencing (usually a truncated form of 15-25 bases). After multiplexing, PCR multiplexes of a sample are pooled and then the tags are completed (including bar coding) by a tag-specific PCR (could also be done by ligation).

The full sequencing tags can be added in the same reaction as the multiplexing. In the first cycles targets are amplified with the target specific primers, subsequently the tag-specific primers take over to complete the SQ-adaptor sequence.

The PCR primers carry no tags. After m.p. PCR the sequencing tags are appended to the amplification products by ligation.

Sequencing results:

The 12 samples were pooled at equal volumes

Pool cleaned into 100 ul Elution buffer

Pool diluted to 30 nM (was 75 nM)

Sent for sequencing

QC by qPCR

preparation of 15 cy replicates
(Orange: 8 Replicates with Barcodes 5 to 12)

15 cycles STA

- (RED STA protocol: 95 C×10 min; 95 C×15 s, 65 C×1 min, 60 C×4 min, 65 C×30 s, 72 C×30 s; 72 C×2 min)
- Used the 50 nM primers reactions
- Performed a first ExoSAP straight from product→failed to remove all primers (Bioanalyzer): just leave this step out in the future.
- Dilute 1/10 (adding 90 ul H₂O)
- 2 ul in 14 ul ExoSAP reaction→dilute to 50 ul=1/25 dilution in this step=total 1/250

Append SQ tags (longer, full F-SQ and R-m.p. adaptor without barcodes):

- 1 ul DNA in 10 ul PCR: F-SQ x R-SQ-m.p.; concentrations: 200 nM?
- 15 cycles: 95 C×10 min; 95 C×15 s, 60 C×30 s, 65 C×15 s, 72 C×30 s; 72 C×2 min
- Add 90 ul H2O, use 1 ul for next step, primer carry over will be 1/100 of conc in this reaction

Barcoding PCR (p.9 quick book):

- 1 ul DNA in 10 ul PCR: F-SQ x R-SQ-BC1 to 12-lib.; concentrations: 1 uM
- 15 cycles: 95 C×10 min; 95 C×15 s, 60 C×15 s, 72 C×30 s; 72 C×2 min
- Add 40 ul H2O

→check 1 ul on Bioanalyzer DNA1000 chip→pool samples→clean up→Bioanalyzer, adjust conc→sequencing

prep of 30 cy replicate
(Yellow: 1 Replicate with Barcode 4 into Sequencing)

30 cycles STA

- (Yellow STA protocol: 95 C×10 min; 95 C×15 s, 65 C×1 min, 60 C×4 min, 65 C×30 s, 72 C×30 s; 72 C×2 min)
- Used the 50 nM primers reactions
- Performed a first ExoSAP straight from product→failed to remove all primers (Bioanalyzer): just leave this step out in the future.
- Dilute 1/10 (adding 90 ul H2O)
- Dilute 1/100→1/25 dilution=total 1/25,000
- Probably did not perform ExoSAP clean up, small uncertainty from notes

Append SQ tags (longer, full F-SQ and R-m.p. adaptor without barcodes):

- 1 ul DNA in 10 ul PCR: F-SQ×R-SQ-m.p.; concentrations: 200 nM?
- 15 cycles: 95 C×10 min; 95 C×15 s, 60 C×30 s, 65 C×15 s, 72 C×30 s; 72 C×2 min
- Add 90 ul H2O, use 1 ul for next step, primer carry over will be 1/100 of conc in this reaction

Barcoding PCR (p.9 quick book):

- 1 ul DNA in 10 ul PCR: F-SQ×R-SQ-BC1 to 12-lib.; concentrations: 1 uM
- 15 cycles: 95 C×10 min; 95 C×15 s, 60 C×15 s, 72 C×30 s; 72 C×2 min
- Add 40 ul H2O

→check 1 ul on Bioanalyzer DNA1000 chip→pool samples→clean up→Bioanalyzer, adjust conc→sequencing
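
By way of illustration only, the programmed hold time of the STA thermocycling protocol transcribed above can be tallied in Python for the 15-cycle and 30-cycle preparations. The sketch assumes that the semicolons in the notes separate an initial hold, a repeated cycle, and a final hold; that reading is an interpretation of the notes, and ramp times between temperatures are ignored. The function name is hypothetical.

```python
def hold_time_seconds(initial, cycle_steps, n_cycles, final):
    """Sum programmed hold times in seconds; ramp times are not included."""
    return initial + n_cycles * sum(cycle_steps) + final


# STA protocol as transcribed: 95 C x 10 min; [95 C x 15 s, 65 C x 1 min,
# 60 C x 4 min, 65 C x 30 s, 72 C x 30 s]; 72 C x 2 min
sta_cycle = [15, 60, 240, 30, 30]          # seconds per step within one cycle

sta_15 = hold_time_seconds(10 * 60, sta_cycle, 15, 2 * 60)
sta_30 = hold_time_seconds(10 * 60, sta_cycle, 30, 2 * 60)

print("15 cycles:", sta_15 / 60, "min")
print("30 cycles:", sta_30 / 60, "min")
```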

Prep of 1000-plex reactions
(Blue: 1000-Plex; from Amplified SQ Libraries (p.32 Lab Book BZ1))

BC2=ASQ8=pregnancy plasma 2666 or 2687; BC3=ASQ4=apo sup 16777

15 cycles STA

- (RED STA protocol: 95 C×10 min; 95 C×15 s, 65 C×1 min, 60 C×4 min, 65 C×30 s, 72 C×30 s; 72 C×2 min)
- 50 nM target specific tagged R-primers and 200 nM F-SQ-primer
- Performed a first ExoSAP straight from product→failed to remove all primers (Bioanalyzer): just leave this step out in the future.
- Dilute 1/5 (adding 40 ul H2O)
- 2 ul in 14 ul ExoSAP reaction→dilute to 100 ul=1/50 dilution in this step=total 1/250

Append SQ tags (longer, full F-SQ and R-m.p. adaptor without barcodes):

- 1 ul DNA in 10 ul PCR: F-SQ×R-SQ-m.p.; concentrations: 200 nM?
- 15 cycles: 95 C×10 min; 95 C×15 s, 60 C×30 s, 65 C×15 s, 72 C×30 s; 72 C×2 min
- Add 90 ul H2O, use 1 ul for next step, primer carry over will be 1/100 of conc in this reaction

Barcoding PCR (p.9 quick book):

- 1 ul DNA in 10 ul PCR: F-SQ×R-SQ-BC1 to 12-lib.; concentrations: 1 uM
- 15 cycles: 95 C×10 min; 95 C×15 s, 60 C×15 s, 72 C×30 s; 72 C×2 min
- Add 40 ul H2O

→check 1 ul on Bioanalyzer DNA1000 chip→pool samples→clean up→Bioanalyzer, adjust conc→sequencing
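
By way of illustration only, the cumulative dilutions recorded in the three preparations above can be checked with exact fractions in Python. The per-step factors are those stated in the notes; the function name `total_dilution` and the variable names are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from operator import mul


def total_dilution(steps):
    """Multiply successive dilution factors into one overall factor."""
    return reduce(mul, steps, Fraction(1))


# Step factors as recorded in the notes for each preparation.
prep_15_cycles = [Fraction(1, 10), Fraction(2, 50)]                    # 1/10, then 2 ul to 50 ul
prep_30_cycles = [Fraction(1, 10), Fraction(1, 100), Fraction(1, 25)]
prep_1000_plex = [Fraction(1, 5), Fraction(2, 100)]                    # 1/5, then 2 ul to 100 ul

for name, steps in [("15 cycles", prep_15_cycles),
                    ("30 cycles", prep_30_cycles),
                    ("1000-plex", prep_1000_plex)]:
    print(name, total_dilution(steps))
```

The results agree with the totals written in the notes: 1/250, 1/25,000, and 1/250, respectively.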

By making use of targeting approaches in sequencing the mixed sample, it may be possible to achieve a certain level of accuracy with fewer sequence reads. The accuracy may refer to sensitivity, it may refer to specificity, or it may refer to some combination thereof. The desired level of accuracy may be between 90% and 95%; it may be between 95% and 98%; it may be between 98% and 99%; it may be between 99% and 99.5%; it may be between 99.5% and 99.9%; it may be between 99.9% and 99.99%; it may be between 99.99% and 99.999%, it may be between 99.999% and 100%. Levels of accuracy above 95% may be referred to as high accuracy.

There are a number of published methods in the prior art that demonstrate how one may determine the ploidy state of a fetus from a mixed sample of maternal and fetal DNA, for example: G. J. W. Liao et al. Clinical Chemistry 2011; 57(1) pp. 92-101. These methods target thousands of locations along each chromosome. The number of locations along a chromosome that may be targeted while still resulting in a high accuracy ploidy determination on a fetus, for a given number of sequence reads, from a mixed sample of DNA is unexpectedly low. In one embodiment of the present disclosure, an accurate ploidy determination may be made by using targeted sequencing, using any method of targeting, for example qPCR, capture by hybridization, or circularizing probes, wherein the number of loci along a chromosome that need to be targeted may be between 1,000 and 500 loci; it may be between 500 and 300 loci; it may be between 300 and 200 loci; it may be between 200 and 150 loci; it may be between 150 and 100 loci; it may be between 100 and 50 loci; it may be between 50 and 20 loci; it may be between 20 and 10 loci. Optimally, it may be between 100 and 500 loci. The high level of accuracy may be achieved by targeting a small number of loci and executing an unexpectedly small number of sequence reads. The number of reads may be between 5 million and 2 million reads; the number of reads may be between 2 million and 1 million; the number of reads may be between 1 million and 500,000; the number of reads may be between 500,000 and 200,000; the number of reads may be between 200,000 and 100,000; the number of reads may be between 100,000 and 50,000; the number of reads may be between 50,000 and 20,000; the number of reads may be between 20,000 and 10,000; the number of reads may be below 10,000.

In some embodiments, there is a composition comprising a mixture of DNA of fetal origin, and DNA of maternal origin, wherein the percent of sequences that uniquely map to chromosome 13 is greater than 4%, greater than 5%, greater than 6%, greater than 7%, greater than 8%, greater than 9%, greater than 10%, greater than 12%, greater than 15%, greater than 20%, greater than 25%, or greater than 30%. In some embodiments of the present disclosure, there is a composition comprising a mixture of DNA of fetal origin, and DNA of maternal origin, wherein the percent of sequences that uniquely map to chromosome 18 is greater than 3%, greater than 4%, greater than 5%, greater than 6%, greater than 7%, greater than 8%, greater than 9%, greater than 10%, greater than 12%, greater than 15%, greater than 20%, greater than 25%, or greater than 30%. In some embodiments of the present disclosure, there is a composition comprising a mixture of DNA of fetal origin, and DNA of maternal origin, wherein the percent of sequences that uniquely map to chromosome 21 is greater than 2%, greater than 3%, greater than 4%, greater than 5%, greater than 6%, greater than 7%, greater than 8%, greater than 9%, greater than 10%, greater than 12%, greater than 15%, greater than 20%, greater than 25%, or greater than 30%. In some embodiments of the present disclosure, there is a composition comprising a mixture of DNA of fetal origin, and DNA of maternal origin, wherein the percent of sequences that uniquely map to chromosome X is greater than 6%, greater than 7%, greater than 8%, greater than 9%, greater than 10%, greater than 12%, greater than 15%, greater than 20%, greater than 25%, or greater than 30%. 
In some embodiments of the present disclosure, there is a composition comprising a mixture of DNA of fetal origin, and DNA of maternal origin, wherein the percent of sequences that uniquely map to chromosome Y is greater than 1%, greater than 2%, greater than 3%, greater than 4%, greater than 5%, greater than 6%, greater than 7%, greater than 8%, greater than 9%, greater than 10%, greater than 12%, greater than 15%, greater than 20%, greater than 25%, or greater than 30%.
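
By way of illustration only, the percent of uniquely mapped sequences assigned to each chromosome can be computed as sketched below in Python and compared with a threshold such as those recited above. The function name, the chromosome labels, and the example read counts are hypothetical; read mapping itself is assumed to have been done upstream.

```python
from collections import Counter


def percent_by_chromosome(read_chromosomes):
    """Return {chromosome: percent of reads} for uniquely mapped reads.

    read_chromosomes: iterable with one chromosome name per mapped read.
    """
    counts = Counter(read_chromosomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {chrom: 100.0 * n / total for chrom, n in counts.items()}


# Hypothetical mapping result for 100 reads.
reads = ["chr21"] * 3 + ["chr13"] * 5 + ["chr1"] * 92
percents = percent_by_chromosome(reads)

# Compare against an example threshold (greater than 2% for chromosome 21).
print(percents["chr21"], percents["chr21"] > 2.0)
```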

In some embodiments, there is a composition comprising a mixture of DNA of fetal origin, and DNA of maternal origin, wherein the percent of sequences that uniquely map to a chromosome, that contains at least one single nucleotide polymorphism is greater than 0.2%, greater than 0.3%, greater than 0.4%, greater than 0.5%, greater than 0.6%, greater than 0.7%, greater than 0.8%, greater than 0.9%, greater than 1%, greater than 1.2%, greater than 1.4%, greater than 1.6%, greater than 1.8%, greater than 2%, greater than 2.5%, greater than 3%, greater than 4%, greater than 5%, greater than 6%, greater than 7%, greater than 8%, greater than 9%, greater than 10%, greater than 12%, greater than 15%, or greater than 20%, and where the chromosome is taken from the group 13, 18, 21, X, or Y. In some embodiments of the present disclosure, there is a composition comprising a mixture of DNA of fetal origin, and DNA of maternal origin, wherein the percent of sequences that uniquely map to a chromosome and that contain at least one single nucleotide polymorphism from a set of single nucleotide polymorphisms is greater than 0.15%, greater than 0.2%, greater than 0.3%, greater than 0.4%, greater than 0.5%, greater than 0.6%, greater than 0.7%, greater than 0.8%, greater than 0.9%, greater than 1%, greater than 1.2%, greater than 1.4%, greater than 1.6%, greater than 1.8%, greater than 2%, greater than 2.5%, greater than 3%, greater than 4%, greater than 5%, greater than 6%, greater than 7%, greater than 8%, greater than 9%, greater than 10%, greater than 12%, greater than 15%, or greater than 20%, where the chromosome is taken from the set of chromosome 13, 18, 21, X and Y, and where the number of single nucleotide polymorphisms in the set of single nucleotide polymorphisms is between 1 and 10, between 10 and 20, between 20 and 50, between 50 and 100, between 100 and 200, between 200 and 500, between 500 and 1,000, between 1,000 and 2,000, between 2,000 and 5,000, between 5,000 and 
10,000, between 10,000 and 20,000, between 20,000 and 50,000, or between 50,000 and 100,000.

In theory, each cycle in the amplification doubles the amount of DNA present, however, in reality, the degree of amplification is slightly lower than two. In theory, amplification, including targeted amplification, will result in bias free amplification of a DNA mixture. When DNA is amplified, the degree of allelic bias typically increases with the number of amplification steps. In some embodiments, the methods described herein involve amplifying DNA with a low level of allelic bias. Since the allelic bias compounds, one can determine the per cycle allelic bias by calculating the nth root of the overall bias where n is the base 2 logarithm of degree of enrichment. In some embodiments, there is a composition comprising a second mixture of DNA, where the second mixture of DNA has been preferentially enriched at a plurality of polymorphic loci from a first mixture of DNA where the degree of enrichment is at least 10, at least 100, at least 1,000, at least 10,000, at least 100,000 or at least 1,000,000, and where the ratio of the alleles in the second mixture of DNA at each locus differs from the ratio of the alleles at that locus in the first mixture of DNA by a factor that is, on average, less than 1,000%, 500%, 200%, 100%, 50%, 20%, 10%, 5%, 2%, 1%, 0.5%, 0.2%, 0.1%, 0.05%, 0.02%, or 0.01%. In some embodiments, there is a composition comprising a second mixture of DNA, where the second mixture of DNA has been preferentially enriched at a plurality of polymorphic loci from a first mixture of DNA where the per cycle allelic bias for the plurality of polymorphic loci is, on average, less than 10%, 5%, 2%, 1%, 0.5%, 0.2%, 0.1%, 0.05%, or 0.02%. In some embodiments, the plurality of polymorphic loci comprises at least 10 loci, at least 20 loci, at least 50 loci, at least 100 loci, at least 200 loci, at least 500 loci, at least 1,000 loci, at least 2,000 loci, at least 5,000 loci, at least 10,000 loci, at least 20,000 loci, or at least 50,000 loci.
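
By way of illustration only, the per cycle allelic bias calculation described above (the nth root of the overall bias, where n is the base 2 logarithm of the degree of enrichment) can be sketched in Python. The sketch assumes the bias is expressed as a multiplicative factor on the allele ratio, with 1.0 meaning no bias; that convention and the function name are assumptions made here.

```python
import math


def per_cycle_bias(overall_bias, enrichment):
    """Per cycle bias factor, assuming bias compounds once per doubling.

    overall_bias: multiplicative change in the allele ratio (1.0 = none).
    enrichment:   degree of enrichment, e.g. 1024 for 1024 fold.
    """
    if overall_bias <= 0:
        raise ValueError("overall_bias must be positive")
    if enrichment <= 1:
        raise ValueError("enrichment must be greater than 1")
    n = math.log2(enrichment)          # effective number of doublings
    return overall_bias ** (1.0 / n)   # nth root of the overall bias


# A 1024 fold enrichment corresponds to n = 10 doublings.
factor = per_cycle_bias(overall_bias=1.10, enrichment=1024)
print(f"per cycle bias: {100 * (factor - 1):.2f}%")
```

Under these assumptions, a 10% overall change in allele ratio after 1024 fold enrichment corresponds to a per cycle bias of a little under 1%.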

EXPERIMENTAL SECTION

The presently disclosed embodiments are described in the following Examples, which are set forth to aid in the understanding of the disclosure, and should not be construed to limit in any way the scope of the disclosure as defined in the claims which follow thereafter. The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to use the described embodiments, and are not intended to limit the scope of the disclosure, nor are they intended to represent that the experiments below are all or the only experiments performed. Efforts have been made to ensure accuracy with respect to numbers used (e.g., amounts, temperature, etc.) but some experimental errors and deviations should be accounted for. Unless indicated otherwise, parts are parts by volume, and temperature is in degrees Centigrade. It should be understood that variations in the methods as disclosed may be made without changing the fundamental aspects that the experiments are meant to illustrate.

EXAMPLE 1

This example provides a protocol that was used to validate the performance of a test method for determining the presence or absence of aneuploidy according to the present invention. The test method includes a first allelic analysis method that uses a joint distribution model, utilizing only data from chromosomes of interest without control chromosomes, to identify samples that are high confidence diploid samples. The identity of these diploid samples is then passed to a second analysis method that is a non-allelic method that produces a likelihood of a ploidy state. Aneuploid probabilities for each test chromosome for each sample were analyzed for each method, and a set of rules was used to determine whether to call a given sample as a high risk sample, that is, a sample with a high probability of aneuploidy. The set of rules included at least one rule that combines the aneuploidy confidences from the first method and the second method for a given chromosome of interest for a given sample. The test method eliminates the additional expense and variability introduced by the use of a separate control chromosome. The validation protocol was used to validate test method accuracy with measurements of test sensitivity and specificity on clinical samples (Arm 1) and to validate test method precision with measurements of test reproducibility of clinical sample results and quality control (QC) pass rate (Arm 2).

Background

The test method estimates the fetal copy number of chromosomes 13, 18, 21, X, and Y from a maternal blood sample. The test method utilizes cell free DNA (cfDNA), a mixture of maternal and fetal DNA isolated from the plasma of pregnant women. The cfDNA is first made into a library by ligation of adapters followed by amplification to increase the available total DNA. 13,392 distinct genetic loci are amplified by targeted multiplex PCR, each containing a single nucleotide polymorphism (SNP). The SNP amplicons are then sequenced using next generation sequencing technology to determine the frequency of the SNP alleles at each locus. In parallel, genomic DNA is extracted from the maternal blood cells and, optionally, paternal cheek cells. These genomic DNAs are amplified and sequenced in a similar manner to plasma DNA libraries. The resultant SNP allele ratios from the plasma sample and parental samples are analyzed to create a maximum-likelihood estimate of the fetal chromosome copy number for each targeted chromosome.

After sequencing, the sequence data first goes through a QC process which determines whether the samples have been successfully prepared and are eligible to be run through the Panorama copy number algorithms. If a sample fails the QC process, it is typically re-prepared or resequenced. In general, all samples are expected to eventually pass the various QC thresholds. In contrast, the algorithm data review thresholds are the criteria used to determine a chromosome copy number result. These algorithm data review thresholds were only applied to data that had already passed through the QC process.

In Arm 1 of the validation protocol, ≥750 clinical samples (≥300 high risk and ≥450 low risk) were tested to validate that the test sensitivity and specificity meet the product requirements as described in PRD-00104 Requirements Document-NIPT Panorama Rev 04.

In Arm 2 of the validation protocol, 192 samples were split into three daughter replicates that were tested with three lots of selected reagents, three sets of selected instrumentation, three operators, and on three separate days to validate laboratory reproducibility.

Reagents.

Table 1 provides a list of reagents that were used for the execution of the validation protocol. DNA sequencing was carried out on a HiSeq Model 2500 (Illumina, San Diego, Calif.). Thermocycling was performed using a GeneAmp PCR System 9700 (Model N8050001) (Life Technologies, Carlsbad, Calif.).

TABLE 1 List of Required Reagents

Reagent | Manufacturer | Manufacturer Part Number
4X Qiagen Multiplex PCR Master Mix Lots 1-3 | Qiagen | 1076436
5M TMAC lots 1-3 | Sigma | 639202
cfDNA Multiplex PCR Reagents: Lots 1-3 | Natera | 111100
cfDNA OneStar | Natera | 1121144
cfDNA OneStar | Natera | 1121100
gDNA Multiplex PCR Reagents | Natera | 121144
gDNA STAR 1 | Natera | 1221144
gDNA STAR 2 | Natera | 1222144
Molecular biology grade, DI water | Life Technologies | 10977-023
F-BC (Barcoding) Primer | IDT | n/a
R-SQ_NB4 Barcode Plates | IDT | n/a
QIAquick PCR Purification Kit | Qiagen | 28106
3M Sodium Acetate Solution | Life Technologies | AM9740
Quant-iT dsDNA Broad-Range Assay Kit (1000) | Life Technologies | Q33130
TruSeq Rapid SR Cluster Kit | Illumina | TG-402-4000 or GD-402-4001
10 nM Barcoded PhiX (NB2: 271 PhiX) | IDT | n/a
PhiX kit 10 nM stock for cbot (10 UL) | Illumina | FC-110-3001
TruSeq Rapid SBS Kit (50 cycle) | Illumina | TG-402-4002 or FC-402-4002
2N Sodium Hydroxide | Fisher Scientific | SS264-1
1N Sodium Hydroxide | Fisher Scientific | SS266-1

Statistical Approach/Sample Size

Justifications for the sampling strategy and statistical techniques used for each arm of the validation protocol are provided below.

Arm 1 consisted of ≥300 samples known to be from women carrying a fetus with Trisomy 13, 18, or 21, Monosomy X, or triploidy. This positive sample cohort consists of all available samples for which copy number truth has been confirmed. The positive set was selected to produce the best possible measurement of test sensitivity.

≥450 samples known to be from women carrying a euploid fetus were selected for Arm 1. The desired specificity of 0.998 corresponds to one error in 500. The sample set was selected to achieve maximal resolution on the specificity measurement, while maintaining compatibility with the requirements related to automation and plate layout, and practical feasibility given the high cost of running samples. Although the specificity calculation will be performed using a child fraction estimate adjustment (described in the analysis section), the distribution of child fraction estimates in the samples is not known a priori and therefore cannot be used to set the sample size.

Arm 2 consisted of three replicates of a test unit of 192 samples. This number of samples is driven by the automation protocol, which requires at least two plates of 96 samples each.

All plasma-derived samples used in the validation protocol entered the protocol workflow in the form of an amplified purified cfDNA library produced from the extracted DNA of maternal plasma.

Parental samples from two sources were used in the protocol: Maternal gDNA extracted from centrifuged maternal blood samples from which plasma has been removed; and paternal gDNA prepared from a buccal sample.

Arm 1: Sensitivity and Specificity

In the sensitivity and specificity arm, the accuracy of the test method was determined by comparing test results of the samples used in the validation protocol to their known fetal chromosome copy number.

QC failure in a plasma sample due to contamination or low NOR required that the plasma sample be rerun through the protocol. However, due to the limited volume of plasma library available for some samples, it was not possible to rerun some samples. In those cases, samples were excluded from all Arm 1 analyses. Failed mother samples were rerun at most 2 times. Failed father samples were not rerun. Due to the high maximum capacity of the laboratory automation workflow, all plasma DNA library samples in Arm 1 were processed in a single batch.

Arm 2: Laboratory Reproducibility

In the laboratory reproducibility arm, the reproducibility of the test method was assessed using multiple reagent lots, sets of equipment, test operators, and days. For non-critical reagents and instruments, single lots were used because they are outside the scope of this reproducibility testing. Specific reagents, instruments, operators, and execution dates were used for each run of 192 samples.

192 samples were tested for each of the three runs in the reproducibility test. Samples were isolated and extracted prior to the execution of the validation protocol. For each sample, four tubes of plasma (~3-5 mL each) were extracted in two pairs of tubes. Each of the two extractions per sample was prepared into purified plasma DNA library, and the two libraries were then pooled into a single well for each case. The pooling generated approximately 70-75 μL of library material for each case. Each pooled library sample was distributed into 3 replicate sample plates (22 μL each) for use in the validation protocol.

Replicate number 1 of the 192 samples tested in Arm 2 was included in Arm 1 and underwent high depth of read reflex and rerun as necessary to generate results for Arm 1 analysis. Arm 2 replicate numbers 2 and 3 did not undergo high depth of read reflex or rerun. For all three replicates in Arm 2, only low depth of read analysis was performed.

Only samples from replicates 1, 2, and 3 with a sufficient child fraction estimate (≥6% for low risk calls and ≥10% for high risk calls) to be called at low depth of read were analyzed in the Arm 2 reproducibility experiment.

While each plasma sample was tested three times for reproducibility, the corresponding parent samples for each plasma sample trio were not amplified and sequenced as replicates. Mother samples were rerun as necessary to generate a passing QC result. Father samples were not rerun. The resultant parent sample data was used in the analysis of all three plasma sample replicates.

Data Analysis

Structural changes were made to the test method algorithms to reflect the removal of all targeted loci on chromosomes 1 and 2. The resultant SNP allele ratios from the plasma sample and parental samples were analyzed to create a maximum-likelihood estimate of the fetal chromosome copy number for each targeted chromosome. The maximum-likelihood estimate was based on two different algorithms as disclosed elsewhere herein, the het rate method and the quantitative modeling method (QMM). The het rate algorithm is based on analysis of the observed allele ratios (fraction of reference allele) at each SNP using a joint distribution model. The QMM algorithm is based on non-allelic analysis of the number of sequencing reads at each SNP in a method that produces a maximum likelihood for various ploidy hypotheses.

Data from both Arms were processed through the test method.

Analysis of Arm 1

Analysis was performed using the father sample when available.

Samples with unrecoverable QC failures were not included in syndrome analysis and did not count toward the syndrome denominator for rate calculations, including the no-call rate.

The criteria for aneuploidy detection were verified by observation.

The criteria for detection of at least one male and one female sample were verified by observation. The number of incorrect gender calls was computed and compared to the acceptance criteria.

Each sample was evaluated for each syndrome for a result from {high risk, low risk, risk unchanged}.

Each syndrome in the set (Trisomy 13, Trisomy 18, Trisomy 21, Monosomy X) was analyzed independently for sensitivity and specificity and the results were compared to the acceptance criteria for UR0070. Sensitivity and specificity were computed for each syndrome according to the CFE projection method described in Appendix A below, along with an approximate variance. The CFE distribution from the article Pergament et al. (2014) (Obstet Gynecol August: 124; 210-8) was used. The acceptance criteria were met if the desired sensitivity and specificity fell within the confidence bounds of the estimates from the data. The confidence bounds were defined as the estimate plus or minus 3 times the square root of the estimated approximate variance.

Results with unchanged risk were not included in the sensitivity or specificity computations but were reflected in the computed no-call rate.

The requirement on no-call rate was evaluated on the subset of euploid-truth samples passing QC, rather than the complete data set. The no-call rate was computed using the CFE projection method provided in Appendix A and the commercial CFE distribution. The aneuploidy rate in commercial data is less than 2 percent and so the contribution of aneuploid samples to the commercial no-call rate is negligible.

Analysis of Arm 2

Reproducibility of clinical calls was evaluated on eligible trios. An eligible trio was one that passed QC and produced calls on more than 1 sample replicate. The acceptance criterion was that there was not more than one changed call in the set of eligible trios. A changed call is defined as a change from high risk to low risk or the opposite. The number of changed calls was identified and compared to the acceptance criterion.
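The eligibility and changed-call rules above can be sketched in Python as follows. This is an illustrative sketch only, not the analysis code used in the study. The data layout (a mapping from case identifier to a list of per-replicate calls, with `None` for a replicate without a call), the function name `summarize_reproducibility`, and the example case identifiers are all hypothetical.

```python
def summarize_reproducibility(calls_by_case):
    """Return (eligible_cases, changed_cases) for replicate call data."""
    eligible, changed = [], []
    for case_id, calls in calls_by_case.items():
        made = [c for c in calls if c is not None]
        # A trio is eligible only if calls exist on more than one replicate.
        if len(made) < 2:
            continue
        eligible.append(case_id)
        # A changed call is a mix of high risk and low risk results.
        if "high risk" in made and "low risk" in made:
            changed.append(case_id)
    return eligible, changed


# Hypothetical replicate data (not study results).
example = {
    "case_A": ["low risk", "low risk", "low risk"],
    "case_B": ["low risk", None, "low risk"],
    "case_C": ["low risk", None, None],
    "case_D": ["high risk", "low risk", "high risk"],
}
eligible, changed = summarize_reproducibility(example)
# Acceptance criterion: not more than one changed call among eligible trios.
meets_criterion = len(changed) <= 1
```

In this hypothetical data set, case_C is ineligible because only one replicate produced a call, and case_D is the single changed call.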

Appendix A: Computational details for sensitivity, specificity and no-call rate projected to a known CFE distribution (generated by a commercial test)

Commercial data from Panorama Version 1 (Natera, Calif.), which used control chromosomes, was used to support the analysis by providing a representative commercial CFE distribution. The metrics calculated from the study data were used to calculate projected performance metrics for the commercial product using this distribution.

Previous experiments with the test method led to the following relationship between Panorama Version 1 CFE (f1) and the test method CFE (f2) for the same blood draw. This relationship holds in the observed range of CFE, from approximately (f1=0.01) to (f1=0.35).

f₂=0.3533f₁²+0.9136f₁

The commercial child fraction estimate distribution was determined using a set of approximately 50,000 commercial test results (Panorama Version 1, Natera, Calif.), which analyzes samples using a method that utilizes control chromosomes.

The equation above was used to convert the Panorama Version 1 commercial test CFE distribution into the test method CFE equivalent. This was regarded as the commercial child fraction estimate distribution going forward, such that all computations were done on the test method CFE.
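The conversion can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the study. The function name `convert_v1_cfe` and the decision to reject values outside the fitted range are assumptions; only the quadratic coefficients and the observed range of f1 (approximately 0.01 to 0.35) come from the text above.

```python
def convert_v1_cfe(f1: float) -> float:
    """Map a Panorama Version 1 CFE (f1) to the test method CFE (f2)."""
    # The relationship was only observed for f1 between about 0.01 and 0.35,
    # so values outside that range are rejected rather than extrapolated
    # (an assumption of this sketch).
    if not 0.01 <= f1 <= 0.35:
        raise ValueError("f1 is outside the observed range of the fit")
    return 0.3533 * f1 ** 2 + 0.9136 * f1


# Convert a hypothetical list of Version 1 CFE values (not study data).
v1_values = [0.05, 0.10, 0.20]
test_method_values = [convert_v1_cfe(f) for f in v1_values]
```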

The same approach can be used to generate the CFE distribution from the Pergament et al. (2014) publication.

A metric such as sensitivity can be projected to the commercial child fraction estimate distribution as follows:

Define a set of CFE intervals, i=1 to N.

Observe the population rate of each interval from the commercial data distribution, p_i.

Compute the metric of interest, x_i, such as sensitivity, for the subset of the syndrome data that falls within each child fraction estimate interval.

The projected value of the metric of interest is a weighted sum across the CFE intervals.

$x = \sum_{i = 1}^{N} p_{i} x_{i}$

The variance of the projected value of the metric can be approximated by a similar method.

$VAR [x] = \sum_{i = 1}^{N} p_{i} VAR [x_{i}]$
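The projection steps above can be sketched in Python. This is an illustrative sketch, not the study's implementation. The function names, the example interval weights, and the per-interval counts are hypothetical, and the per-interval binomial variance x_i(1 − x_i)/n_i is an assumption, since the text does not state how VAR[x_i] was estimated. The variance combination follows the approximation given above.

```python
import math


def binomial_variance(correct: int, total: int) -> float:
    """Assumed per-interval variance of an observed rate: x(1 - x) / n."""
    rate = correct / total
    return rate * (1.0 - rate) / total


def project_metric(weights, values, variances):
    """Project a per-interval metric onto a reference CFE distribution.

    weights   -- population rate p_i of each CFE interval (must sum to 1)
    values    -- metric x_i observed in each interval
    variances -- VAR[x_i] for each interval
    """
    if not (len(weights) == len(values) == len(variances)):
        raise ValueError("inputs must have one entry per CFE interval")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("interval population rates must sum to 1")
    projected = sum(p * x for p, x in zip(weights, values))
    variance = sum(p * v for p, v in zip(weights, variances))
    return projected, variance


# Hypothetical example with three CFE intervals (not study data).
weights = [0.2, 0.5, 0.3]
calls = [(9, 10), (40, 40), (30, 30)]  # (correct, total) per interval
values = [c / n for c, n in calls]
variances = [binomial_variance(c, n) for c, n in calls]
projected, variance = project_metric(weights, values, variances)
sigma = math.sqrt(variance)  # standard deviation used for 3-sigma bounds
```

Intervals in which every call is correct contribute zero variance under the binomial assumption, so the width of the projected bounds is driven by the intervals that contain errors.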

Results and Analysis

Arm 1 Analysis: Detection, Accuracy, Failure Rate

The analysis for detection, accuracy and failure rate includes both the set of “arm 1” samples and the first replicate of the set of “arm 2” samples. Thus the starting count of eligible samples is the combined count of 587.

Samples excluded from all Arm 1 analysis due to quality control failures

As defined in the test protocol, samples failing quality control metrics were not included in detection, accuracy or failure rate performance computations of Arm 1, which was analyzed with all samples from the Arm 1 cohort and replicate 1 of the Arm 2 cohort. Eight such cases were removed.

Those failing samples are described below in Table 2. Collectively, these samples comprise 5 cases of trisomy 21 and 3 euploid cases.

TABLE 2 Summary of Quality Control Failures for Arm 1 Analysis

Sample count | Failure reason | Affected cases
3 | Contamination | 339370, 339242, 339486
2 | Sample handling error | 339415, 338867
1 | Unrecoverable mother sample failure | 339617
2 | Failed sequencing number of reads | 339397, 339229

Aneuploidy Detection and Gender Detection

The acceptance criteria were as follows:

The test was able to detect at least one sample each of Trisomy 13, 18, 21, Monosomy X, and triploidy

The test was able to detect gender for at least one male and one female sample

Not more than two incorrect gender calls will occur for eligible samples. An incorrect gender call is defined as incorrect reporting of the presence or absence of the Y chromosome.

Table 3 shows the number and type of calls for each syndrome. Note that female samples include monosomy X and a large number (248) of samples do not include gender truth.

TABLE 3 Arm 1 Analysis Results Summary

Category | Negative | T13 | T18 | T21 | MX | Triploidy | Total | Male | Female
Eligible | 335 | 15 | 37 | 179 | 9 | 4 | 579 | 170 | 162
Algorithm Limitation (No Call) | 11 | 1 | 9 | 13 | 2 | 0 | 36 | 7 | 18
Correct Calls | 324 | 14 | 28 | 163 | 7 | 4 | 540 | 163 | 144
Incorrect Calls | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0
Other Abnormality | 0 | 0 | 0 | 1 | 0 | 0 | 1 | |

All calls were correct with the exception of 2 trisomy 21 cases called false negative and 1 trisomy 21 case called “other abnormality”. The latter is discussed in more detail below.

Case 338833 was identified as having trisomy 21 through karyotype analysis of the CVS biopsy. The case was reported as “no-call due to suspected abnormality” because in addition to the trisomy 21, which was detected, there were also abnormal indications on the X chromosome. This case was not counted as a negative call in the sensitivity computation because the result was suspected abnormality including trisomy 21.

Thus, the test method was able to detect at least one sample each of Trisomy 13, 18, 21, Monosomy X, and triploidy, was able to detect gender for at least one male and one female sample, and no incorrect gender calls occurred. Therefore, the aneuploidy detection and gender detection acceptance criteria were met by the test method.

Sensitivity and Specificity

The acceptance criteria were that the sensitivity and specificity thresholds listed below fell within or below the 3-sigma bounds estimated from the data. Only samples with algorithm results were included in the analysis, and the raw sensitivity and specificity values were normalized to match the fetal fraction distribution observed in the publication by Pergament et al. (2014).

T21: Sensitivity≥99.01% and specificity≥99.89%

T18: Sensitivity≥96.00% and specificity≥99.98%

T13: Sensitivity≥90.00% and specificity≥99.91%

MX: Sensitivity≥90.00% and specificity≥99.91%

Raw and fetal fraction distribution-adjusted measurements of sensitivity and specificity with confidence bounds are presented in Table 4 below.

TABLE 4 Sensitivity

Measure | Trisomy 13 | Trisomy 18 | Trisomy 21 | Monosomy X
Correct Calls | 14 | 28 | 163 | 7
Incorrect Calls | 0 | 0 | 2 | 0
Observed Sensitivity | 100% | 100% | 98.8% | 100%
Observed 95% CI (%) | 76.8-100 | 87.7-100 | 95.7-99.9 | 59.0-100
Projected Sensitivity | 100% | 100% | 97.1% | 100%
Projected 3-Sigma Bounds (%) | 78.8-100 | 87.3-100 | 87.7-100 | N/A
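The text does not name the method used for the observed 95% CI row of Table 4. As an assumption, an exact binomial (Clopper-Pearson) interval appears consistent with the tabulated values: for 14 correct calls out of 14, the exact lower bound is 0.025^(1/14), or about 76.8%. The Python sketch below computes such an interval by bisection on the binomial distribution. The function names are hypothetical and the choice of interval is an assumption, not a statement of the method actually used.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k + 1))


def _root_of_increasing(g, iterations: int = 100) -> float:
    """Bisection on [0, 1] for a function g that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided binomial interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_of_increasing(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2.0)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_of_increasing(
        lambda p: alpha / 2.0 - binom_cdf(k, n, p))
    return lower, upper


# Correct calls and totals taken from Table 4.
t13_interval = clopper_pearson(14, 14)
t21_interval = clopper_pearson(163, 165)
```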

The observed sensitivity for trisomy 21 was 98.8%. After fetal fraction adjustment, the estimated sensitivity was 97.1% with a standard deviation of 3.1% and 3-sigma confidence bounds of 87.7%-100%.

The observed specificity for trisomy 13, trisomy 18, trisomy 21 and monosomy X was 100%. No adjustment to the fetal fraction distribution was applied. Monosomy X had too few calls to evaluate the fetal fraction adjustment in the confidence interval, so projected bounds were not given.

Therefore, all syndromes met the sensitivity and specificity acceptance criteria that the required minimum values fall within or below 3 standard deviations of the observed value.
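The 3-sigma acceptance check for sensitivity can be sketched in Python as follows. The function names are hypothetical, and clipping the bounds to the 0%-100% range is inferred from the reported bounds rather than stated explicitly. With the rounded trisomy 21 values given above (97.1% and 3.1%), the lower bound evaluates to about 87.8%, close to the reported 87.7%, which was presumably computed from unrounded values.

```python
def three_sigma_bounds(estimate: float, sigma: float):
    """Return (lower, upper) 3-sigma bounds in percent, clipped to [0, 100]."""
    lower = max(0.0, estimate - 3.0 * sigma)
    upper = min(100.0, estimate + 3.0 * sigma)
    return lower, upper


def sensitivity_criterion_met(required_minimum: float,
                              estimate: float, sigma: float) -> bool:
    """True when the required minimum falls within or below the bounds."""
    _, upper = three_sigma_bounds(estimate, sigma)
    return required_minimum <= upper


# Trisomy 21: projected sensitivity 97.1%, standard deviation 3.1%,
# required minimum sensitivity 99.01%.
t21_bounds = three_sigma_bounds(97.1, 3.1)
t21_ok = sensitivity_criterion_met(99.01, 97.1, 3.1)
```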

Although test sensitivity for sex chromosome trisomies such as XXX, XXY, and XYY was not specifically addressed by this study, there were no false positives for these syndromes among any of the called cases, suggesting that sex chromosome trisomy specificity was near 100%.

Algorithm Failure Rate

The acceptance criterion was that the maximum tolerable algorithm failure rate of 4.20% fell within or above the 3 sigma bounds estimated from the data. This estimate was based on projection to the currently observed commercial fetal fraction distribution and was computed from the set of negative samples with gestational age≥10 weeks.

Raw and adjusted measurements of algorithm failure rate with confidence bounds are presented in Table 5 below.

TABLE 5 Summary of Algorithm Failure Rate Analysis

Category | Count
Eligible Negatives any GA | 335
Eligible Negatives GA ≥ 10 Weeks | 279
Called Low Risk | 273
Not Called (Algorithm Limitation) | 6

The observed algorithm failure rate was 2.1% before fetal fraction distribution adjustment. After adjustment the algorithm failure rate in a commercial cohort was projected to be 3.67% with a standard deviation of 2.06% and 3-sigma bounds of 0%-9.86%. Therefore, the acceptance criteria for algorithm failure rate were met.
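The arithmetic behind the observed failure rate and its acceptance check can be sketched in Python. The function names are hypothetical. The observed rate is the number of uncalled samples divided by the eligible negatives with gestational age ≥ 10 weeks from Table 5 (6 of 279, about 2.15%), and the criterion is read here as requiring the maximum tolerable rate to be no lower than the clipped lower 3-sigma bound.

```python
def algorithm_failure_rate(not_called: int, eligible: int) -> float:
    """Observed algorithm failure rate in percent."""
    return 100.0 * not_called / eligible


def failure_rate_criterion_met(max_tolerable: float,
                               estimate: float, sigma: float) -> bool:
    """True when the tolerable rate falls within or above the 3-sigma bounds."""
    lower = max(0.0, estimate - 3.0 * sigma)
    return max_tolerable >= lower


# Counts from Table 5; projected rate and standard deviation from the text.
observed_rate = algorithm_failure_rate(6, 279)
criterion_met = failure_rate_criterion_met(4.20, 3.67, 2.06)
```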

Note that fetal fraction adjustment computations for sensitivity and algorithm failure rate followed a method based on dividing the range of fetal fractions into bins, and combining those bins according to their population in commercial data.

Arm 2: Reproducibility

Samples Excluded from all Arm 2 Analyses Due to Quality Control Failures

Samples that failed in all 3 replicates were excluded from the reproducibility analysis. Case 339617 had a failed mother gDNA sample that did not produce a passing result and, as such, was excluded, leaving 89 eligible samples.

Rate of Samples Passing QC

The acceptance criterion requires that fewer than 10% of the samples in each test unit fail QC. This criterion was evaluated on each test unit independently.

Out of 89 cases, there were 87 cases where all three replicates passed QC. Two different cases (339683, 339700) each had a replicate in test unit one which failed the number of reads QC threshold. Thus the highest observed QC failure rate was two samples out of 89, or 2.2%, per test unit. This met the acceptance criterion.

Reproducibility of Clinical Results

The acceptance criterion was that, of the sample triplicates that produced 2 or 3 result calls, not more than one triplicate produced an inconsistent call. At least 50 of the 192 sample triplicates had to be eligible for clinical calls in all three test units for the reproducibility analysis to be performed.

Of the 89 cases where at least 2 replicates passed QC, 22 were ineligible for review at low depth of read due to their fetal fraction. Six cases went through review and were selected (in all replicates) for resequencing at high depth of read due to a suspected aneuploidy call. Two cases had all replicates identified as “uninformative DNA pattern” and were not called. Combined, 30 cases were uncallable in all three replicates, leaving 59 eligible samples that had at least two calls.

Two cases were called in one replicate and selected for resequencing at high depth of read in two replicates. Repeatability analysis was only performed for samples producing results at low depth of read and did not include analysis based on reflex to high depth of read.

Three cases each had one replicate with a QC failure or reflex request and the calls were consistent in the remaining two replicates. 54 cases had calls on all three replicates. There were no cases with inconsistent calls. Therefore, the acceptance criteria were satisfied.

Conclusions

All acceptance criteria were met by the test method. In other words, the test method was able to effectively determine the presence or absence of aneuploidy of chromosomes of interest in test samples even when scrutinized against commercial test performance.

All patents, patent applications, and published references cited herein are hereby incorporated by reference in their entirety. While the methods of the present disclosure have been described in connection with the specific embodiments thereof, it will be understood that they are capable of further modification. Furthermore, this application is intended to cover any variations, uses, or adaptations of the methods of the present disclosure, including such departures from the present disclosure as come within known or customary practice in the art to which the methods of the present disclosure pertain, and as fall within the scope of the appended claims.

Claims

1. A method for detecting a presence or absence of aneuploidy of a chromosome or chromosome segment of interest in a test sample, comprising:

obtaining genetic data for the chromosome or chromosome segment of interest from each sample in a set of samples comprising the test sample, wherein the genetic data is obtained from a parallel analysis of the samples, wherein the genetic data of the test sample are obtained by isolating a mixture of fetal cell-free genomic DNA and maternal cell-free genomic DNA from the test sample which is a blood sample of a pregnant woman, and amplifying and sequencing the mixture of fetal cell-free genomic DNA and maternal cell-free genomic DNA together;

determining whether aneuploidy is present in the test sample by a first method comprising: determining a depth of reads or a proportion of reads that map to the chromosome or chromosome segment of interest; calculating a z-score for the depth of reads or the proportion of reads that map to the chromosome or chromosome segment of interest; and determining whether the test sample is aneuploid at the chromosome or chromosome segment of interest based on the z-score, thereby providing a first result; and

determining whether aneuploidy is present in the test sample by a second method comprising: creating a plurality of ploidy hypotheses wherein each ploidy hypothesis is associated with a specific copy number for the chromosome or chromosome segment of interest, determining a ploidy probability value for each ploidy hypothesis, wherein the ploidy probability value indicates the likelihood that the test sample has the specific copy number for the chromosome or chromosome segment of interest that is associated with the ploidy hypothesis, and determining which ploidy hypothesis is most likely to be correct by selecting the ploidy hypothesis with the maximum likelihood, thereby providing a second result,

wherein aneuploidy is detected by considering the first result and the second result.

2. The method according to claim 1, wherein the genetic data comprises quantitative allelic data from a plurality of polymorphic loci in the set of loci, wherein each of the ploidy hypotheses specifies an expected distribution of quantitative allelic data at the plurality of polymorphic loci, and wherein the ploidy probability values are determined by calculating, for each of the ploidy hypotheses, the fit between the expected genetic data and the obtained genetic data.

3. The method according to claim 1, wherein the genetic data comprises quantitative non-allelic data from a plurality of polymorphic loci in the set of loci, and wherein each of the ploidy hypotheses specifies an expected mean value of quantitative non-allelic data at the plurality of polymorphic loci, and wherein the ploidy probability values are determined by calculating, for each of the ploidy hypotheses, the fit between the expected genetic data and the obtained genetic data.

4. The method according to claim 1, wherein the first result is determined by calculating a likelihood based on the z-score.

5. The method according to claim 4, wherein aneuploidy is detected by combining the aneuploidy likelihoods from the first method and the second method using the following formula:

Combined likelihood=R1R2/[R1R2+(1−R1)(1−R2)].
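As a non-limiting illustration, the combination formula can be sketched in Python. The function name `combined_likelihood`, the input validation, and the handling of a zero denominator are assumptions of this sketch; R1 and R2 are taken to be the aneuploidy likelihoods from the first method and the second method, each between 0 and 1.

```python
def combined_likelihood(r1: float, r2: float) -> float:
    """Combine two aneuploidy likelihoods: R1R2 / [R1R2 + (1-R1)(1-R2)]."""
    for r in (r1, r2):
        if not 0.0 <= r <= 1.0:
            raise ValueError("likelihoods must lie between 0 and 1")
    numerator = r1 * r2
    denominator = numerator + (1.0 - r1) * (1.0 - r2)
    # The formula is undefined when one method reports 1 and the other 0.
    if denominator == 0.0:
        raise ValueError("combined likelihood is undefined for these inputs")
    return numerator / denominator


# Hypothetical likelihoods from the two methods (not study data).
example = combined_likelihood(0.9, 0.8)
```

Under this formula an uninformative result from one method (a likelihood of 0.5) leaves the other method's likelihood unchanged, while two concordant results reinforce each other.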

6. The method according to claim 1, wherein the first result is determined by determining whether the z-score for the test sample is above a threshold value.

7. The method according to claim 1, wherein the second method comprises a quantitative allelic method.

8. The method according to claim 7, wherein the quantitative allelic method is a het rate method.

9. The method according to claim 8, wherein the het rate method is based on analysis of observed allele ratios at each SNP using a joint distribution model.

10. The method according to claim 1, wherein the second method comprises a quantitative non-allelic method.

11. The method according to claim 10, wherein the quantitative non-allelic method is a QMM method.

12. The method according to claim 11, wherein the QMM method is based on analysis of the number of sequencing reads at each SNP.

13. The method according to claim 1, wherein the second method comprises both a quantitative allelic method and a quantitative non-allelic method.

14. A method for determining a presence or absence of a fetal aneuploidy in a fetus for each of a plurality of maternal blood samples obtained from a plurality of different pregnant women, said maternal blood samples comprising fetal and maternal cell-free genomic DNA, said method comprising:

determining a number of enumerated sequence reads corresponding to a chromosome or chromosome segment of interest for each of the plurality of samples;

determining a reference value of enumerated sequence reads from a diploid subset of between 1 and 50 samples of the plurality of samples or between 1-50% of samples of the plurality of samples having a number of enumerated sequence reads closest to the median number of enumerated sequence reads for the chromosome or chromosome segment of interest for the plurality of maternal blood samples, without using determining sequencing reads for a separate reference chromosome; and

comparing the enumerated sequence reads from each of the other samples of the plurality of samples that are not diploid samples, to the reference value, thereby determining the presence or absence of a fetal aneuploidy in the chromosome or chromosome segment of interest.
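As a non-limiting illustration, the selection of a reference subset closest to the median can be sketched in Python. The function names, the use of the subset mean as the reference value, the use of a simple ratio for the comparison, and the example read counts are all assumptions of this sketch, not requirements of the method.

```python
import statistics


def reference_from_median_subset(read_counts, subset_size):
    """Pick the samples closest to the median count and average them.

    read_counts -- mapping of sample identifier to enumerated read count
                   for the chromosome or chromosome segment of interest
    subset_size -- number of samples to treat as the diploid subset
    """
    median = statistics.median(read_counts.values())
    ranked = sorted(read_counts, key=lambda s: abs(read_counts[s] - median))
    subset = ranked[:subset_size]
    # Assumption: the reference value is the mean count of the subset.
    reference = statistics.mean(read_counts[s] for s in subset)
    return reference, subset


def compare_to_reference(read_counts, reference, subset):
    """Ratio of each remaining sample's read count to the reference value."""
    return {s: c / reference
            for s, c in read_counts.items() if s not in subset}


# Hypothetical read counts (not study data).
counts = {"s1": 1000, "s2": 1010, "s3": 990, "s4": 1005, "s5": 1100}
reference, subset = reference_from_median_subset(counts, 3)
ratios = compare_to_reference(counts, reference, subset)
```

No separate reference chromosome is consulted in this sketch; the reference value comes only from counts on the chromosome or chromosome segment of interest across the batch.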

15. The method according to claim 14, further comprising before the determining the number of enumerated sequence reads:

obtaining a fetal and maternal cell-free genomic DNA sample from each of the plurality of maternal blood samples;

generating a library derived from each fetal and maternal cell-free genomic DNA sample,

performing massively parallel sequencing of polynucleotide sequences of the library from the chromosome or chromosome segment of interest; and

enumerating sequence reads corresponding to fetal and maternal polynucleotide sequences selected from the chromosome or chromosome segment of interest.

16. The method according to claim 14, wherein the reference value of enumerated sequence reads is determined from a diploid subset of between 10 and 40 samples closest to the median.

17. The method according to claim 14, wherein the reference value of enumerated sequence reads is determined from a diploid subset of between 15 and 40 samples closest to the median.

18. The method according to claim 14, wherein each library of enriched and indexed fetal and maternal polynucleotide sequences includes an indexing nucleotide sequence which identifies a maternal blood sample of the plurality of maternal blood samples and pooling the libraries generated to produce a pool of enriched and indexed fetal and maternal non-random polynucleotide sequences.

19. The method according to claim 14, wherein said plurality of polynucleotide sequences comprises at least 100 different non-random polynucleotide sequences, wherein each of said plurality of non-random polynucleotide sequences is from 10 to 1000 nucleotide bases in length.

20. The method according to claim 14, wherein the method further comprises selectively enriching a plurality of non-random polynucleotide sequences of each fetal and maternal cell-free genomic DNA sample.