METHODS AND COMPOSITIONS FOR PREDICTING AND/OR MONITORING CARDIOVASCULAR DISEASE AND INTERVENTIONS THEREFOR
This document describes methods and compositions for predicting cardiovascular disease (CVD). Specifically, this document describes methods and compositions for determining the methylation status of at least one CpG locus and the sequence of at least one single nucleotide polymorphism (SNP) that are predictive for the detection of CVD or for estimation of survival from CVD.
This disclosure generally relates to methods and compositions related to predicting cardiovascular disease (CVD) in an individual.
BACKGROUND
Coronary heart disease (CHD), a form of cardiovascular disease (CVD), is the most common type of heart disease and was responsible for over 360,000 deaths in the United States in 2017. In order to decrease this toll, a number of risk estimators and detection methods have been developed to better identify those with or at risk for CVD, including CHD. Beginning with the Framingham Risk Score (FRS) and, more recently, the ASCVD Pooled Cohort Equation (PCE), these tools capture variance in key physiological parameters, such as serum lipid levels, known to be associated with risk for CVD, including CHD. Similarly, for detection, methods such as stress echocardiogram and Coronary Computed Tomography Angiography (CCTA) are used.
Despite the magnitude of these efforts, current risk estimators and detection tests often lack sensitivity and specificity, and are often less accessible because of cost and the need to schedule an in-person clinical visit that may take weeks. Furthermore, some of the current risk estimators and detection methods, such as catheterization, may have severe side effects such as stroke and heart attack. As a result, there is a need for alternative stratification, detection and management approaches for CVD that have minimal risks, are scalable, and provide actionable insights.
SUMMARY
Methods and compositions for predicting the presence and/or severity (e.g., level of obstruction) of cardiovascular disease (CVD) are provided, and methods and compositions for managing, monitoring, and/or treating CVD are provided. For example, methods and compositions for predicting coronary heart disease (CHD) are described herein. The general principles apply to windows of incidence (e.g., one-month, six-month, two-year, or ten-year) as well as the incidence, prevalence, or severity of other types of CVD including, without limitation, CHD, stroke, arrhythmia, cardiac arrest, and congestive heart failure. The same general principles also apply to survival of CVD or CVD events as well as to the management of CVD or CVD events, including but not limited to identifying, customizing, and optimizing lifestyle (e.g., exercise, diet) and/or therapeutic (e.g., the particular drug or combination thereof) and/or medical intervention(s) (e.g., stent placement, angioplasty). The same general principles also apply to the monitoring of CVD or CVD events, the severity of CVD or CVD events, and/or the response to lifestyle, therapeutic and/or medical intervention(s). Specifically, methods and compositions that include determining the methylation status of at least one CpG locus and/or at least one single nucleotide polymorphism (SNP) are described.
In one aspect, kits for determining methylation status of at least one CpG dinucleotide and/or a genotype of at least one single-nucleotide polymorphism (SNP) are provided. Such kits typically include at least one first nucleic acid primer at least 8 nucleotides in length that is complementary to a bisulfite-converted nucleic acid sequence comprising a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or at a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the at least one first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide, and/or at least one second nucleic acid primer at least 8 nucleotides in length that is complementary to a DNA sequence of a first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or a second SNP in linkage disequilibrium with the first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433, wherein the linkage disequilibrium has a value of R>0.3.
In some embodiments, the at least one first nucleic acid primer detects the unmethylated CpG dinucleotide. In some embodiments, the at least one first nucleic acid primer detects the methylated CpG dinucleotide.
In some embodiments, the kits described herein further include at least a third nucleic acid primer at least 8 nucleotides in length that is complementary to a nucleic acid sequence upstream of the CpG dinucleotide. In some embodiments, the kits further include at least a third nucleic acid primer at least 8 nucleotides in length that is complementary to a nucleic acid sequence downstream of the CpG dinucleotide.
In some embodiments, the at least one first nucleic acid primer comprises one or more nucleotide analogs. In some embodiments, the at least one first nucleic acid primer comprises one or more synthetic or non-natural nucleotides.
In some embodiments, the kits described herein further include a solid substrate to which the at least one first nucleic acid primer is bound. In some embodiments, the substrate is a polymer, glass, semiconductor, paper, metal, gel or hydrogel. In some embodiments, the solid substrate is a microarray or microfluidics card.
In some embodiments, the kits described herein further include a detectable label.
In another aspect, methods of determining the presence of biomarkers associated with predicting, treating, managing and/or monitoring CVD in a biological sample from a patient are provided. Such methods typically include (a) providing a first portion of the biological sample and a second portion of the biological sample, wherein the nucleic acid from at least the first portion is bisulfite converted; (b) contacting the first portion of the biological sample with a first oligonucleotide primer at least 8 nucleotides in length that is complementary to a sequence that comprises a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, or a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide; and (c) contacting the second portion of the biological sample with a nucleic acid primer at least 8 nucleotides in length that is complementary to a DNA sequence or a bisulfite-converted DNA sequence of a first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or a second SNP in linkage disequilibrium with the first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433, wherein the linkage disequilibrium has a value of R>0.3.
Generally, the percentage of methylation of the CpG dinucleotide at the GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, and the identity of the nucleotide at the first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or the second SNP in linkage disequilibrium with the first SNP are biomarkers associated with detecting CVD or estimating survival from CVD.
In some embodiments, the biological sample is blood or saliva.
In some embodiments, the at least one first nucleic acid primer detects the unmethylated CpG dinucleotide. In some embodiments, the at least one first nucleic acid primer detects the methylated CpG dinucleotide.
In some embodiments, the at least one first nucleic acid primer comprises one or more nucleotide analogs. In some embodiments, the at least one first nucleic acid primer comprises one or more synthetic or non-natural nucleotides.
In some embodiments, the window of incidence for detection, severity, managing and/or monitoring is three years, five years, or ten years.
In still a further aspect, methods of determining the presence of biomarkers in a biological sample from a subject, wherein the biomarkers are associated with detecting CVD, determining severity of CVD, estimating survival from CVD, identifying, customizing, and/or optimizing intervention(s) for CVD, managing CVD and/or monitoring CVD, are provided. Such methods typically include (a) isolating a nucleic acid sample from the subject sample; (b) performing a genotyping assay on a first portion of the nucleic acid sample to detect the presence of at least one SNP, wherein the at least one SNP is a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C and/or is a second SNP in linkage disequilibrium (R>0.3) with a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C to obtain genotype data; and/or (c) bisulfite converting the nucleic acid in a second portion of the nucleic acid and performing methylation assessment on the second portion of the nucleic acid sample to detect methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data; and (d) entering the genotype data from step (b) and/or methylation data from step (c) into an algorithm that accounts for at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect, wherein the algorithm is a machine learning algorithm capable of accounting for linear and non-linear effects.
In some embodiments, the at least one interaction effect is selected from the group consisting of a gene-environment interaction (SNP×CpG) effect, a gene-gene interaction (SNP×SNP) effect, and an environment-environment interaction (CpG×CpG) effect. In some embodiments, the at least one interaction effect is a gene-environment interaction effect (SNP×CpG) between a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A or a CpG site that is collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and a SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C or a SNP within moderate linkage disequilibrium (R>0.3) from a SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C. In some embodiments, the at least one interaction effect is an environment-environment interaction effect (CpG×CpG) between at least two CpG sites selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A.
In some embodiments, one or both of the at least two CpG sites are collinear (R>0.3) with one or both of the at least two CpG sites selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A. In some embodiments, the at least one interaction effect is a gene-gene interaction effect (SNP×SNP) between at least two SNPs selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C. In some embodiments, one or both of the at least two SNPs are collinear (R>0.3) with one or both of the at least two SNPs selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C.
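By way of illustration only, the main effects and the three kinds of pairwise interaction effects described above can be sketched as a feature-construction step that precedes any machine learning algorithm. The genotype coding (minor-allele counts of 0, 1, or 2), the methylation scale (a fraction between 0 and 1), the function name `build_features`, and the example values are all assumptions made for this sketch; the disclosure does not prescribe a particular encoding or model.

```python
from itertools import combinations


def build_features(snps, cpgs):
    """Return main-effect and pairwise interaction features as a dict.

    snps: dict of SNP id -> minor-allele count (0, 1, or 2). Assumed coding.
    cpgs: dict of CpG id -> methylation fraction in [0, 1]. Assumed scale.
    """
    features = {}

    # Main effects: each SNP and each CpG enters on its own.
    features.update(snps)
    features.update(cpgs)

    # Gene-environment interaction (SNP x CpG): product of each pair.
    for snp_id, genotype in snps.items():
        for cpg_id, methylation in cpgs.items():
            features[f"{snp_id}*{cpg_id}"] = genotype * methylation

    # Gene-gene interaction (SNP x SNP): each unordered pair once.
    for (id1, g1), (id2, g2) in combinations(sorted(snps.items()), 2):
        features[f"{id1}*{id2}"] = g1 * g2

    # Environment-environment interaction (CpG x CpG): each unordered pair once.
    for (id1, m1), (id2, m2) in combinations(sorted(cpgs.items()), 2):
        features[f"{id1}*{id2}"] = m1 * m2

    return features


# Hypothetical input values, chosen only to show the shape of the output.
example = build_features(
    {"rs1333048": 2, "rs710987": 1},
    {"cg04988978": 0.5, "cg21161138": 0.25},
)
```

Explicit product terms of this kind let a linear model represent interactions; a non-linear learner (for example, a tree ensemble) may capture them from the main-effect columns alone, so the explicit terms are one design option rather than a requirement.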
In some embodiments, the biological sample is a saliva sample.
In another aspect, systems for determining methylation status of at least one CpG dinucleotide and a genotype of at least one single-nucleotide polymorphism (SNP) are provided. Such systems typically include: a nucleic acid isolation module configured to isolate a nucleic acid sample from a subject sample; a genotyping assay module configured to perform a genotyping assay on a first portion of the nucleic acid sample to detect the presence of at least one SNP, wherein the at least one SNP is a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C and/or is a second SNP in linkage disequilibrium (R>0.3) with a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C to obtain genotype data; a methylation assay module configured to bisulfite convert the nucleic acid in a second portion of the nucleic acid and perform a methylation assessment on a second portion of the nucleic acid sample to detect methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data; and an identification system configured to account for at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect based on the genotype data and/or methylation data.
In some embodiments, such systems further include an output module configured to provide an output based on an identification by the identification system, wherein the identification accounts for at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect based on the genotype data and/or methylation data.
In some embodiments, the algorithm is a machine learning algorithm capable of accounting for linear and/or non-linear effects.
In some embodiments, dimensionality reductions (e.g., principal component analysis, partial least squares regression, etc.) can be used.
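As a non-limiting illustration of dimensionality reduction, the following sketch extracts the first principal component of a samples-by-features matrix. It uses power iteration on the sample covariance matrix, which is a simplification chosen here to keep the example dependency-free; it is not asserted to be the procedure used in any embodiment. The function name `first_principal_component` and the input layout are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loading vector, per-sample scores) for the first principal component.

    rows: list of equal-length lists of floats (samples x features).
    Power iteration converges to the leading eigenvector of the covariance
    matrix provided the starting vector is not orthogonal to it.
    """
    n, p = len(rows), len(rows[0])

    # Center each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]

    # Repeatedly multiply by the covariance matrix and renormalize.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: zero variance or an orthogonal start vector
        v = [x / norm for x in w]

    # Project each centered sample onto the loading vector.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores
```

The per-sample scores can then replace a block of correlated inputs (for example, several collinear CpG sites) with a single derived feature.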
In yet another aspect, non-transitory computer-readable media storing instructions executable by a processing device to perform operations are provided. Such operations typically include accounting for at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect based on genotype data and/or methylation data, wherein: (i) the genotype data is based on a genotyping assay on a first portion of a nucleic acid sample isolated from a subject sample to detect the presence of at least one SNP, wherein the at least one SNP is a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C and/or is a second SNP in linkage disequilibrium (R>0.3) with a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C to obtain the genotype data; and (ii) the methylation data is based on a methylation assay on a bisulfite converted nucleic acid in a second portion of the nucleic acid sample to detect methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data.
In some embodiments, the operations further include providing an output based on the accounting. Representative outputs, without limitation, include one or more of storing a report based on the accounting to another non-transitory computer-readable medium, modifying a display based on the accounting, triggering an audible alert based on the accounting, triggering a haptic or vibratory alert based on the accounting, triggering the printing of a report based on the accounting, or triggering the delivery of a therapeutic based on the accounting.
In one aspect, kits for determining methylation status of at least one CpG dinucleotide are provided. Such kits typically include at least one first nucleic acid primer at least 8 nucleotides in length that is complementary to a bisulfite-converted nucleic acid sequence comprising a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or at a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the at least one first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide.
In another aspect, methods of determining the presence of biomarkers in a biological sample from a subject, wherein the biomarkers are associated with detecting CVD, determining severity of CVD, estimating survival from CVD, identifying, customizing, and/or optimizing intervention(s) for CVD, managing CVD and/or monitoring CVD are provided. Such methods typically include (a) providing a biological sample from the subject at risk for or having CVD or CVD events, wherein nucleic acids from at least a portion of the biological sample are bisulfite converted; and (b) contacting the bisulfite converted nucleic acids with a first oligonucleotide primer at least 8 nucleotides in length that is complementary to a sequence that comprises a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, or a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide, wherein the percentage of methylation of the CpG dinucleotide at the GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 is associated with estimating survival of the subject.
In another aspect, methods of determining the presence of biomarkers in a biological sample from a subject, wherein the biomarkers are associated with detecting CVD, determining severity of CVD, estimating survival from CVD, identifying, customizing, and/or optimizing intervention(s) for CVD, managing CVD and/or monitoring CVD are provided. Such methods typically include (a) isolating nucleic acid sample from the subject sample; (b) bisulfite converting at least a portion of the nucleic acid and performing methylation assessment on the bisulfite converted nucleic acid to determine the methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data; and (c) entering the methylation data from step (b) into an algorithm that accounts for at least one CpG main effect, wherein the algorithm is a machine learning algorithm capable of accounting for linear and non-linear effects.
In still another aspect, systems for determining methylation status of at least one CpG dinucleotide are provided. Such systems typically include a nucleic acid isolation module configured to isolate a nucleic acid sample from a subject sample; a methylation assay module configured to bisulfite convert the nucleic acid in at least a portion of the nucleic acid and perform a methylation assessment on the bisulfite converted nucleic acid to determine methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data; and an identification system configured to account for at least one CpG main effect based on the methylation data.
In yet another aspect, a non-transitory computer-readable medium storing instructions executable by a processing device to perform operations is provided. Such operations typically include accounting for at least one CpG main effect based on methylation data, wherein: the methylation data is based on a methylation assay on a bisulfite converted nucleic acid in at least a portion of a nucleic acid sample to detect methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data.
The integrated genetic-epigenetic model described herein provides several advantages and benefits. For example:
- Earlier Detection: PrecisionCHD can detect molecular changes that may or may not precede the development of clinical symptoms or disease. This means patients can be identified with coronary heart disease before or after they develop symptoms, allowing for earlier interventions and better outcomes.
- Actionable Clinical Intelligence™: The test may or may not be coupled to a provider-facing Actionable Clinical Intelligence platform (see, for example, U.S. Application No. 63/488,463, incorporated herein by reference) that maps each patient's molecular markers to key drivers of coronary heart disease, allowing for tailored recommendations for lifestyle modifications, medical interventions, estimating effectiveness of lifestyle modifications or medical interventions, monitoring and secondary testing.
- Personalized Intervention Selection: The test can be used to select one or more interventions such as lifestyle modification, therapeutic interventions and medical interventions for each patient or a group of patients at one or more time points.
- Personalized Intervention Optimization: The test can be used to optimize one or more interventions such as lifestyle modification, therapeutic interventions and medical interventions for each patient or a group of patients at one or more time points.
- Personalized Intervention Evaluation: The test can be used to assess the effectiveness of interventions such as lifestyle modification, therapeutic interventions, and medical interventions for each patient or a group of patients by continually monitoring CVD, CVD events, or severity of CVD.
- More Comprehensive Assessment: The test evaluates and integrates robust genetic and epigenetic biomarkers simultaneously, providing a more comprehensive assessment of CVD status.
- Discover New Pathways: The approach can be used to discover new, previously unknown biological pathways for risk assessment, detection, intervention (e.g., lifestyle, therapeutic, medical), management and monitoring of CVD. The pathway(s) and biomarker(s) also can be used for the discovery, development and validation of novel biopharmaceuticals for the assessment and management of cardiovascular disease. The approach described herein can be used to identify new biomarker(s) (e.g., methylation, SNP, protein, etc.) for new drug development or the ability for targeted treatment such as gene editing. The approach described herein also can be used to discover biomarker(s) to select the most effective drug for a particular individual (e.g., statin vs. beta blocker), the most effective drug type (e.g., hydrophilic statin vs. lipophilic statin), changes in lifestyle, medical interventions, or combinations thereof. In addition, the approach described herein can be used to optimize the use of a therapeutic (e.g., dosing, regimen, drug combinations) or to identify lifestyle changes that would have the most effect.
- Non-Invasive: The test only requires a simple biomaterial collection (e.g., blood or saliva sample), making it a non-invasive and convenient alternative to more invasive diagnostic tests such as angiograms.
- Accessible: The test can be administered remotely via a lancet-based sample collection kit that can be sent to the patient's home upon test order, thereby increasing and democratizing access to CVD diagnostic tests. Alternatively, biomaterial can be collected in provider settings via vacutainer-based sample collection. Biomaterial in the form of a saliva sample can be collected remotely or in provider settings.
- Cost-effective and Timely: PrecisionCHD provides clinicians with a timely and cost-effective coronary heart disease test. PrecisionCHD is a fraction of the cost of other heart disease tests.
- Survival Estimates: PrecisionCHD or PrecisionCHD-Epi can provide survival estimates for those individuals that have already been determined to have CVD or are considered at risk of developing CVD.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the methods and compositions of matter belong. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the methods and compositions of matter, suitable methods and materials are described below. In addition, the materials, methods, and examples are illustrative only and not intended to be limited to predicting incident CHD. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety.
Recent prediction strategies have taken advantage of the rapid advancements in assessing genome-wide genetic or transcriptional variation. Though each of these approaches has had some success, their clinical impact has been limited. In particular, those relying only on genetic information have a clear ceiling in predictive capacity, are potentially sensitive to ethnic stratification, and, because genotype is static, cannot be used to monitor changes in disease status.
Recent advances in genome-wide epigenetic profiling techniques have raised the possibility that DNA methylation assessments of peripheral blood DNA may serve as a mechanism for more accurate prediction of cardiovascular disease or mortality. Prediction models that only account for epigenetic signatures, however, fail to account for confounding genetic variation, which affects the vast majority of the environmentally responsive methylome. This may result in models that lack robustness with respect to generalizability, especially in different ethnic groups.
As a result, we have developed a highly sensitive, clinically implementable integrated genetic-epigenetic tool capable of identifying those at risk of or with cardiovascular disease (e.g., having a heart attack or sudden cardiac death). As shown herein, the methylation status of one or more particular CpG dinucleotides in combination with the genotype at one or more particular loci (e.g., CH3×SNP) can be used to predict cardiovascular disease (CVD) including coronary heart disease (CHD). We also have developed a highly sensitive tool capable of estimating the survivability of those who are at risk of developing CVD, those who are at risk of having a CVD event, or those who have already been identified as having CVD.
As described herein, biomarkers can be used in the diagnosis and prognosis of cardiovascular diseases and events. The terms “marker” and “biomarker” can be used interchangeably. As used herein, a biomarker generally refers to a measurable or detectable biological moiety (e.g., the presence or amount of a protein, a genetic (e.g., polymorphism), epigenetic (e.g., methylation), and/or histological component). As described in more detail below, the biomarkers used herein typically are associated with cardiovascular disease.
As used herein, “patient,” “subject” and “individual” may be used interchangeably.
DNA Methylation
DNA does not exist as naked molecules in the cell. For example, DNA is associated with proteins called histones to form a complex substance known as chromatin. Chemical modifications of the DNA or the histones alter the structure of the chromatin without changing the nucleotide sequence of the DNA. Such modifications are described as “epigenetic” modifications of the DNA. Changes to the structure of the chromatin can have a profound influence on gene expression. If the chromatin is condensed, factors involved in gene expression may not have access to the DNA, and the genes will be switched off. Conversely, if the chromatin is “open,” the genes can be switched on. Some important forms of epigenetic modification are DNA methylation and histone deacetylation.
DNA methylation is a chemical modification of the DNA molecule itself and is carried out by an enzyme called DNA methyltransferase. Methylation can directly switch off gene expression by preventing transcription factors binding to promoters. A more general effect is the attraction of methyl-binding domain (MBD) proteins. These are associated with further enzymes called histone deacetylases (HDACs), which function to chemically modify histones and change chromatin structure. Chromatin-containing acetylated histones are open and accessible to transcription factors, and the genes are potentially active. Histone deacetylation causes the condensation of chromatin, making it inaccessible to transcription factors and causing the silencing of genes.
CpG islands are short stretches of DNA in which the frequency of the CpG sequence is higher than in other regions. The “p” in the term CpG indicates that cytosine (“C”) and guanine (“G”) are connected by a phosphodiester bond. CpG islands are often located around promoters of housekeeping genes and many regulated genes. At these locations, the CG sequences in active genes are often not methylated. By contrast, the CG sequences in inactive genes are usually methylated to suppress their expression.
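For illustration, the elevated CpG frequency that characterizes a CpG island can be quantified as GC content together with the ratio of observed to expected CpG dinucleotides. The sketch below assumes the commonly cited Gardiner-Garden and Frommer formulation (expected CpG count = number of C × number of G / sequence length); that formulation, the function name `cpg_stats`, and the example sequence are assumptions not drawn from this disclosure.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence.

    Observed/expected = (count of "CG" * length) / (count of C * count of G).
    Returns 0.0 for the ratio when the sequence has no C or no G.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact

    gc_content = (c_count + g_count) / n if n else 0.0
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_content, obs_exp


# Hypothetical short sequence; real island calls use windows of 200 bp or more.
gc, ratio = cpg_stats("CGCGCGAT")
```

Under the commonly cited criteria, a window of at least 200 base pairs with GC content above 50% and an observed/expected ratio above 0.6 is treated as a candidate CpG island; other thresholds are in use.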
As used herein, the term “methylation status” means the determination whether a certain target DNA, such as a CpG dinucleotide, is methylated or is unmethylated. As used herein, the term “CpG dinucleotide repeat motif” means a series of two or more CpG dinucleotides positioned in a DNA sequence.
About 56% of human genes and 47% of mouse genes are associated with CpG islands. Often, CpG islands overlap the promoter and extend about 1000 base pairs downstream into the transcription unit. Identification of potential CpG islands during sequence analysis helps to define the extreme 5′ ends of genes, something that is notoriously difficult with cDNA-based approaches. The methylation of a CpG island can be determined by a skilled artisan using any method suitable to determine such methylation. For example, the skilled artisan can use a bisulfite reaction-based method for determining such methylation.
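The principle behind a bisulfite reaction-based method can be sketched in silico: bisulfite treatment deaminates unmethylated cytosine to uracil, which is read as thymine after amplification, whereas methylated cytosine is protected and remains cytosine. The function name `bisulfite_convert`, the single-strand simplification, and the example sequence are assumptions made for this sketch.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite conversion followed by amplification on one strand.

    seq: DNA sequence written 5' to 3'.
    methylated_positions: 0-based indices of cytosines that are methylated.
    Unmethylated C is reported as T; methylated C and all other bases are kept.
    """
    converted = []
    for index, base in enumerate(seq.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("T")  # unmethylated cytosine reads as thymine
        else:
            converted.append(base)  # methylated cytosine is protected
    return "".join(converted)


# Hypothetical sequence with a CpG at index 1; methylated versus unmethylated.
methylated_read = bisulfite_convert("ACGTCCGA", {1})
unmethylated_read = bisulfite_convert("ACGTCCGA")
```

Because the two converted sequences differ at the CpG position, a primer complementary to one of them can distinguish a methylated CpG dinucleotide from an unmethylated one.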
The present disclosure provides methods to determine the nucleic acid methylation of one or more loci in a subject in order to identify subjects having CVD.
Linkage refers to the phenomenon that DNA sequences which are close together in the genome have a tendency to be inherited together. Two sequences may be linked because of some selective advantage of co-inheritance. More typically, however, two sequences are co-inherited because of the relative infrequency with which meiotic recombination events occur within the region between the two sequences. The co-inherited sequences are said to be in “linkage disequilibrium” with one another because, in a given population, they tend to either both occur together or else not occur at all in any particular member of the population. Indeed, where multiple sequences in a given chromosomal region are found to be in linkage disequilibrium with one another, they define a quasi-stable “haplotype.” In contrast, recombination events occurring between two loci cause them to become separated onto distinct homologous chromosomes. If meiotic recombination between two physically linked sequences occurs frequently enough, the two sequences will appear to segregate independently and are said to be in linkage equilibrium.
It would be understood that linkage disequilibrium can be quantitated (using, for example, the Pearson correlation (R) or co-inheritance of alleles (D′)). For example, a low level of linkage can be reflected in a correlation (e.g., R value) of about 0.1 or less, a moderate level of linkage is reflected in an R value of about 0.3, while a high level of linkage is reflected in an R value of 0.5 or greater. It also would be understood that, when referring to methylation (i.e., CpG sites), collinearity (with an R value) is used as a determination of the linear strength of the association between two CpGs (e.g., a low level of collinearity can be reflected by an R value of about 0.1 or less; a moderate level of collinearity can be reflected by an R value of about 0.3; and a high level of collinearity can be reflected by an R value of about 0.5 or greater).
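By way of non-limiting illustration, the two measures named above can be computed for a pair of bi-allelic loci from haplotype counts, as sketched below in Python. The function name and the choice of haplotype counts as input are assumptions made for this example; the formulas are the standard population-genetic definitions of the correlation R and the normalized coefficient D′.

```python
import math


def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (R, D') for two bi-allelic loci from the four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at locus 2

    # D: departure of the AB haplotype frequency from independence.
    d = p_AB - p_A * p_B

    # R: Pearson correlation between the allele indicators at the two loci.
    denom = math.sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
    r = d / denom if denom else 0.0

    # D': |D| scaled by its maximum attainable value for these frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max else 0.0
    return r, d_prime
```

The same Pearson correlation, applied to methylation values at two CpG sites across samples rather than to allele indicators, gives the collinearity measure referred to above.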
In particular, in certain embodiments of the disclosure, the methods may be practiced as follows. A sample, such as a blood sample, is taken from a subject. In certain embodiments, a single cell type, e.g., lymphocytes, basophils, or monocytes, may be isolated from the blood for further testing. The DNA is harvested from the sample and examined to determine the methylation of one or more loci. For example, the DNA of interest can be treated with bisulfite to deaminate unmethylated cytosine residues to uracil. Since uracil base pairs with adenine, thymidines are incorporated into subsequent DNA strands in the place of unmethylated cytosine residues during subsequent PCR amplifications. Next, the target sequence is amplified by PCR, and probed with a locus-specific probe. Depending on the particular sequence of the probe used, only the methylated or unmethylated DNA will bind to the probe.
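By way of non-limiting illustration, the effect of bisulfite treatment followed by PCR on a single strand can be modeled in Python as sketched below. The function names and the representation of methylated positions as a set of zero-based indices are assumptions made for this example; the sketch models only the sequence change and not the chemistry or the probe hybridization.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on one strand.

    Unmethylated C is read as T after amplification; methylated C stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, converted):
    """Return {position: True if methylated, False if unmethylated} for each C."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference.upper(), converted.upper())):
        if ref != "C":
            continue
        if obs == "C":
            calls[i] = True      # protected from conversion, so methylated
        elif obs == "T":
            calls[i] = False     # converted, so unmethylated
    return calls
```

For example, converting the hypothetical sequence ACGTCGA with only position 1 methylated gives ACGTTGA, and comparing the two recovers the methylation call at each cytosine.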
Methods of determining the subject nucleic acid profile are well known to a skilled artisan and include any of the well-known detection methods. Various PCR methods are described, for example, in PCR Primer: A Laboratory Manual, Dieffenbach & Dveksler, Eds., Cold Spring Harbor Laboratory Press, 1995. Other methods include, but are not limited to, nucleic acid quantification, restriction enzyme digestion, DNA sequencing, hybridization technologies, such as Southern Blotting, amplification methods such as Ligase Chain Reaction (LCR), Nucleic Acid Sequence Based Amplification (NASBA), Self-sustained Sequence Replication (SSR or 3SR), Strand Displacement Amplification (SDA), and Transcription Mediated Amplification (TMA), Quantitative PCR (qPCR), digital PCR (dPCR) (e.g., digital droplet PCR (ddPCR)) or other DNA analyses, as well as RT-PCR, in vitro translation, Northern blotting, and other RNA analyses. In another embodiment, hybridization on a microarray is used.
Single Nucleotide Polymorphism (SNP)
Traditional methods for the screening of heritable diseases have depended on either the identification of abnormal gene products (e.g., sickle cell anemia) or an abnormal phenotype (e.g., mental retardation). With the development of simple and inexpensive genetic screening methodology, it is now possible to identify polymorphisms that indicate a propensity to develop disease, even when the disease is of polygenic origin.
Single nucleotide polymorphism (SNP) genotyping measures genetic variations of SNPs between members of a species. A SNP is a single base pair change at a specific locus, usually consisting of two alleles (where the rare allele frequency is >1%). SNPs are very common. Because SNPs are conserved during evolution, they have been proposed as markers for use in quantitative trait loci (QTL) analysis and in association studies in place of microsatellites. Many different SNP genotyping methods are known, including hybridization-based methods (such as dynamic allele-specific hybridization, molecular beacons, and SNP microarrays), enzyme-based methods (including restriction fragment length polymorphism, PCR-based methods, flap endonuclease, primer extension, 5′-nuclease, and oligonucleotide ligation assay), other post-amplification methods based on physical properties of DNA (such as single strand conformation polymorphism, temperature gradient gel electrophoresis, denaturing high performance liquid chromatography, high-resolution melting of the entire amplicon, use of DNA mismatch-binding proteins, SNPlex and surveyor nuclease assay), and sequencing (such as “next generation” sequencing). See, e.g., U.S. Pat. No. 7,972,779.
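By way of non-limiting illustration, the rare allele frequency referred to above can be estimated from a set of diploid genotype calls as sketched below in Python. The function names and the representation of each genotype as a two-letter string are assumptions made for this example; the 1% threshold is the one stated above.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (rarest allele, its frequency) from genotype strings such as 'AG'.

    Returns (None, 0.0) when fewer than two distinct alleles are observed.
    """
    counts = Counter(allele for g in genotypes for allele in g.upper())
    total = sum(counts.values())
    if len(counts) < 2 or total == 0:
        return None, 0.0
    allele, n = min(counts.items(), key=lambda item: item[1])
    return allele, n / total


def meets_snp_threshold(genotypes, threshold=0.01):
    """True when the rare allele frequency exceeds the threshold (default 1%)."""
    _, maf = minor_allele_frequency(genotypes)
    return maf > threshold
```

With five hypothetical genotype calls AA, AG, AA, AA and GG, the G allele accounts for 3 of 10 chromosomes, so the sketch reports G at a frequency of 0.3.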
A plurality of alleles at a locus can arise from one or more polymorphisms in a region of a gene that encodes a polypeptide or in a regulatory control sequence that affects expression of the polypeptide, such as a promoter or polyadenylation sequence. Alternatively, alleles can arise from one or more polymorphisms at a locus distal to a gene that encodes a polypeptide or in a regulatory control sequence. A polymorphism can affect a polypeptide at a transcriptional or a translational level (e.g., a polypeptide's transcription rate, translation rate, degradation rate, and/or activity). Allelic differences can be characterized in a sample from a single subject or from a plurality of subjects using methods that are known to a skilled artisan. Such methods can include, but are not limited to, measuring the potential for a polynucleotide sequence to be expressed and/or measuring an amount of an encoded polypeptide. Methods are available that can detect proteins or nucleic acids directly or indirectly, and assay methods are specifically contemplated to include screening for the presence of particular sequences or structures of nucleic acids or polypeptides using, e.g., any of various known microarray technologies.
It will be fully appreciated by the skilled artisan that the allele need not have previously been shown to have had any link or association with the disorder phenotype. Instead, an allele and a pathogenic environmental risk factor can interact to predict a predisposition to a disorder phenotype even when neither the allele nor the risk factor bears any direct relation to the disorder phenotype.
Genetic screening (also called genotyping or molecular screening) can be broadly defined as testing to determine if a subject has mutations (or alleles or polymorphisms) that either cause a disease state or are “linked” to the mutation causing a disease state. Linkage refers to the phenomenon that DNA sequences which are close together in the genome have a tendency to be inherited together. Two sequences may be linked because of some selective advantage of co-inheritance. More typically, however, two polymorphic sequences are co-inherited because of the relative infrequency with which meiotic recombination events occur within the region between the two polymorphisms. The co-inherited polymorphic alleles are said to be in “linkage disequilibrium” with one another because, in a given population, they tend to either both occur together or else not occur at all in any particular member of the population. Indeed, where multiple polymorphisms in a given chromosomal region are found to be in linkage disequilibrium with one another, they define a quasi-stable genetic “haplotype.” In contrast, recombination events occurring between two polymorphic loci cause them to become separated onto distinct homologous chromosomes. If meiotic recombination between two physically linked polymorphisms occurs frequently enough, the two polymorphisms will appear to segregate independently and are said to be in linkage equilibrium.
It would be understood that linkage disequilibrium can be quantitated (using, for example, the Pearson correlation (R) or co-inheritance of alleles (D′)). For example, a low level of linkage can be reflected in a correlation (e.g., R value) of about 0.1 or less, a moderate level of linkage is reflected in a R value of about 0.3, while a high level of linkage is reflected in a R value of 0.5 or greater.
While the frequency of meiotic recombination between two markers is generally proportional to the physical distance between them on the chromosome, the occurrence of “hot spots” as well as regions of repressed chromosomal recombination can result in discrepancies between the physical and recombinatorial distance between two markers. Thus, in certain chromosomal regions, multiple polymorphic loci spanning a broad chromosomal domain may be in linkage disequilibrium with one another, and thereby define a broad-spanning genetic haplotype. Furthermore, where a disease-causing mutation is found within or in linkage with this haplotype, one or more polymorphic alleles of the haplotype can be used as a diagnostic or prognostic indicator of the likelihood of developing the disease. This association between otherwise benign polymorphisms and a disease-causing polymorphism occurs if the disease mutation arose in the recent past, so that sufficient time has not elapsed for equilibrium to be achieved through recombination events. Therefore, identification of a haplotype that spans or is linked to a disease-causing mutational change serves as a predictive measure of an individual's likelihood of having inherited that disease-causing mutation. Such prognostic or diagnostic procedures can be utilized without necessitating the identification and isolation of the actual disease-causing lesion. This is significant because the precise determination of the molecular defect involved in a disease process can be difficult and laborious, especially in the case of multifactorial diseases.
The statistical correlation between a disorder and a polymorphism does not necessarily indicate that the polymorphism directly causes the disorder. Rather, the correlated polymorphism may be a benign allelic variant which is linked to (i.e., in linkage disequilibrium with) a disorder-causing mutation that has occurred in the recent evolutionary past, so that sufficient time has not elapsed for equilibrium to be achieved through recombination events in the intervening chromosomal segment. Thus, for the purposes of diagnostic and prognostic assays for a particular disease, detection of a polymorphic allele associated with that disease can be utilized without consideration of whether the polymorphism is directly involved in the etiology of the disease. Furthermore, where a given benign polymorphic locus is in linkage disequilibrium with an apparent disease-causing polymorphic locus, still other polymorphic loci which are in linkage disequilibrium with the benign polymorphic locus are also likely to be in linkage disequilibrium with the disease-causing polymorphic locus. Thus, these other polymorphic loci will also be prognostic or diagnostic of the likelihood of having inherited the disease-causing polymorphic locus. A broad-spanning haplotype (describing the typical pattern of co-inheritance of alleles of a set of linked polymorphic markers) can be targeted for diagnostic purposes once an association has been drawn between a particular disease or condition and a corresponding haplotype. Thus, the determination of an individual's likelihood for developing a particular disease or condition can be made by characterizing one or more disease-associated polymorphic alleles (or even one or more disease-associated haplotypes) without necessarily determining or characterizing the causative genetic variation.
Many methods are available for detecting specific alleles at polymorphic loci. Certain methods for detecting a specific polymorphic allele will depend, in part, upon the molecular nature of the polymorphism. For example, the various allelic forms of the polymorphic locus may differ by a single base-pair of the DNA. Such single nucleotide polymorphisms (or SNPs) are major contributors to genetic variation, comprising some 80% of all known polymorphisms, and their density in the genome is estimated to be on average 1 per 1,000 base pairs. SNPs are most frequently bi-allelic, or occurring in only two different forms (although up to four different forms of an SNP, corresponding to the four different nucleotide bases occurring in DNA, are theoretically possible). Nevertheless, SNPs are mutationally more stable than other polymorphisms, making them suitable for association studies in which linkage disequilibrium between markers and an unknown variant is used to map disease-causing mutations. In addition, because SNPs typically have only two alleles, they can be genotyped by a simple plus/minus assay rather than a length measurement, making them more amenable to automation.
In one embodiment, allelic profiling can be accomplished using a nucleic acid microarray. The genetic testing field is rapidly evolving and, as such, the skilled artisan will appreciate that a wide range of profiling tests exist, and will be developed, to determine the allelic profile of individuals in accord with the disclosure.
Nucleic Acids and Polypeptides
As described herein, the methods provided in this disclosure rely upon features contained within the nucleic acid of an individual, subject or patient. The term “nucleic acid” refers to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form, made of monomers (nucleotides) containing a sugar, phosphate and a base that is either a purine or pyrimidine. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues. The terms “nucleic acid,” “nucleic acid molecule,” or “polynucleotide” are used interchangeably and may also be used interchangeably with gene, cDNA, DNA and/or RNA encoded by a gene.
The term “nucleotide sequence” refers to a polymer of DNA or RNA which can be single-stranded or double-stranded, optionally containing synthetic, non-natural or altered nucleotide bases capable of incorporation into DNA or RNA polymers. A DNA molecule or polynucleotide is a polymer of deoxyribonucleotides (A, G, C, and T), and an RNA molecule or polynucleotide is a polymer of ribonucleotides (A, G, C and U).
A “gene,” for the purposes of the present disclosure, includes a DNA region encoding a gene product, as well as all DNA regions, which regulate the production of the gene product, whether or not such regulatory sequences are adjacent to coding and/or transcribed sequences. The term “gene” is used broadly to refer to any segment of nucleic acid associated with a biological function. Genes include coding sequences and/or the regulatory sequences required for their expression. Accordingly, a gene includes, but is not necessarily limited to, promoter sequences, terminators, translational regulatory sequences such as ribosome binding sites and internal ribosome entry sites, enhancers, silencers, insulators, boundary elements, replication origins, matrix attachment sites and locus control regions. For example, “gene” refers to a nucleic acid fragment that expresses mRNA, functional RNA, or specific protein, including regulatory sequences. “Functional RNA” refers to sense RNA, antisense RNA, ribozyme RNA, siRNA, or other RNA that may not be translated but yet has an effect on at least one cellular process. “Genes” also include non-expressed DNA segments that, for example, form recognition sequences for other proteins. “Genes” can be obtained from a variety of sources, including cloning from a source of interest or synthesizing from known or predicted sequence information, and may include sequences designed to have desired parameters.
“Gene expression” refers to the conversion of the information, contained in a gene, into a gene product. It refers to the transcription and/or translation of an endogenous gene, heterologous gene or nucleic acid segment, or a transgene in cells. In addition, expression refers to the transcription and stable accumulation of sense (mRNA) or functional RNA. Expression may also refer to the production of protein. The term “altered level of expression” refers to the level of expression in transgenic cells or organisms that differs from that of normal or untransformed cells or organisms.
A gene product can be the transcriptional product of a gene (e.g., mRNA, tRNA, rRNA, antisense RNA, ribozyme, structural RNA or any other type of RNA) or a protein produced by translation of an mRNA. Gene products also include RNAs that are modified, by processes such as capping, polyadenylation, methylation, and editing, and proteins modified by, for example, methylation, acetylation, phosphorylation, ubiquitination, ADP-ribosylation, myristoylation, and glycosylation. The term “RNA transcript” refers to the product resulting from RNA polymerase-catalyzed transcription of a DNA sequence. When the RNA transcript is a complementary copy of the DNA sequence, it is referred to as the primary transcript; an RNA sequence derived from post-transcriptional processing of the primary transcript is referred to as the mature RNA. “Messenger RNA” (mRNA) refers to the RNA that lacks introns and that can be translated into protein by the cell. “cDNA” refers to a single- or a double-stranded DNA that is complementary to and derived from mRNA. “Functional RNA” refers to sense RNA, antisense RNA, ribozyme RNA, siRNA, or other RNA that may not be translated but yet has an effect on at least one cellular process.
A “coding sequence” or a sequence that “encodes” a polypeptide is a nucleic acid molecule that is transcribed (in the case of DNA) and/or translated (in the case of mRNA) into a polypeptide in vivo when placed under the control of appropriate regulatory sequences. The boundaries of the coding sequence are determined by a start codon at the 5′ (amino) terminus and a translation stop codon at the 3′ (carboxy) terminus. A coding sequence can include, but is not limited to, cDNA from viral, prokaryotic or eukaryotic mRNA, genomic DNA sequences from viral (e.g., DNA viruses and retroviruses) or prokaryotic DNA, and synthetic DNA sequences. A transcription termination sequence can be located 3′ to the coding sequence.
“Regulatory sequences” and “suitable regulatory sequences” each refer to nucleotide sequences located upstream (5′ non-coding sequences), within, or downstream (3′ non-coding sequences) of a coding sequence, and which influence the transcription, RNA processing or stability, or translation of the associated coding sequence. Regulatory sequences include enhancers, promoters, translation leader sequences, introns, and polyadenylation signal sequences. They include natural and synthetic sequences as well as sequences that may be a combination of synthetic and natural sequences.
Certain embodiments of the disclosure encompass isolated or substantially purified nucleic acid compositions. In the context of the present disclosure, an “isolated” or “purified” DNA molecule or RNA molecule is a DNA molecule or RNA molecule that exists apart from its native environment and is, therefore, not a product of nature. An isolated DNA molecule or RNA molecule may exist in a purified form or may exist in a non-native environment such as, for example, a transgenic host cell. For example, an “isolated” or “purified” nucleic acid molecule is substantially free of other cellular material, or culture medium when produced by recombinant techniques, or substantially free of chemical precursors or other chemicals when chemically synthesized. In one embodiment, an “isolated” nucleic acid is free of sequences that naturally flank the nucleic acid (i.e., sequences located at the 5′ and 3′ ends of the nucleic acid) in the genomic DNA of the organism from which the nucleic acid is derived.
By “fragment” is intended a polypeptide consisting of only a part of the intact full-length polypeptide sequence and structure. The fragment can include a C-terminal deletion, an N-terminal deletion, and/or an internal deletion of the native polypeptide. A fragment of a protein will generally include at least about 5-100 contiguous amino acid residues of the full-length molecule (e.g., at least about 15-25 contiguous amino acid residues of the full-length molecule, at least about 20-50 or more contiguous amino acid residues of the full-length molecule, or any integer between 5 amino acids and the full-length sequence).
“Naturally occurring” is used to describe a composition that can be found in nature as distinct from being artificially produced. For example, a nucleotide sequence present in an organism, which can be isolated from a source in nature and which has not been intentionally modified by a person in the laboratory, is naturally occurring.
A “5′ non-coding sequence” refers to a nucleotide sequence located 5′ (upstream) to the coding sequence. 5′ non-coding sequences are present in the fully processed mRNA upstream of the initiation codon and may affect processing of the primary transcript to mRNA, mRNA stability or translation efficiency. A “3′ non-coding sequence” refers to nucleotide sequences located 3′ (downstream) to a coding sequence and may include polyadenylation signal sequences and other sequences encoding regulatory signals capable of affecting mRNA processing or gene expression.
A “promoter” refers to a nucleotide sequence, usually upstream (5′) to its coding sequence, which directs and/or controls the expression of the coding sequence by providing the recognition for RNA polymerase and other factors required for proper transcription. “Promoter” can include a minimal promoter that is a short DNA sequence comprised of a TATA-box and other sequences that serve to specify the site of transcription initiation, to which regulatory elements are added for control of expression. “Promoter” also can refer to a nucleotide sequence that includes a minimal promoter plus one or more regulatory elements (e.g., enhancers) that are capable of controlling the expression of a coding sequence or functional RNA. Promoters may be derived in their entirety from a native sequence or be composed of different elements derived from different promoters found in nature, or even be comprised of synthetic DNA sequences. A promoter may also contain DNA sequences that are involved in the binding of protein factors that control the effectiveness of transcription initiation in response to physiological or developmental conditions. “Constitutive expression” refers to expression using a constitutive promoter. “Conditional” and “regulated expression” refer to expression controlled by a regulated promoter.
An “enhancer” is a DNA sequence that can stimulate promoter activity. An enhancer may be an innate element of the promoter or a heterologous element inserted to enhance the level or tissue specificity of a promoter. Enhancers often are capable of operating in both orientations and are capable of functioning even when moved either upstream or downstream from the promoter. Both enhancers and other regulatory elements within a promoter bind sequence-specific DNA-binding proteins that mediate their effects.
“Operably linked” refers to the association of nucleic acid sequences on a single nucleic acid fragment so that the function of one of the sequences is affected by another. For example, a regulatory DNA sequence is said to be “operably linked to” or “associated with” a DNA sequence that codes for an RNA or a polypeptide if the two sequences are situated such that the regulatory DNA sequence affects expression of the coding DNA sequence (i.e., that the coding sequence or functional RNA is under the transcriptional control of the promoter). Coding sequences can be operably linked to regulatory sequences in sense or antisense orientation.
“Expression” refers to the transcription and/or translation of an endogenous gene, heterologous gene or nucleic acid segment, or a transgene in cells. In addition, expression refers to the transcription and stable accumulation of sense (mRNA) or functional RNA. Expression may also refer to the production of protein. The term “altered level of expression” refers to a level of expression in cells or organisms that differs from that of normal cells or organisms.
For sequence comparison, typically one sequence acts as a reference sequence to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input into a computer, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated algorithm parameters.
The following terms are used to describe the sequence relationships between two or more nucleic acids or polynucleotides: (a) “reference sequence,” (b) “comparison window,” (c) “sequence identity,” (d) “percentage of sequence identity,” and (e) “substantial identity.” As used herein, “reference sequence” is a defined sequence used as a basis for sequence comparison. A reference sequence may be a subset or the entirety of a specified sequence; for example, as a segment of a full-length cDNA or gene sequence, or the complete cDNA or gene sequence. As used herein, “comparison window” makes reference to a contiguous and specified segment of a polynucleotide sequence, wherein the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. Generally, the comparison window is at least 20 contiguous nucleotides in length, and optionally can be 30, 40, 50, 100, or longer. Those of skill in the art understand that, to avoid a high similarity to a reference sequence due to inclusion of gaps in the polynucleotide sequence, a gap penalty is typically introduced and is subtracted from the number of matches.
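By way of non-limiting illustration, the role of the gap penalty can be sketched in Python for two sequences that have already been aligned over a comparison window. The function name, the use of “-” to mark a gap, the default penalty of 1 per gap column, and the choice to divide by the number of alignment columns are assumptions made for this example; alignment programs differ in these conventions.

```python
def aligned_identity(ref_aln, test_aln, gap_penalty=1.0):
    """Return (percent identity, gap-penalized match score) for an alignment.

    Both inputs are rows of the same alignment, with '-' marking a gap.
    """
    if len(ref_aln) != len(test_aln):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    gaps = 0
    for a, b in zip(ref_aln.upper(), test_aln.upper()):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
    columns = len(ref_aln)
    percent_identity = 100.0 * matches / columns if columns else 0.0
    # The penalty is subtracted from the number of matches, so that an
    # alignment cannot look better merely by introducing gaps.
    penalized_score = matches - gap_penalty * gaps
    return percent_identity, penalized_score
```

For the hypothetical rows ACGT-ACGT and ACGTTACGA there are 7 matches, 1 gap column and 1 mismatch over 9 columns, giving a penalized score of 6 with the default penalty.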
Methods of alignment of sequences for comparison are well-known in the art. Thus, the determination of percent identity between any two sequences can be accomplished using a mathematical algorithm. Non-limiting examples of such mathematical algorithms are the algorithm of Myers and Miller (Myers and Miller, CABIOS, 4, 11 (1988)); the local homology algorithm of Smith et al. (Smith et al., Adv. Appl. Math., 2, 482 (1981)); the homology alignment algorithm of Needleman and Wunsch (Needleman and Wunsch, JMB, 48, 443 (1970)); the search-for-similarity-method of Pearson and Lipman (Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85, 2444 (1988)); the algorithm of Karlin and Altschul (Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 87, 2264 (1990)), modified as in Karlin and Altschul (Karlin and Altschul, Proc. Natl. Acad. Sci. USA 90, 5873 (1993)).
Computer implementations of these mathematical algorithms can be utilized for comparison of sequences to determine sequence identity. Such implementations include but are not limited to: CLUSTAL in the PC/Gene program (available from Intelligenetics, Mountain View, Calif.); the ALIGN program (Version 2.0) and GAP, BESTFIT, BLAST, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Version 8 (available from Genetics Computer Group (GCG), 575 Science Drive, Madison, Wis., USA). Alignments using these programs can be performed using the default parameters. The CLUSTAL program is well described by Higgins et al. (Higgins et al., CABIOS, 5, 151 (1989)); Corpet et al. (Corpet et al., Nucl. Acids Res., 16, 10881 (1988)); Huang et al. (Huang et al., CABIOS, 8, 155 (1992)); and Pearson et al. (Pearson et al., Meth. Mol. Biol., 24, 307 (1994)). The ALIGN program is based on the algorithm of Myers and Miller, supra. The BLAST programs of Altschul et al. (Altschul et al., J. Mol. Biol., 215, 403 (1990)) are based on the algorithm of Karlin and Altschul, supra.
Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information. This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length “W” in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. “T” is referred to as the neighborhood word score threshold. These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters “M” (reward score for a pair of matching residues; always >0) and “N” (penalty score for mismatching residues; always <0), and for amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when the cumulative alignment score falls off by the quantity “X” from its maximum achieved value, the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments, or the end of either sequence is reached.
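By way of non-limiting illustration, the seed-and-extend procedure described above can be sketched in Python for nucleotide sequences. This is a simplified teaching sketch and not the BLAST implementation: it seeds only on exact word matches (the neighborhood threshold T is not modeled), it performs ungapped extension, and of the halting conditions it applies only the drop-off X and the end of either sequence. The function names and the default values of W and X are assumptions made for this example.

```python
def extend(query, subject, qi, si, step, m, n, x):
    """Extend from (qi, si) in direction step (+1 or -1).

    Returns (best score gain, number of residues included at that best).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += m if query[qi] == subject[si] else n
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x:
            break  # score has fallen off by X from its maximum
        qi += step
        si += step
    return best, best_len


def seed_and_extend(query, subject, w=4, m=5, n=-4, x=10):
    """Return the best (score, query start, subject start, length), or None."""
    # Index every word of length w in the subject sequence.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)

    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            right, rlen = extend(query, subject, i + w, j + w, 1, m, n, x)
            left, llen = extend(query, subject, i - 1, j - 1, -1, m, n, x)
            score = w * m + left + right
            hits.append((score, i - llen, j - llen, llen + w + rlen))
    return max(hits) if hits else None
```

Trailing mismatches examined during extension are not counted in the reported length, because the extent is trimmed back to the point at which the score was highest.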
In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences. One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a test nucleic acid sequence is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid sequence to the reference nucleic acid sequence is less than about 0.1, less than about 0.01, or even less than about 0.001.
To obtain gapped alignments for comparison purposes, Gapped BLAST (in BLAST 2.0) can be utilized. Alternatively, PSI-BLAST (in BLAST 2.0) can be used to perform an iterated search that detects distant relationships between molecules. When utilizing BLAST, Gapped BLAST, or PSI-BLAST, the default parameters of the respective programs (e.g., BLASTN for nucleotide sequences, BLASTX for proteins) can be used. The BLASTN program (for nucleotide sequences) uses as defaults a word length (W) of 11, an expectation (E) of 10, a cutoff of 100, M=5, N=−4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a word length (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix. Alignment may also be performed manually by inspection.
For purposes of the present disclosure, comparison of nucleotide sequences for determination of percent sequence identity to the promoter sequences disclosed herein may be made using the BlastN program (version 1.4.7 or later) with its default parameters or any equivalent program. By “equivalent program” is intended any sequence comparison program that, for any two sequences in question, generates an alignment having identical nucleotide or amino acid residue matches and an identical percent sequence identity when compared to the corresponding alignment generated by the program.
As used herein, “sequence identity” or “identity” in the context of two nucleic acid or polypeptide sequences makes reference to a specified percentage of residues in the two sequences that are the same when aligned for maximum correspondence over a specified comparison window, as measured by sequence comparison algorithms or by visual inspection. When percentage of sequence identity is used in reference to proteins it is recognized that residue positions which are not identical often differ by conservative amino acid substitutions, where amino acid residues are substituted for other amino acid residues with similar chemical properties (e.g., charge or hydrophobicity) and, therefore, do not change the functional properties of the molecule. When sequences differ in conservative substitutions, the percent sequence identity may be adjusted upwards to correct for the conservative nature of the substitution. Sequences that differ by such conservative substitutions are said to have “sequence similarity” or “similarity.” Means for making this adjustment are well known to those of skill in the art. Typically, this involves scoring a conservative substitution as a partial rather than a full mismatch, thereby increasing the percentage sequence identity. Thus, for example, where an identical amino acid is given a score of 1 and a non-conservative substitution is given a score of zero, a conservative substitution is given a score between zero and 1. The scoring of conservative substitutions is calculated, e.g., as implemented in the program PC/GENE (Intelligenetics, Mountain View, Calif.).
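By way of illustration only, the partial-credit scoring of conservative substitutions described above can be sketched in Python. The sketch does not reproduce the PC/GENE implementation: the function name adjusted_identity, the particular groupings of chemically similar amino acids, and the partial score of 0.5 are assumptions chosen for illustration, and the two sequences are assumed to be already aligned and of equal length.

```python
# Hypothetical groupings of amino acids with similar chemical properties.
CONSERVATIVE_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("ST"),    # small hydroxyl
    set("NQ"),    # amide
]


def adjusted_identity(seq_a, seq_b, conservative_score=0.5):
    """Percent identity adjusted upward for conservative substitutions.

    Identical residues score 1, conservative substitutions score
    conservative_score (between zero and 1), and all other
    substitutions score zero.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal length, non-empty")
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            total += 1.0
        elif any(a in group and b in group for group in CONSERVATIVE_GROUPS):
            total += conservative_score
    return 100.0 * total / len(seq_a)


if __name__ == "__main__":
    # M=M (1), K/R conservative (0.5), L=L (1), V/A not grouped here (0).
    print(adjusted_identity("MKLV", "MRLA"))
```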
As used herein, “percent sequence identity” means the value determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison, and multiplying the result by 100 to yield the percentage of sequence identity.
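By way of illustration only, the calculation described above can be sketched in Python for two sequences that have already been optimally aligned. The function name percent_identity and the use of a hyphen as the gap character are assumptions made for illustration; the alignment step itself is not shown.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent sequence identity over a comparison window.

    Counts positions at which the identical residue occurs in both
    aligned sequences, divides by the total number of positions in the
    window, and multiplies by 100. Gap positions never count as matches.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be equal length, non-empty")
    matched = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matched / len(aligned_a)


if __name__ == "__main__":
    # Ten aligned positions: one gap and one mismatch leave eight matches.
    print(percent_identity("ACGT-ACGTA", "ACGTTACCTA"))
```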
The term “substantial identity” of polynucleotide sequences means that a polynucleotide comprises a sequence that has at least 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, or 94%, or even at least 95%, 96%, 97%, 98%, 99% or 100% sequence identity, compared to a reference sequence using one of the alignment programs described herein using standard parameters. One of skill in the art will recognize that these values can be appropriately adjusted to determine corresponding identity of proteins encoded by two nucleotide sequences by taking into account codon degeneracy, amino acid similarity, reading frame positioning, and the like. Substantial identity of amino acid sequences for these purposes normally means sequence identity of at least 70% (e.g., 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%), at least 80% (e.g., 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%), at least 90% (e.g., 91%, 92%, 93%, or 94%), or even at least 95% (e.g., 96%, 97%, 98%, 99%, or 100%).
The term “substantial identity” in the context of a peptide indicates that a peptide comprises a sequence with at least 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, or 94%, or even 95%, 96%, 97%, 98% or 99%, sequence identity to the reference sequence over a specified comparison window. In certain embodiments, optimal alignment is conducted using the homology alignment algorithm of Needleman and Wunsch (Needleman and Wunsch, J. Mol. Biol., 48, 443 (1970)). An indication that two peptide sequences are substantially identical is that one peptide is immunologically reactive with antibodies raised against the second peptide. Thus, a peptide is substantially identical to a second peptide, for example, where the two peptides differ only by a conservative substitution. Thus, the disclosure also provides nucleic acid molecules and peptides that are substantially identical to the nucleic acid molecules and peptides presented herein.
Another indication that nucleotide sequences are substantially identical is if two molecules hybridize to each other under stringent conditions. Hybridization of nucleic acids is discussed in more detail below.
Oligonucleotide Primers and Probes
As described herein, the methods provided in this disclosure rely upon oligonucleotides, sometimes referred to as primers or probes, to identify or detect features contained within the nucleic acid obtained from an individual, subject, or patient. The term “nucleic acid probe” or a “probe specific for” a nucleic acid refers to a nucleic acid sequence that has at least about 80%, e.g., at least about 90%, e.g., at least about 95% contiguous sequence identity or homology to the nucleic acid sequence encoding the targeted sequence of interest. A probe (or oligonucleotide or primer) of the disclosure is at least about 8 nucleotides in length (e.g., at least about 8-50 nucleotides in length, e.g., at least about 10-40, e.g., at least about 15-35 nucleotides in length). The oligonucleotide probes or primers of the disclosure may comprise at least about eight nucleotides at the 3′ end of the oligonucleotide that have at least about 80%, e.g., at least about 85%, e.g., at least about 90%, e.g., at least about 95% contiguous identity to the targeted sequence of interest.
Primer pairs are useful for determination of the nucleotide sequence of a particular SNP using PCR. The pairs of single-stranded DNA primers can be annealed to sequences within or surrounding the SNP in order to prime amplifying DNA synthesis of the SNP itself. The first step of the process involves contacting a biological sample obtained from a subject, which sample contains nucleic acid, with at least one primer to form a hybridized DNA. The oligonucleotide primers that are useful in the methods of the present disclosure can be any primer comprised of about 8 bases up to about 80 or 100 bases or more. In one embodiment of the present disclosure, the primers are between about 10 and about 20 bases.
The primers themselves can be synthesized using techniques that are well known in the art. Generally, the primers can be made using oligonucleotide synthesizing machines that are commercially available.
The primers or probes of the present disclosure can be labeled using techniques known to those of skill in the art. For example, the labels used in the assays of the disclosure can be primary labels (where the label comprises an element that is detected directly) or secondary labels (where the detected label binds to a primary label, e.g., as is common in immunological labeling). An introduction to labels (also called “tags”), tagging or labeling procedures, and detection of labels is found in Polak and Van Noorden (1997) Introduction to Immunocytochemistry, second edition, Springer Verlag, N.Y. and in Haugland (1996) Handbook of Fluorescent Probes and Research Chemicals, a combined handbook and catalogue published by Molecular Probes, Inc., Eugene, Oreg. Primary and secondary labels can include undetected elements as well as detected elements. Useful primary and secondary labels in the present disclosure can include spectral labels such as fluorescent dyes (e.g., fluorescein and derivatives such as fluorescein isothiocyanate (FITC) and Oregon Green™, rhodamine and derivatives (e.g., Texas red, tetramethylrhodamine isothiocyanate (TRITC), etc.), digoxigenin, biotin, phycoerythrin, AMCA, CyDyes™, and the like), radiolabels (e.g., 3H, 125I, 35S, 14C, 32P, 33P), enzymes (e.g., horseradish peroxidase, alkaline phosphatase), and spectral colorimetric labels such as colloidal gold or colored glass or plastic (e.g., polystyrene, polypropylene, latex) beads. The label may be coupled directly or indirectly to a component of the detection assay (e.g., the labeled nucleic acid) according to methods well known in the art. As indicated above, a wide variety of labels may be used, with the choice of label depending on sensitivity required, ease of conjugation with the compound, stability requirements, available instrumentation, and disposal provisions.
In general, a detector that monitors a probe-substrate nucleic acid hybridization is adapted to the particular label that is used. Typical detectors include spectrophotometers, phototubes and photodiodes, microscopes, scintillation counters, cameras, film and the like, as well as combinations thereof. Examples of suitable detectors are widely available from a variety of commercial sources known to persons of skill. Commonly, an optical image of a substrate comprising bound labeled nucleic acids is digitized for subsequent computer analysis.
Labels include those that use (1) chemiluminescence (using Horseradish Peroxidase and/or Alkaline Phosphatase with substrates that produce photons as breakdown products) with kits being available, e.g., from Molecular Probes, Amersham, Boehringer-Mannheim, and Life Technologies/Gibco BRL; (2) color production (using Horseradish Peroxidase and/or Alkaline Phosphatase with substrates that produce a colored precipitate) (kits available from Life Technologies/Gibco BRL, and Boehringer-Mannheim); (3) chemifluorescence using, e.g., Alkaline Phosphatase and the substrate AttoPhos (Amersham) or other substrates that produce fluorescent products; (4) fluorescence (e.g., using Cy-5 (Amersham), fluorescein, and other fluorescent labels); and (5) radioactivity using kinase enzymes or other end-labeling approaches, nick translation, random priming, or PCR to incorporate radioactive molecules into the labeled nucleic acid. Other methods for labeling and detection will be readily apparent to one skilled in the art.
Fluorescent labels can be used and have the advantage of requiring fewer precautions in handling and being amenable to high-throughput visualization techniques (optical analysis including digitization of the image for analysis in an integrated system comprising a computer). Preferred labels are typically characterized by one or more of the following: high sensitivity, high stability, low background, low environmental sensitivity and high specificity in labeling. Fluorescent moieties, which can be incorporated into a label, generally are known including Texas red, digoxigenin, biotin, 1- and 2-aminonaphthalene, p,p′-diaminostilbenes, pyrenes, quaternary phenanthridine salts, 9-aminoacridines, p,p′-diaminobenzophenone imines, anthracenes, oxacarbocyanine, merocyanine, 3-aminoequilenin, perylene, bis-benzoxazole, bis-p-oxazolyl benzene, 1,2-benzophenazin, retinol, bis-3-aminopyridinium salts, hellebrigenin, tetracycline, sterophenol, benzimidazolylphenylamine, 2-oxo-3-chromen, indole, xanthen, 7-hydroxycoumarin, phenoxazine, salicylate, strophanthidin, porphyrins, triarylmethanes, flavin and many others. Many fluorescent labels are commercially available from the SIGMA Chemical Company (Saint Louis, MO), Molecular Probes, R&D systems (Minneapolis, MN), Pharmacia LKB Biotechnology (Piscataway, NJ), CLONTECH Laboratories, Inc. (Palo Alto, CA), Chem Genes Corp., Aldrich Chemical Company (Milwaukee, WI), Glen Research, Inc., GIBCO BRL Life Technologies, Inc. (Gaithersburg, MD), Fluka Chemica-Biochemika Analytika (Fluka Chemie AG, Buchs, Switzerland), and Applied Biosystems™ (Foster City, CA), as well as many other commercial sources known to one of skill.
Means of detecting and quantifying labels are well known to those of skill in the art. Thus, for example, when the label is a radioactive label, means for detection include a scintillation counter or photographic film as in autoradiography; and when the label is optically detectable, typical detectors include microscopes, cameras, phototubes, photodiodes, and many other detection systems that are widely available.
Oligonucleotide primers or probes may be prepared having any of a wide variety of base sequences according to techniques that are well known in the art. Suitable bases for preparing an oligonucleotide primer or probe may be selected from naturally occurring nucleotide bases such as adenine, cytosine, guanine, uracil, and thymine; and non-naturally occurring or “synthetic” nucleotide bases such as 7-deaza-guanine, 8-oxo-guanine, 6-mercaptoguanine, 4-acetylcytidine, 5-(carboxyhydroxyethyl)uridine, 2′-O-methylcytidine, 5-carboxymethylamino-methyl-2-thiouridine, 5-carboxymethylaminomethyluridine, dihydrouridine, 2′-O-methylpseudouridine, β,D-galactosylqueosine, 2′-O-methylguanosine, inosine, N6-isopentenyladenosine, 1-methyladenosine, 1-methylpseudouridine, 1-methylguanosine, 1-methylinosine, 2,2-dimethylguanosine, 2-methyladenosine, 2-methylguanosine, 3-methylcytidine, 5-methylcytidine, N6-methyladenosine, 7-methylguanosine, 5-methylaminomethyluridine, 5-methoxyaminomethyl-2-thiouridine, β,D-mannosylqueosine, 5-methoxycarbonylmethyluridine, 5-methoxyuridine, 2-methylthio-N6-isopentenyladenosine, N-((9-β-D-ribofuranosyl-2-methylthiopurine-6-yl)carbamoyl)threonine, N-((9-β-D-ribofuranosylpurine-6-yl)N-methyl-carbamoyl)threonine, uridine-5-oxyacetic acid methylester, uridine-5-oxyacetic acid, wybutoxosine, pseudouridine, queosine, 2-thiocytidine, 5-methyl-2-thiouridine, 2-thiouridine, 5-methyluridine, N-((9-beta-D-ribofuranosylpurine-6-yl)carbamoyl)threonine, 2′-O-methyl-5-methyluridine, 2′-O-methyluridine, wybutosine, and 3-(3-amino-3-carboxypropyl)uridine. Any oligonucleotide backbone may be employed, including DNA, RNA (although RNA is less preferred than DNA), modified sugars such as carbocycles, and sugars containing 2′ substitutions such as fluoro and methoxy.
The oligonucleotides may be oligonucleotides wherein at least one, or all, of the internucleotide bridging phosphate residues are modified phosphates, such as methyl phosphonates, methyl phosphonothioates, phosphoromorpholidates, phosphoropiperazidates and phosphoramidates (for example, every other one of the internucleotide bridging phosphate residues may be modified as described). The oligonucleotide may be a “peptide nucleic acid” such as described in Nielsen et al., Science, 254:1497-1500 (1991).
As used herein, a “single base pair extension probe” is a nucleic acid that selectively recognizes a single nucleotide polymorphism (i.e., either the A or the G of an A/G polymorphism). Generally, these probes take the form of a DNA primer (e.g., as in PCR primers) that is modified so that incorporation of the primer releases a fluorophore. One example of this is a TaqMan® probe that uses the 5′ exonuclease activity of the enzyme Taq Polymerase for measuring the amount of target sequences in the samples. TaqMan® probes consist of an 18-22 bp oligonucleotide probe, which is labeled with a reporter fluorophore at the 5′ end, and a quencher fluorophore at the 3′ end. Incorporation of the probe molecule into a PCR chain (which occurs because the probe set is contained in a mixture of PCR primers) liberates the reporter fluorophore from the effects of the quencher. The primer must be able to recognize the target binding site. Some primer extension probes can be “activated” directly by DNA polymerase without a full PCR extension cycle.
The only requirement is that the oligonucleotide probe should possess a sequence at least a portion of which is capable of binding to a known portion of the sequence of the DNA sample. The nucleic acid probes provided by the present disclosure are useful for a number of purposes.
Methods of Detecting Nucleic Acids
A. Amplification
According to the methods of the present disclosure, the amplification of DNA present in a biological sample may be carried out by any means known in the art. Examples of suitable amplification techniques include, but are not limited to, polymerase chain reaction (including, for RNA amplification, reverse-transcriptase polymerase chain reaction), ligase chain reaction, strand displacement amplification, transcription-based amplification, self-sustained sequence replication (or “3SR”), the Qbeta replicase system, nucleic acid sequence-based amplification (or “NASBA”), the repair chain reaction (or “RCR”), and boomerang DNA amplification (or “BDA”).
The bases incorporated into the amplification product can be natural or modified bases (modified before or after amplification), and the bases can be selected to optimize subsequent detection steps (e.g., electrochemical detection steps).
Polymerase chain reaction (PCR) can be carried out in accordance with known techniques. See, e.g., U.S. Pat. Nos. 4,683,195; 4,683,202; 4,800,159; and 4,965,188. In general, PCR involves, first, treating a nucleic acid sample (e.g., in the presence of a heat stable DNA polymerase) with one oligonucleotide primer for each strand of the specific sequence to be detected under hybridizing conditions so that an extension product of each primer is synthesized that is complementary to each nucleic acid strand, with the primers sufficiently complementary to each strand of the specific sequence to hybridize therewith so that the extension product synthesized from each primer, when it is separated from its complement, can serve as a template for synthesis of the extension product of the other primer, and then treating the sample under denaturing conditions to separate the primer extension products from their templates if the sequence or sequences to be detected are present. These steps are cyclically repeated until the desired degree of amplification is obtained. Detection of the amplified sequence may be carried out by adding, to the reaction product, an oligonucleotide probe capable of hybridizing to the reaction product (e.g., an oligonucleotide primer or probe of the present disclosure), the probe carrying a detectable label, and then detecting the label in accordance with known techniques. Various labels that can be incorporated into or operably linked to nucleic acids are well known in the art, such as radioactive, enzymatic, and fluorescent labels. Where the nucleic acid to be amplified is RNA, amplification may be carried out by initial conversion to DNA by reverse transcriptase in accordance with known techniques.
Strand displacement amplification (SDA) can be carried out in accordance with known techniques. For example, SDA can be carried out with a single amplification primer or a pair of amplification primers, with exponential amplification being achieved with the latter. In general, SDA amplification primers comprise, in the 5′ to 3′ direction, a flanking sequence (the DNA sequence of which is noncritical), a restriction site for the restriction enzyme employed in the reaction, and an oligonucleotide sequence (e.g., an oligonucleotide primer or probe as described herein) that hybridizes to the target sequence to be amplified and/or detected. The flanking sequence, which serves to facilitate binding of the restriction enzyme to the recognition site and provides a DNA polymerase priming site after the restriction site has been nicked, can be about 15 to 20 nucleotides in length. The restriction site is functional in the SDA reaction. For example, the oligonucleotide primer or probe portion can be about 13 to 15 nucleotides in length.
Ligase chain reaction (LCR) also can be carried out in accordance with known techniques. In general, the reaction is carried out with two pairs of oligonucleotide probes: one pair binds to one strand of the sequence to be detected; the other pair binds to the other strand of the sequence to be detected. Each pair together completely overlaps the strand to which it corresponds. The reaction is carried out by, first, denaturing (e.g., separating) the strands of the sequence to be detected, then reacting the strands with the two pairs of oligonucleotide probes in the presence of a heat stable ligase so that each pair of oligonucleotide probes is ligated together, then separating the reaction product, and then cyclically repeating the process until the sequence has been amplified to the desired degree. Detection then can be carried out in like manner as described above with respect to PCR.
According to the methods described herein, a particular SNP at a particular locus can be detected. Techniques that are useful in the methods described herein include, but are not limited to, direct DNA sequencing, PFGE analysis, allele-specific oligonucleotide (ASO), dot blot analysis and denaturing gradient gel electrophoresis, and are well known to a skilled artisan.
There are several methods that can be used to detect DNA sequence variation. Direct DNA sequencing, either manual sequencing or automated fluorescent sequencing can detect sequence variation. Another approach is the single-stranded conformation polymorphism assay (SSCA). This method does not detect all sequence changes, especially if the DNA fragment size is greater than 200 bp but can be optimized to detect most DNA sequence variation. The reduced detection sensitivity is a disadvantage, but the increased throughput possible with SSCA makes it an attractive, viable alternative to direct sequencing for mutation detection on a research basis. The fragments that have shifted mobility on SSCA gels then can be sequenced to determine the exact nature of the DNA sequence variation. Other approaches based on the detection of mismatches between the two complementary DNA strands include clamped denaturing gel electrophoresis (CDGE), heteroduplex analysis (HA) and chemical mismatch cleavage (CMC). Once a sequence change has been identified, an allele specific detection approach such as allele specific oligonucleotide (ASO) hybridization can be utilized to rapidly screen large numbers of other samples for that same sequence change (e.g., mutation, polymorphism). Such a technique can utilize probes that are labeled with gold nanoparticles to yield a visual color result.
Detection of SNPs can be accomplished by sequencing the desired target region using techniques well known in the art. Alternatively, sequences can be amplified directly from a genomic DNA preparation from subject tissue using known techniques. The DNA sequence of the amplified sequences then can be determined.
There are several well-known methods for a more complete, yet still indirect, test for confirming the presence of a mutant allele: 1) single stranded conformation analysis (SSCA); 2) denaturing gradient gel electrophoresis (DGGE); 3) RNase protection assays; 4) allele-specific oligonucleotides (ASOs); 5) the use of proteins which recognize nucleotide mismatches, such as the E. coli mutS protein; and/or 6) allele-specific PCR. For allele-specific PCR, primers are used that hybridize at their 3′ ends to a particular allele. If the particular mutation is not present, an amplification product is not observed. Amplification Refractory Mutation System (ARMS) can also be used. Insertions and deletions of genes can also be detected by cloning, sequencing, and amplification. In addition, restriction fragment length polymorphism (RFLP) probes for the gene or surrounding marker genes can be used to score alteration of an allele or an insertion in a polymorphic fragment. Other techniques for detecting insertions and deletions as known in the art can be used.
In the first three methods (SSCA, DGGE and RNase protection assay), a new electrophoretic band appears. SSCA detects a band that migrates differentially because the sequence change causes a difference in single-strand, intramolecular base pairing. RNase protection involves cleavage of the mutant polynucleotide into two or more smaller fragments. DGGE detects differences in migration rates of mutant sequences compared to wild-type sequences, using a denaturing gradient gel. In an allele-specific oligonucleotide assay, an oligonucleotide is designed which detects a specific sequence, and the assay is performed by detecting the presence or absence of a hybridization signal. In the mutS assay, the protein binds only to sequences that contain a nucleotide mismatch in a heteroduplex between mutant and wild-type sequences.
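By way of illustration only, the logic of the allele-specific PCR approach noted above, in which a primer hybridizes at its 3′ end to a particular allele and no amplification product is observed when that allele is absent, can be sketched in Python. The sketch is a deliberately simplified, all-or-none model and does not represent reaction kinetics: the function names allele_specific_primers and product_expected, and the example flanking sequence, are hypothetical, and primers are written 5′ to 3′ in the same sense as the template strand shown.

```python
def allele_specific_primers(upstream_flank, alleles):
    """Build one primer per allele, each ending (3') on the SNP base."""
    return {allele: upstream_flank + allele for allele in alleles}


def product_expected(primer, template):
    """Simplified model: a product is expected only if the primer body
    anneals and the 3'-terminal base matches the template at the SNP."""
    body, three_prime_base = primer[:-1], primer[-1]
    pos = template.find(body)          # first annealing site only
    if pos == -1:
        return False                   # primer body does not anneal
    snp_index = pos + len(body)
    return snp_index < len(template) and template[snp_index] == three_prime_base


if __name__ == "__main__":
    flank = "GATTACAGGCTT"             # hypothetical upstream sequence
    primers = allele_specific_primers(flank, ("A", "G"))
    template_a = flank + "A" + "CCGTA" # sample carrying the A allele
    for allele, primer in primers.items():
        print(allele, product_expected(primer, template_a))
```

Running one reaction per allele-specific primer in this manner would indicate which allele or alleles are present in the sample.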
Mismatches, according to the present disclosure, are hybridized nucleic acid duplexes in which the two strands are not 100% complementary. Lack of total homology may be due to deletions, insertions, inversions, or substitutions. Mismatch detection can be used to detect point mutations in the gene or in its mRNA product. While these techniques are less sensitive than sequencing, they are simpler to perform on a large number of samples. An example of a mismatch cleavage technique is the RNase protection method. The riboprobe and either mRNA or DNA isolated from the tumor tissue are annealed (hybridized) together and subsequently digested with the enzyme RNase A to detect some mismatches in a duplex RNA structure. If a mismatch is detected by RNase A, it cleaves at the site of the mismatch. Thus, when the annealed RNA preparation is separated on an electrophoretic gel matrix, if a mismatch has been detected and cleaved by RNase A, an RNA product will be seen which is smaller than the full-length duplex RNA for the riboprobe and the mRNA or DNA. The riboprobe need not be the full length of the mRNA or gene but can be a segment of either. If the riboprobe includes only a segment of the mRNA or gene, it will be desirable to use a number of these probes to screen the whole mRNA sequence for mismatches.
In similar fashion, DNA probes can be used to detect mismatches, through enzymatic or chemical cleavage. Alternatively, mismatches can be detected by shifts in the electrophoretic mobility of mismatched duplexes relative to matched duplexes. With either riboprobes or DNA probes, the cellular mRNA or DNA that might contain a mutation can be amplified using PCR before hybridization.
B. Sequencing
Due to its sensitivity and relative simplicity in terms of both workflow and technique, Sanger sequencing is used in a variety of applications from targeted sequencing to confirming variants identified using orthogonal methods. Sanger sequencing utilizes a chain-termination method to provide the identity and order of nucleotide bases in a given strand of DNA. This method makes use of chemical analogues of the four nucleotide bases (i.e., ddNTPs), which are missing the hydroxyl group required for extension of the polynucleotide chains that form the DNA molecule. By mixing radiolabeled, and, later, fluorescent labeled, ddNTPs with template DNA, strands of each possible length are produced when the ddNTPs get randomly incorporated, terminating the chain. In contrast to Sanger sequencing, the method of Maxam and Gilbert uses a chemical cleavage technique. The chief advantages of the Maxam-Gilbert technique compared with Sanger's method are that sequencing can be done from the original DNA fragment, instead of from enzymatic copies, no PCR is required, and the method is less susceptible to errors arising from secondary structures or enzymatic mistakes. The products generated in sequencing reactions can be resolved by ionophoresis on acrylamide gels or using capillary electrophoresis.
Another sequencing technique referred to as pyrosequencing was developed that uses a two-enzyme process in which adenosine triphosphate (ATP) sulfurylase is used to convert pyrophosphate into ATP, which is then used as the substrate for luciferase, thus producing light proportional to the amount of pyrophosphate. Additional approaches for sequencing nucleic acids include emulsion polymerase chain reaction (PCR), reversible terminator, and sequencing by oligonucleotide ligation and detection. Capillary electrophoresis (CE) instruments also can be used for sequencing.
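By way of illustration only, the proportionality between light output and released pyrophosphate in pyrosequencing can be sketched in Python. The sketch is an idealized model and not an instrument algorithm: the function name pyrogram, the cyclic dispensation order, and the use of an integer count of incorporated nucleotides as a stand-in for light intensity are assumptions made for illustration.

```python
def pyrogram(synthesized_strand, dispensation_order="ACGT", max_cycles=50):
    """Idealized pyrosequencing signal for a strand being synthesized.

    Nucleotides are dispensed one at a time in a fixed cyclic order.
    Each incorporation releases one pyrophosphate, so the signal for a
    dispensation equals the number of consecutive bases incorporated
    (the homopolymer run length), or zero if none is incorporated.
    """
    signals = []
    pos = 0
    for _ in range(max_cycles):        # guard against unexpected symbols
        for nucleotide in dispensation_order:
            if pos >= len(synthesized_strand):
                return signals         # strand fully synthesized
            run = 0
            while pos < len(synthesized_strand) and synthesized_strand[pos] == nucleotide:
                run += 1
                pos += 1
            signals.append((nucleotide, run))
    return signals


if __name__ == "__main__":
    # "AACGGT": two A, one C, two G, one T incorporated in a single cycle.
    print(pyrogram("AACGGT"))
```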
High-throughput sequencing techniques also have been developed, termed next-generation sequencing (NGS). NGS is massively parallel, sequencing millions of fragments simultaneously per run. This high-throughput process translates into sequencing hundreds to thousands of genes at one time. NGS also offers greater discovery power to detect novel or rare variants with deep sequencing. The spectrum of analysis of NGS can extend from a small number of genes to an entire genome. Whole-genome sequencing (WGS) and whole-exome sequencing (WES) provide the sequence of DNA bases across the genome and exome, respectively. Whole-transcriptome sequencing provides sequence information about coding and multiple noncoding forms of RNA to assess variations and gene expression levels across the entire transcriptome. Targeted sequencing covers a relatively small set of genes or targeted regions of interest (e.g., to determine the presence or absence of a SNP). Real-time sequencing and single-molecule sequencing (SMS), capable of accurately sequencing long strands of nucleic acid without an intermediary and without previous transcription or amplification also have been developed.
Nanopore-based sequencing technology detects the unique electrical signals of different molecules as they pass through the nanopore with a semiconductor-based electronic detection system. This technology makes for a high throughput, cost effective sequencing solution. At the heart of the technology is the biological nanopore, a protein pore embedded in a membrane, while the brains of the technology lie in the electronics of a semiconductor integrated circuit and proprietary chemistries. The electronic sensor technology embedded in the chip enables automatic membrane assembly and nanopore insertion, while allowing for active control of individual sensors on the circuit. See, e.g., Oxford Nanopore and Pacific Biosciences. Nanopore and electronic sensor sequencing technology can be used to directly determine DNA methylation using native, non-bisulfite DNA.
C. Hybridization
The phrase “hybridizing specifically to” refers to the binding, duplexing, or hybridizing of a molecule only to a particular nucleotide sequence under stringent conditions when that sequence is present in a complex mixture of (e.g., total cellular) DNA or RNA. “Bind(s) substantially” refers to complementary hybridization between a primer or probe nucleic acid and a target nucleic acid and embraces minor mismatches that can be accommodated by reducing the stringency of the hybridization media to achieve the desired detection of the target nucleic acid sequence.
Generally, stringent conditions are selected to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. However, stringent conditions encompass temperatures in the range of about 1° C. to about 20° C. lower than the Tm, depending upon the desired degree of stringency as otherwise qualified herein. Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the polypeptides they encode are substantially identical. This may occur, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code. One indication that two nucleic acid sequences are substantially identical is when the polypeptide encoded by the first nucleic acid is immunologically cross reactive with the polypeptide encoded by the second nucleic acid.
“Stringent conditions” are those that (1) employ low ionic strength and high temperature for washing, for example, 0.015 M NaCl/0.0015 M sodium citrate (SSC); 0.1% sodium lauryl sulfate (SDS) at 50° C., or (2) employ a denaturing agent such as formamide during hybridization, e.g., 50% formamide with 0.1% bovine serum albumin/0.1% Ficoll/0.1% polyvinylpyrrolidone/50 mM sodium phosphate buffer at pH 6.5 with 750 mM NaCl, 75 mM sodium citrate at 42° C. Another example is use of 50% formamide, 5×SSC (0.75 M NaCl, 0.075 M sodium citrate), 50 mM sodium phosphate (pH 6.8), 0.1% sodium pyrophosphate, 5×Denhardt's solution, sonicated salmon sperm DNA (50 μg/ml), 0.1% SDS, and 10% dextran sulfate at 42° C., with washes at 42° C. in 0.2×SSC and 0.1% SDS. Other examples of stringent conditions are well known in the art.
“Stringent hybridization conditions” and “stringent hybridization wash conditions” in the context of nucleic acid hybridization experiments such as Southern and Northern hybridizations are sequence dependent and are different under different environmental parameters. Longer sequences hybridize specifically at higher temperatures. The thermal melting point (Tm) is the temperature (under defined ionic strength and pH) at which 50% of the target sequence hybridizes to a perfectly matched primer or probe sequence. Specificity is typically the function of post-hybridization washes, the critical factors being the ionic strength and temperature of the final wash solution. For DNA-DNA hybrids, the Tm can be approximated from the equation of Meinkoth and Wahl (1984, Anal. Biochem., 138(2):267-84): Tm=81.5° C.+16.6 (log M)+0.41 (% GC)−0.61 (% form)−500/L; where M is the molarity of monovalent cations, % GC is the percentage of guanosine and cytosine nucleotides in the DNA, % form is the percentage of formamide in the hybridization solution, and L is the length of the hybrid in base pairs. Tm is reduced by about 1° C. for each 1% of mismatching; thus, Tm, hybridization, and/or wash conditions can be adjusted to hybridize to sequences of the desired identity. For example, if sequences with >90% identity are sought, the Tm can be decreased 10° C. Generally, stringent conditions are selected to be about 5° C. lower than the Tm for the specific sequence and its complement at a defined ionic strength and pH. However, severely stringent conditions can utilize a hybridization and/or wash at 1, 2, 3, or 4° C. lower than the Tm; moderately stringent conditions can utilize a hybridization and/or wash at 6, 7, 8, 9, or 10° C. lower than the Tm; low stringency conditions can utilize a hybridization and/or wash at 11, 12, 13, 14, 15, or 20° C. lower than the Tm.
Using the equation, hybridization and wash compositions, and desired temperature, those of ordinary skill will understand that variations in the stringency of hybridization and/or wash solutions are inherently described. If the desired degree of mismatching results in a temperature of less than 45° C. (aqueous solution) or 32° C. (formamide solution), the SSC concentration can be increased so that a higher temperature can be used. Generally, highly stringent hybridization and wash conditions are selected to be about 5° C. lower than the Tm for the specific sequence at a defined ionic strength and pH.
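Simply by way of illustration, the Tm approximation and the temperature adjustments discussed above can be sketched in Python. This is a minimal sketch and not a validated design tool: the function and parameter names are hypothetical, the logarithm is assumed to be base 10, and combining the mismatch allowance with the offset below the Tm in a single subtraction is an assumption made for the example.

```python
import math


def melting_temperature(molarity, percent_gc, percent_formamide, length_bp):
    """Approximate Tm (deg C) of a DNA-DNA hybrid from the equation in the text.

    Assumption: "log M" is the base-10 logarithm of the monovalent cation molarity.
    """
    if molarity <= 0 or length_bp <= 0:
        raise ValueError("molarity and length_bp must be positive")
    return (81.5
            + 16.6 * math.log10(molarity)
            + 0.41 * percent_gc
            - 0.61 * percent_formamide
            - 500.0 / length_bp)


def target_temperature(tm, percent_mismatch=0.0, offset_below_tm=5.0):
    """Hybridization/wash temperature for a desired stringency.

    Tm drops about 1 deg C per 1% mismatch; the offset below Tm sets the
    stringency (about 5 deg C for stringent conditions, per the text).
    """
    return tm - percent_mismatch - offset_below_tm


# Hypothetical inputs: 1 M monovalent cations, 50% GC, no formamide, 100 bp hybrid.
tm = melting_temperature(1.0, 50.0, 0.0, 100)
print(round(tm, 1))
print(round(target_temperature(tm, percent_mismatch=10.0), 1))
```

If the resulting temperature falls below the practical limits noted above, the salt concentration passed as `molarity` can be raised and the calculation repeated.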
An example of highly stringent wash conditions is 0.15 M NaCl at 72° C. for about 15 minutes. An example of stringent wash conditions is a 0.2×SSC wash at 65° C. for 15 minutes. Often, a high stringency wash is preceded by a low stringency wash to remove background signal. An example of a medium stringency wash for a duplex of, e.g., more than 100 nucleotides, is 1×SSC at 45° C. for 15 minutes. Stringent conditions typically involve salt concentrations of less than about 1.5 M, more typically about 0.01 to 1.0 M, Na ion concentration (or other salts) at pH 7.0 to 8.3, and a temperature of at least about 30° C. for short nucleotide sequences (e.g., about 10 to 50 nucleotides) and at least about 60° C. for long oligonucleotides (e.g., >50 nucleotides). Stringent conditions also can be achieved by the addition of destabilizing agents such as formamide. In general, a signal to noise ratio of 2× (or higher) than that observed for an unrelated oligonucleotide in the particular hybridization assay indicates detection of a specific hybridization. Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the proteins that they encode are substantially identical. This can occur, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code.
Very stringent conditions can be equal to the Tm for a particular oligonucleotide. An example of stringent conditions for hybridization of complementary nucleic acids that have more than 100 complementary residues on a filter in a Southern or Northern blot is hybridization in 50% formamide, 1 M NaCl, 1% SDS at 37° C., and a wash in 0.1×SSC at 60 to 65° C. Exemplary low stringency conditions include hybridization with a buffer solution of 30 to 35% formamide, 1 M NaCl, 1% SDS (sodium dodecyl sulphate) at 37° C., and a wash in 1× to 2×SSC (20×SSC=3.0 M NaCl/0.3 M trisodium citrate) at 50 to 55° C. Exemplary moderate stringency conditions include hybridization in 40 to 45% formamide, 1.0 M NaCl, 1% SDS at 37° C., and a wash in 0.5× to 1×SSC at 55 to 60° C.
“Northern analysis” or “Northern blotting” is a method used to identify RNA sequences that hybridize to a known probe such as an oligonucleotide, DNA fragment, cDNA or fragment thereof, or RNA fragment. The probe can be labeled with a radioisotope such as 32P, by biotinylation, or with an enzyme. The RNA to be analyzed is usually electrophoretically separated on an agarose or polyacrylamide gel, transferred to nitrocellulose, nylon, or other suitable membrane, and hybridized with the probe, using standard techniques well known in the art.
A nucleic acid sample may be contacted with an oligonucleotide in any suitable manner known to those skilled in the art. For example, the DNA sample may be solubilized in solution, and contacted with the oligonucleotide by solubilizing the oligonucleotide in solution with the DNA sample under conditions that permit hybridization. Suitable conditions are well known to those skilled in the art. Alternatively, the DNA sample may be solubilized in solution with the oligonucleotide immobilized on a solid support, whereby the DNA sample may be contacted with the oligonucleotide by immersing the solid support having the oligonucleotide immobilized thereon in the solution containing the DNA sample.
The term “substrate” refers to any solid support to which an oligonucleotide may be attached. The substrate material may be modified, covalently or otherwise, with coatings or functional groups to facilitate binding of oligonucleotides. Suitable substrate materials include polymers, glasses, semiconductors, papers, metals, gels and hydrogels among others. Substrates may have any physical shape or size, e.g., plates, strips, or microparticles. The term “spot” refers to a distinct location on a substrate to which oligonucleotides of known sequence are attached. A spot may be an area on a planar substrate, or it may be, for example, a microparticle distinguishable from other microparticles. The term “bound” means affixed to the solid substrate. A spot is “bound” to the solid substrate when it is affixed in a particular location on the substrate for purposes of the screening assay.
In certain embodiments of the present disclosure, the substrate is a polymer, glass, semiconductor, paper, metal, gel or hydrogel. In certain embodiments of the present disclosure, a kit can further include a solid substrate and at least one control oligonucleotide, wherein the at least one control oligonucleotide is bound onto the substrate in a distinct spot.
In certain embodiments of the present disclosure, the solid substrate is a microarray. The terms “array” and “microarray” are used synonymously herein to refer to a plurality of primers or probes attached to one or more distinguishable spots on a substrate. A microarray may include a single substrate or a plurality of substrates, for example a plurality of beads or microspheres. A “copy” of a microarray contains the same types and arrangements of primers or probes.
Methods for Detecting or Predicting Cardiovascular Disease or Estimating Survival
Better risk assessment for, or earlier detection of, cardiovascular disease is the first step toward more effective prevention. Those identified as being at higher risk for CHD or CHD events (e.g., PPV of 69% for CHD), or as having CHD, can be followed up promptly with further testing, such as coronary calcium testing or angiography, and with more aggressive interventions. They can be tested or re-tested periodically to determine severity; to identify, customize, and optimize intervention(s) (e.g., lifestyle, medical, therapeutic); and for management and monitoring. Conversely, because DNA methylation is dynamic, those at lower risk (e.g., NPV of 99% for CHD) or those free of CHD can be re-tested periodically, managed, and monitored to ensure continued prevention, including to determine severity and to identify, customize, and optimize intervention(s) (e.g., lifestyle, medical, therapeutic). In addition, those with CVD (e.g., CHD) can be evaluated and their survivability estimated based on their genetic and/or epigenetic profile. It would be appreciated that, under some circumstances (e.g., for monitoring or follow-up purposes; for evaluating survivability of individuals at risk of or already determined to have CVD), the steps involved in detecting one or more SNPs may not be required, as that information may already be available (e.g., from a previous determination) or may not be necessary to predict CVD, survivability from CVD, or severity, or to identify, customize, and optimize intervention(s) or to manage an individual.
Compared to the integrated genetic-epigenetic model, conventional risk factor-based calculators and/or other detection tests, such as stress tests, were overall considerably less sensitive and less generalizable, and also exhibited a gender gap in performance. In contrast, the integrated genetic-epigenetic model described herein can capture the complex nature of CVD from three angles: genetics (inherited risk that is static), DNA methylation (acquired risk that is dynamic), and the genetic confounding of methylation signatures (i.e., G+M+G×M).
The present disclosure provides a method for determining whether a subject has a likelihood of having CVD by determining methylation status of a CpG dinucleotide repeat or CpG dinucleotide repeat motif region, where the methylation status of the CpG dinucleotide is associated with CVD. However, the same principles apply to the assessment of the prevalence and/or incidence of a number of different types of CVD including, without limitation, coronary heart disease (CHD) (e.g., obstructive CHD), stroke, arrhythmia, cardiac arrest, congestive heart failure, atherosclerotic cardiovascular disease (ASCVD) and its associated cardiovascular events (CVE) including, for example, obstructive coronary artery disease (CAD), ischemia with no obstructive coronary arteries (INOCA), myocardial infarction (MI), stroke (e.g., TIA, hemorrhagic), and cardiovascular death. The present disclosure also provides methods for determining severity and/or estimating the survival of a subject having or at risk of having CVD. In certain embodiments, the method determines the methylation status of a plurality (e.g., any integer between 1 and 10,000, such as at least 100) of CpG dinucleotides and/or SNPs.
As used herein, a “biological sample” encompasses essentially any sample type obtained from a subject that can be used in a method as described herein. The biological sample may be any bodily fluid, tissue or any other sample from which clinically relevant biomarker levels may be determined. “Biological samples” also can encompass cells in culture, cell supernatants, cell lysates, blood, serum, plasma, urine, cerebrospinal fluid, biological fluid, and tissue samples. Various techniques and reagents find use in the methods of the present disclosure. In one embodiment of the disclosure, blood samples, or samples derived from blood, e.g., plasma, circulating peripheral lymphocytes, etc., are assayed for the presence of one or more SNPs and/or the methylation status of one or more CpG dinucleotides. A biological sample also can be saliva. Typically, a biological sample that contains nucleic acids is provided and tested. Biological samples can be derived from subjects using well known techniques such as finger prick, venipuncture, lumbar puncture, collection of a fluid sample such as saliva or urine, tissue biopsy, and the like.
As used herein, the term “healthy” means that a subject does not manifest a particular condition and is no more likely than at random to be susceptible to a particular condition.
Prevalence is defined by the American Psychological Association (APA) as “the total number or percentage of cases (e.g., of a disease or disorder) existing in a population” (APA Dictionary of Psychology, (American Psychological Association, Washington, DC, 2007)). In some instances, point prevalence is used to describe the prevalence of cases at a discrete point of time, and period prevalence is used to describe the number of cases that exist for a period of time (e.g., a month, a year). Prevalence typically is expressed as a rate per population unit (e.g., number of cases per 100,000 people) instead of an absolute number or a percent.
Similarly, incidence is defined by the APA as “the rate of occurrence of new cases of a given event or condition (e.g., a disorder, disease, symptom, or injury) in a particular population in a given period” of time (APA Dictionary of Psychology, (American Psychological Association, Washington, DC, 2007)). As used herein, the term “incidence” is defined as a tendency or susceptibility for a subject to manifest a condition, in this case, CVD (e.g., CHD). In some instances, the period of time can be a year or less than a year; in some instances, the period of time can be longer than a year (e.g., two years, five years, ten years).
Diagnosis is defined by the APA as the “process of identifying and determining the nature of a disease or disorder by its signs and symptoms, through the use of assessment techniques (e.g., tests and examinations) and other available evidence” (APA Dictionary of Psychology, (American Psychological Association, Washington, DC, 2007)). A diagnosis can refer to the present time period, or to a time period in the past or the future.
Likewise, prognosis is defined by the APA as “a prediction of the course, duration, severity, and outcome of a condition, disease, or disorder” (APA Dictionary of Psychology, (American Psychological Association, Washington, DC, 2007)). A prognosis can be made, for example, over a period of one month, six months, one year, five years, ten years, or longer.
Risk assessment is defined as a “study of a subject done for the purpose of trying to determine the probability that that person will develop a particular disease or, if the disease is already present, the probability that the person will suffer exacerbation of it or death from it” (Youngson, 2005, Collins Dictionary of Medicine). In some instances, risk assessment is based on conditions or events and not on disease. In some instances, a risk assessment is determined over a period of time (e.g., months, years).
Survival or survivability, as used herein, typically refers to the number of days an individual has remaining to live, measured from the point in time the biological sample (e.g., a saliva sample, a blood draw, etc.) was obtained and used to make this determination.
It would be understood by a skilled artisan that, in certain instances, “detection” of, e.g., a biomarker (e.g., the presence or absence of a biomarker) can result in a “diagnosis.” Similarly, it would be understood by a skilled artisan that, in certain instances, “detection” and “diagnosis” are used interchangeably.
Biomarkers are described herein that can be used in methods (e.g., predictive or prognostic) of detecting CVD in a subject or estimating survival in a subject having CVD or at risk of having CVD. Such methods typically include providing a biological sample from the subject; contacting DNA from the biological sample with bisulfite under alkaline conditions; contacting the bisulfite-treated DNA with at least one first oligonucleotide primer at least 8 nucleotides in length that is complementary to a sequence that comprises a CpG dinucleotide (e.g., at a CG locus referred to as cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, or another biomarker from Appendix A); and determining the methylation status of the CpG dinucleotide. It would be understood that the at least one first oligonucleotide probe can detect either the unmethylated CpG dinucleotide or the methylated CpG dinucleotide. Such a method can further include determining the genotype of a single nucleotide polymorphism (SNP) (e.g., rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433, or another biomarker from Appendix C) or a second SNP in linkage disequilibrium with the first SNP. As described herein, methylation of one or more particular CpG dinucleotides and the presence of one or more particular SNPs can be used to predict CVD in the subject. Also as described herein, methylation of one or more particular CpG dinucleotides and/or the presence of one or more particular SNPs can be used to estimate survivability of CVD.
In some embodiments, the method further comprises contacting the bisulfite-treated DNA with at least one second oligonucleotide probe at least 8 nucleotides in length that is complementary to a sequence that comprises a CpG dinucleotide, where the at least one second oligonucleotide probe detects either the unmethylated CpG dinucleotide or the methylated CpG dinucleotide, whichever is not detected by the at least one first oligonucleotide probe.
In some embodiments, the ratio of methylated CpG dinucleotides to unmethylated CpG dinucleotides in the biological sample can be determined as a part of the methods described herein. Determining the ratio of methylated CpG dinucleotides to unmethylated CpG dinucleotides can allow for a risk or outcome to be estimated or determined.
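Simply by way of illustration, the ratio of methylated to unmethylated CpG dinucleotides at a single locus can be computed from per-locus counts (or signal intensities). The sketch below is a minimal example under stated assumptions: the function name, the input counts, and the choice to also report a percentage are all hypothetical and are not taken from the disclosure.

```python
def methylation_summary(methylated_count, unmethylated_count):
    """Summarize methylation at one CpG locus from two non-negative counts.

    Returns the methylated:unmethylated ratio and the percent methylated.
    The inputs are hypothetical read counts or probe signal intensities.
    """
    if methylated_count < 0 or unmethylated_count < 0:
        raise ValueError("counts must be non-negative")
    total = methylated_count + unmethylated_count
    if total == 0:
        raise ValueError("no signal observed at this locus")
    # A fully methylated locus has no unmethylated signal, so the ratio is unbounded.
    ratio = (methylated_count / unmethylated_count
             if unmethylated_count else float("inf"))
    return {
        "ratio": ratio,
        "percent_methylated": 100.0 * methylated_count / total,
    }


# Hypothetical locus with 30 methylated and 70 unmethylated observations.
summary = methylation_summary(30, 70)
print(summary)
```

The percentage form is bounded between 0 and 100, which can make it more convenient than the raw ratio as a model input.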
It would be appreciated that determining the methylation status of the one or more CpG dinucleotides and determining the presence (or absence) of a SNP can utilize any number of techniques, such as, for example, amplifying and/or sequencing steps. Amplifying and sequencing are well known techniques in the art and are used routinely to determine both the methylation status of a particular sequence and the presence/absence of a SNP.
Methods of determining the presence of biomarkers associated with CHD in a biological sample from a subject are provided. A similar approach can be used for any other form of CVD as well. Such methods typically include providing a first portion of the biological sample and contacting DNA from the first portion with bisulfite under alkaline conditions. The bisulfite-treated first portion can be contacted with a first oligonucleotide probe that is at least 8 nucleotides in length and that is complementary to a sequence that comprises a CpG dinucleotide (detected, e.g., at a CG locus referred to as cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, or another biomarker from Appendix A), and, if necessary or desired, a second portion of the biological sample can be contacted with a nucleic acid probe at least 8 nucleotides in length that is complementary to a SNP (e.g., rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433, or another biomarker from Appendix C).
As described herein, the percentage of methylation of the CpG dinucleotide at one or more of the CG loci designated cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 (or at a CpG dinucleotide that is in linkage disequilibrium with one or more of such CpG dinucleotides) and the identity of the nucleotide at one or more SNPs designated rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 (or at a SNP that is in linkage disequilibrium with one or more of such SNPs) are biomarkers associated with CVD and can be used to predict the likelihood that an individual will develop CVD and/or prognosticate as to the severity of the disease or the outcome (e.g., survival) for the individual.
In addition to the SNP and/or CpG biomarkers identified herein, one or more clinical indicators can be used to aid in diagnostics and/or prognostics; in the selection, customization, or optimization of interventions (e.g., lifestyle, therapeutic, medical); and in management or monitoring. Without limitation, such clinical indicators can include demographics (e.g., age, sex, race); vital signs (e.g., heart rate (beats/min), systolic BP (mm Hg), diastolic BP (mm Hg)); medical history (e.g., smoking, atrial fibrillation/flutter, hypertension, coronary heart disease, myocardial infarction, heart failure, peripheral artery disease, COPD, diabetes (type 1 or type 2), CVA/TIA, chronic kidney disease, hemodialysis, angioplasty (peripheral or coronary), stent (peripheral or coronary), CABG, percutaneous coronary intervention); medications (e.g., ACE-I/ARB, beta blocker, aldosterone antagonist, loop diuretics, nitrates, CCB, statin, aspirin, warfarin, clopidogrel); coronary computed tomography angiography results (e.g., anatomic stenosis, FFR-CT, plaque type, total plaque); echocardiographic results (e.g., LVEF (%), RVSP (mm Hg)); stress test results (e.g., ischemia on scan, ischemia on ECG); angiography results (e.g., ≥70% coronary stenosis in ≥2 vessels, ≥70% coronary stenosis in ≥3 vessels); and/or lab measures (e.g., sodium, blood urea nitrogen (mg/dL), creatinine (mg/dL), eGFR (median, CKD-EPI), total cholesterol (mg/dL), LDL cholesterol (mg/dL), ribitol, hemoglobin, hematocrit, triglycerides, alkaline phosphatase, HbA1c, HDL-C, non-HDL-C, ApoB, LDL-P1, HDL-P1, sdLDL-C, VLDL-C, Lp(a), hs-CRP, LpPLA2 activity, homocysteine, B-type natriuretic peptide, glycohemoglobin (%), glucose (mg/dL), HGB (mg/dL), C-reactive protein (mg/L), N-terminal pro B-type natriuretic peptide (NT-proBNP), kidney injury molecule-1 (KIM-1), osteopontin, tissue inhibitor of metalloproteinase-1 (TIMP-1), uridine, Carotene-3, 1-stearoyl-2-adrenoyl-GPC, N-acetyl-isoputreanine, lysophosphatidylcholine, vanillactic acid, 3-ureidopropionate, serum paraoxonase, bone morphogenetic protein 1, carboxypeptidase B2, albumin, histone H2B type 1-K, versican core protein, insulin-like growth factor-binding protein 2, matrix-remodeling-associated protein 5, etc.).
Kits for Detecting Cardiovascular Disease (CVD)
In a further embodiment of the disclosure, articles of manufacture and kits containing probes, oligonucleotides and/or antibodies are provided. Such articles of manufacture can be used in the methods described herein. An article of manufacture can include one or more containers with, for example, a label. Suitable containers include, for example, bottles, vials, and test tubes. The containers can be formed from a variety of materials such as glass or plastic. The container can hold a composition that includes one or more agents that are effective for practicing the methods described herein. The label on the container indicates that the composition can be used for a specific application. The kit of the disclosure will typically comprise the container described above and one or more other containers comprising materials desirable from a commercial and user standpoint, including buffers, diluents, filters, and package inserts with instructions for use.
In certain embodiments, the present disclosure provides a kit for determining the methylation status of at least one CpG dinucleotide and, in some cases, also the presence of at least one single-nucleotide polymorphism (SNP). In certain embodiments, a kit as described herein may contain a number of primers that is any integer between 1 and 10,000, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, . . . 9997, 9998, 9999, 10,000. As used herein, the terms “nucleic acid primer,” “nucleic acid probe,” and “oligonucleotide” encompass both DNA and RNA sequences. In certain embodiments, the primers or probes may be physically located on a single solid substrate or on multiple substrates.
A kit as described herein can include at least one first nucleic acid primer (e.g., at least 8 nucleotides in length) that is complementary to a bisulfite-converted nucleic acid sequence comprising a CpG dinucleotide (detected, e.g., at a CG locus referred to as cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584), and, in some instances, at least one second nucleic acid primer (e.g., at least 8 nucleotides in length) that is complementary to a SNP (e.g., rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433). The at least one first nucleic acid primer can detect the methylated or unmethylated CpG dinucleotide.
It would be appreciated that any of the nucleic acid primers, probes or oligonucleotides described herein can include one or more nucleotide analogs and/or one or more synthetic or non-natural nucleotides.
It also would be appreciated that the kits described herein can include a solid substrate. In some embodiments, one or more of the nucleic acid primers can be bound to the solid support. Examples of solid supports include, without limitation, polymers, glass, semiconductors, papers, metals, gels, or hydrogels. Additional examples of solid supports include, without limitation, microarrays, or microfluidics cards.
It also would be appreciated that any of the kits described herein can include one or more detectable labels. In some embodiments, one or more of the nucleic acid primers can be labeled with the one or more detectable labels. Representative detectable labels include, without limitation, an enzyme label, a fluorescent label, and a colorimetric label.
Algorithm for Predicting Cardiovascular Disease (CVD) or Estimating Survivability from CVD
Any number of algorithms can be used including, without limitation, statistical algorithms (e.g., linear regression, logistic regression, proportional hazard models, etc.), machine learning algorithms (e.g., Random Forest, Gradient Boosting, Support Vector Machines, Neural Networks (e.g., deep neural network, extreme learning machine (ELM)), Bayes classifiers, Hidden Markov model, etc.), deep learning algorithms (e.g., convolutional neural networks, recurrent neural networks, autoencoders, large language model, etc.), time series algorithms (e.g., ARIMA, etc.), Bayesian model algorithms (e.g., Bayesian Networks, etc.), and/or financial algorithms (e.g., decision tree, discrete event simulation, budget impact, etc.). See, for example, McKinney et al., 2011, Appl. Bioinform., 5(2):77-88; Gunther et al., 2012, BMC Genet., 13:37; and Ogutu et al., 2011, BMC Proceedings, 5(Suppl 3):S11. Any type of machine learning algorithm or deep learning neural network algorithm (tuned or non-tuned) capable of capturing linear and/or non-linear contribution of traits for the prediction can be used. In some instances, a combination of algorithms (e.g., a combination or ensemble of multiple algorithms that capture linear and/or non-linear contributions of traits) is used.
Furthermore, algorithm(s) can implement any one or more of: a regression algorithm, an instance-based method (e.g., k-nearest neighbor, learning vector quantization, self-organizing map, etc.), a regularization method, a decision tree learning method (e.g., classification and regression tree, chi-squared approach, Random Forest approach, multivariate adaptive approach, gradient boosting machine approach, etc.), a Bayesian method (e.g., naïve Bayes, Bayesian belief network, etc.), a kernel method (e.g., a support vector machine, a linear discriminant analysis, etc.), a clustering method (e.g., k-means clustering), an association rule learning algorithm (e.g., an Apriori algorithm), an artificial neural network model (e.g., a back-propagation method, a Hopfield network method, a learning vector quantization method, etc.), a deep learning algorithm (e.g., a Boltzmann machine, a convolutional network method, a stacked auto-encoder method, etc.), a dimensionality reduction method (e.g., principal component analysis, partial least squares regression, etc.), an ensemble method (e.g., boosting, bootstrapped aggregation, gradient boosting machine approach, etc.), and any other suitable algorithm.
Simply by way of example, Random Forest™ is a popular machine learning algorithm created by Breiman & Cutler for generating “classification trees” (see, for example, “stat.berkeley.edu/~breiman/RandomForests/cc_home.htm” on the World Wide Web). Using standard machine learning and predictive modeling techniques, a diagnostic classifier algorithm was written to be implemented in the R and Python programming languages (though it can be implemented in many other programming languages), according to well described guidelines by Breiman & Cutler. A diagnostic classifier algorithm was generated using data from at least two traits (T) and the diagnosis of interest from a population. To determine the output (e.g., diagnosis) for a new individual, one simply determines values for the at least two traits (T) and inputs that information into an algorithm (e.g., the diagnostic classifier algorithm described herein or another algorithm discussed above) that is capable of capturing the linear and non-linear contributions of the traits.
Prior to fitting model(s), input data can be conditioned or otherwise pre-processed, such that conditioned data elements (e.g., genomic reads associated with loci of interest, functional data, sensor data, other lifestyle data, etc.) are suitable for further processing. Conditioning, as described herein, can include filtering of data (e.g., removing sensor data outputs that have confidence values below a threshold, etc.). Pre-processing step(s), such as dimensionality reduction (e.g., principal component analysis, linear discriminant analysis, autoencoders, uniform manifold approximation and projection, partial least squares regression, etc.), can be applied to the model input(s) before training model(s).
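Simply by way of illustration, the filtering form of conditioning described above can be sketched as follows. The record layout (`marker`, `value`, `confidence`), the marker names, and the 0.9 threshold are hypothetical and chosen only for the example.

```python
def condition_records(records, min_confidence=0.9):
    """Keep only measurements whose confidence meets the threshold.

    Each record is a dict with hypothetical keys: marker, value, confidence.
    """
    return [r for r in records if r["confidence"] >= min_confidence]


# Hypothetical raw measurements (illustrative values only).
raw = [
    {"marker": "cpg_a", "value": 0.62, "confidence": 0.98},
    {"marker": "cpg_b", "value": 0.15, "confidence": 0.71},
    {"marker": "cpg_c", "value": 0.40, "confidence": 0.93},
]
kept = condition_records(raw)
print([r["marker"] for r in kept])
```

Records removed at this step could instead be flagged for re-measurement rather than discarded, depending on the workflow.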
The uncertainty of the model prediction(s) (e.g., risk score, diagnostics, etc.) can be computed. Methods for estimating this uncertainty may include bootstrap methods, Bayesian methods, Monte Carlo dropout, ensemble methods, sensitivity analysis, etc. The uncertainty of the method and/or model described herein can be aggregated to provide overall system uncertainty. Uncertainty quantification can encompass all aspects of marker measurements. For instance, in epigenetic measurement, uncertainty quantification may include, but is not limited to, sampling error, reagent quality, instrumentation, and human error.
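Simply by way of illustration, one of the bootstrap methods mentioned above, a percentile bootstrap interval, can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the replicate scores, the number of resamples, and the fixed seed are all hypothetical, and the disclosure does not specify this particular variant.

```python
import random
import statistics


def bootstrap_interval(values, statistic=statistics.mean,
                       n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical replicate risk scores for one individual (illustrative values only).
scores = [0.41, 0.44, 0.39, 0.47, 0.42, 0.45, 0.40, 0.43]
low, high = bootstrap_interval(scores)
print(f"mean={statistics.mean(scores):.3f}, interval=({low:.3f}, {high:.3f})")
```

The same resampling loop could wrap a model refit instead of a simple mean, at a correspondingly higher computational cost.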
The uncertainty quantification described herein can be used to perform quality control. For example, known reference sample(s) and/or measurement(s) can be compared against the new measurement(s). The difference between the known and new measurement(s), along with their uncertainty, can be used to evaluate error(s) and/or uncertainty(ies) and how they compare to a defined acceptable threshold. This process assists in identifying sources of errors and/or uncertainties and in minimizing such error(s) and/or uncertainty(ies) to ensure the reliability of the measurement(s). For example, it can be used to determine if a sample(s) requires re-measurement(s) or re-collection(s) to meet a defined acceptable threshold for the measurement(s) and/or marker(s).
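Simply by way of illustration, a quality-control comparison of a new measurement against a known reference can be sketched as follows. The sketch assumes, for the purposes of the example only, that each uncertainty is an independent standard deviation, that the two are combined in quadrature, and that a coverage factor of 2 is applied; the function name, values, and threshold are hypothetical.

```python
import math


def qc_check(reference, reference_sd, measured, measured_sd,
             max_abs_difference, coverage=2.0):
    """Compare a new measurement to a reference, accounting for uncertainty.

    Passes only if the absolute difference plus `coverage` combined standard
    deviations stays within the acceptable threshold.
    """
    difference = measured - reference
    # Assumption: independent uncertainties, combined in quadrature.
    combined_sd = math.sqrt(reference_sd ** 2 + measured_sd ** 2)
    passes = abs(difference) + coverage * combined_sd <= max_abs_difference
    return {"difference": difference, "combined_sd": combined_sd, "passes": passes}


# Hypothetical methylation fractions for a reference sample and two new runs.
ok = qc_check(0.50, 0.01, 0.52, 0.01, max_abs_difference=0.05)
bad = qc_check(0.50, 0.01, 0.58, 0.01, max_abs_difference=0.05)
print(ok["passes"], bad["passes"])
```

A failing result could trigger the re-measurement or re-collection step described above.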
Returned classification regression and/or other outputs of model(s) can include returned confidence-associated parameters in such classifications. In particular, confidence-associated parameters can have a score (e.g., percentile, other score) that indicates confidence in the returned output. The confidence may be estimated by aggregating measurement and or modeling uncertainties. Transforming output data also can be used to enhance the expandability of the models described previously. For example, SHapley Additive exPlanations (SHAP), Local Interpretable Model-agnostics Explanations, Integrated Gradients, Partial Dependence Plots, Global Surrage Models, etc., can be used to enhance expandability of the output for users. One or more of these approaches can be utilized concurrently. In specific examples, SHAP can be utilized to identify the most significant contributing marker(s) to conditions or indication.
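To illustrate the kind of attribution that SHAP produces, the sketch below computes exact Shapley values for a toy two-marker model. It does not use the SHAP library; it enumerates every feature subset directly, which is feasible only for a handful of features. The model, its coefficients, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in the subset take their observed value; others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    # Hypothetical score with two main effects and one interaction.
    cpg, snp = z
    return 0.2 * cpg + 0.1 * snp + 0.3 * cpg * snp

phi = shapley_values(toy_model, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

The attributions sum to the difference between the prediction for the individual and the prediction for the baseline, and the interaction term is shared equally between the two markers.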
Additionally, or alternatively, dynamic aspects (e.g., changes over time in markers, changes in frequency between instances of respective features, other temporal aspects, other frequency-related aspects, etc.) of features derived from the samples can be used to predict or otherwise anticipate health condition statuses to generate personalized intervention plans.
Samples can be collected once (e.g., at a single time point), or at a number of time points (e.g., at random points, at regular points, in relation to triggering events, with other frequency, etc.).
As described herein, the inputs can be at least one genotype (e.g., SNP) and/or the methylation status of at least one CpG dinucleotide and/or other data, and the outcome can represent a positive or a negative probability for CVD, however, severity also can be evaluated. The Traits (T) used to determine the outcome can represent the methylation status of at least one CpG dinucleotide or at least one genotype (e.g., of a SNP), but Traits (T) also can correspond to at least one interaction (e.g., between methylation status and genotype (CpG×SNP), between the methylation status of two different sites (CpG×CpG) or between two different genotypes (SNP×SNP)). It would be appreciated that any such interactions can be visualized using partial dependence plots.
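For illustration only, main-effect and pairwise-interaction Traits could be assembled as in the following sketch. The methylation values, the coding of each genotype as an allele count of 0, 1, or 2, and the product form of each interaction are assumptions of the example.

```python
from itertools import combinations

def build_traits(cpg, snp):
    """Return main effects plus all pairwise products as a name -> value dict."""
    main = {**cpg, **snp}
    traits = dict(main)
    for a, b in combinations(sorted(main), 2):
        # Covers CpG x CpG, CpG x SNP, and SNP x SNP pairs alike.
        traits[f"{a}x{b}"] = main[a] * main[b]
    return traits

# Hypothetical values for one individual.
traits = build_traits(
    cpg={"cg04988978": 0.62, "cg21161138": 0.35},
    snp={"rs2869675": 1, "rs1333048": 2},
)
```

With four main-effect inputs, the sketch yields six pairwise interaction Traits in addition to the four main effects.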
The inputs also can be data or information (e.g., a dataset) including, without limitation, clinical diagnoses; demographics (e.g., gender, race); lifestyle; imaging (e.g., cardiac CT scan, cardiac MRI, coronary angiogram); features derived from imaging (e.g., FFR from CT scan, percent stenosis); results from electrocardiogram or echocardiogram test; results from stress tests; blood tests (e.g., for metabolic assays, genetic, epigenetic, protein, etc.); blood pressure; results from a carotid ultrasound; and combinations thereof.
A dataset can include, without limitation, data derived from one or more of: body weight (e.g., receiving bodyweight values of the patients generated from a digital weighing scale), body fat percent, muscle mass, body water, height or other length measurements (e.g., via a ruler or measuring tape), other body mass index (BMI)-associated parameters, blood chemical and biochemical information, inflammatory markers, fasting blood sugar, high density lipids, low density lipids, blood interleukins, c-reactive protein, blood cell counts, electrophysiology signals (e.g., electroencephalogram signals, electromyography signals, galvanic skin response signals, electrocardiogram signals, etc.), heart rate, body temperature, cardiovascular parameters, continuous glucose monitoring (glycemic response), respiration parameters (e.g., respiration rate, depth/shallowness of breath, etc.), blood oxygenation signals, motion parameters, and any other suitable physiologically relevant parameter of the patient. Additionally, or alternatively, a dataset can include data derived from one or more of: electronic health records, health plan claims, questionnaires, survey, wearables, public sources (e.g., repositories) and any other direct or indirect data relevant to an individual. Additionally, or alternatively, a dataset can include data that is raw, imputed, transformed, longitudinal, cross-sectional or temporal.
In the illustrated example, a subject 101 provides a subject sample 102. In some embodiments, the subject sample 102 can be a blood sample, a saliva sample, a mucus sample, a urine or stool sample, or any other appropriate biological sample from the subject 101. In some embodiments, medical personnel 103 (e.g., a doctor, a nurse, a lab technician, a caregiver) may assist the subject 101 with obtaining the subject sample 102. In some embodiments, the subject 101 may obtain the subject sample 102 from herself or himself (e.g., by using a portable blood sampling device or a home collection kit).
A nucleic acid isolation module 110 isolates a nucleic acid sample 112 from the subject sample 102. In some embodiments, the nucleic acid isolation module 110 can be a manual, semi-automated, or automatic process that performs one or more of cell lysis, removal of contaminating proteins, deactivating DNases and/or RNases, and recovery of DNA and/or RNA. For example, the nucleic acid isolation module 110 can be a part of an automated process or analysis device configured to isolate the nucleic acid sample 112 from the subject sample 102. In another example, the nucleic acid isolation module 110 can be part of one or more of the example kits described in this document, to be used by a human user such as the medical personnel 103.
A genotyping assay module 120 receives a portion 114a of the nucleic acid sample 112. The genotyping assay module 120 is configured to perform a genotyping assay on the portion 114a of the nucleic acid sample 112 to detect the presence of at least one SNP, wherein the at least one SNP is a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C and/or is a second SNP in linkage disequilibrium (R>0.3) with a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C to determine, identify, or otherwise obtain a collection of genotype data 122. In some embodiments, the genotyping assay module 120 can be a manual, semi-automated, or automatic process. For example, genotyping assay module 120 can be a part of an automated process or analysis device configured to perform a genotyping assay on the portion 114a. In another example, the genotyping assay module 120 can be part of one or more of the example kits described in this document, to be used by a human user such as the medical personnel 103 or a laboratory technician.
A methylation assay module 130 receives a portion 114b of the nucleic acid sample 112. The methylation assay module 130 is configured to bisulfite convert the nucleic acid in the portion 114b of the nucleic acid sample 112 and perform methylation assessment on the portion 114b of the nucleic acid sample 112 to detect methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to determine, identify, or otherwise obtain a collection of methylation data 132.
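As a hedged illustration of how collinear CpG sites might be screened, the sketch below computes a Pearson correlation between an index site and each candidate site across a set of samples and keeps candidates exceeding 0.3. The use of Pearson correlation as the measure of collinearity, the candidate site names, and the values are assumptions of the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_sites(index_values, candidates, threshold=0.3):
    """Names of candidate sites whose correlation with the index site exceeds the threshold."""
    return [
        name for name, vals in candidates.items()
        if pearson_r(index_values, vals) > threshold
    ]

# Hypothetical methylation values across five samples.
index_site = [0.20, 0.35, 0.50, 0.65, 0.80]
candidates = {
    "site_a": [0.22, 0.33, 0.55, 0.60, 0.85],  # tracks the index site
    "site_b": [0.70, 0.40, 0.65, 0.45, 0.60],  # does not
}
selected = collinear_sites(index_site, candidates)
```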
An identification system 140 is configured to receive the collection of genotype data 122 and the collection of methylation data 132 and identify one or more predetermined traits or characteristics of the subject 101 based on a diagnostic classifier algorithm module 142. The diagnostic classifier algorithm module 142 is configured to account for at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect. In some embodiments, the diagnostic classifier algorithm module 142 can perform one or more of the algorithms described herein that may indicate the presence of disease (e.g., diagnostic indicators) or a propensity to develop disease (e.g., predict) or the severity of disease or the selection, customization, or optimization of one or more intervention(s) (e.g., lifestyle, medical, therapeutic) or effectiveness of management or the monitoring of disease, severity or risk. For example, the identification system may be configured to identify genetic and/or environmental characteristics that determine the presence of or the likelihood of a subject developing disease (e.g., cardiovascular disease), even when the disease is of polygenic origin. In some implementations, the diagnostic classifier algorithm module 142 can be a machine learning algorithm capable of accounting for linear and non-linear effects.
The identification system 140 provides an output 150 based on the diagnostic and/or prognostic indicators provided by the diagnostic classifier algorithm module 142. In some embodiments, the identification system 140 can include an output module configured to provide the output 150. In some implementations, the output 150 can be an identification of one or more diseases that the subject 101 may already have. For example, the output 150 may indicate that traits that are indicative of the presence of cardiovascular disease were found in the subject 101. In some implementations, the output 150 can be an indication of a likelihood that the subject 101 may develop a disease within a predetermined time frame (e.g., the subject 101 may have a 43% chance of developing cardiovascular disease within 3 years, the subject 101 may have a 77% chance of having a heart attack within 2 years). In some implementations, the output 150 can include therapeutic and/or preventative recommendations based on the diagnostic and/or prognostic indicators provided by the diagnostic classifier algorithm module 142. For example, in response to an identification or prediction of a diabetic or cardiac condition in the subject 101, the output 150 may include a recommendation to consult with the medical personnel 103, identify possible dietary or lifestyle changes by the subject 101 to address or avoid the condition, identify potential interventions and/or remedies for the subject 101 to consider in consultation with the medical personnel 103, or combinations of these and/or any other appropriate information based on the output of the algorithm(s) of the diagnostic classifier algorithm module 142. In some instances, the output 150 can be an estimate of survivability of a subject at risk for or that has been determined to have CVD.
In the illustrated example, the output 150 is provided in various formats. The information provided by the output 150 can be formatted into a message 160 that is provided to the subject 101 and/or to the medical personnel 103. In some implementations, the message 160 can be formatted as a report (e.g., a word processing file, a portable document format file) that is at least temporarily stored on a non-transitory storage medium (e.g., a hard drive, a FLASH memory), where it can be retrieved by the subject 101 and/or the medical personnel 103 for review. In some implementations, the message 160 can be formatted as an electronic message (e.g., an email, a text message, an instant message) that is transmitted to the subject 101 and/or the medical personnel 103 for review. In some implementations, the message 160 can be a printed report. For example, the output 150 can be provided to a printing system that is configured to generate a hard copy report based on the output 150. Subsequent automated or manual processing systems can package the report as a letter or other parcel that can be sent for physical delivery to the subject 101 and/or to the medical personnel 103 (e.g., the system 100 can create a paper printout of the results and mail them through postal mail).
A treatment device 170 can be configured to receive the diagnostic and/or prognostic indicators provided by the output 150 and provide interventions (e.g., lifestyle, therapeutic, medical) based on the diagnostic and/or prognostic indicators. For example, the output 150 may indicate that the subject 101 has a high likelihood of suffering cardiac arrest within the next two years, and the treatment device 170 may be a drug (e.g., a tablet or capsule) or an implantable drug delivery system that reacts by identifying or by receiving configuration settings for an appropriate dosage of a statin, acetylsalicylic acid (aspirin), an anti-inflammatory drug, a blood thinner, or combinations of these and/or any other appropriate therapeutic and/or preventative substances. In some embodiments, the treatment device 170 can be configured to also include one or more of the nucleic acid isolation module 110, the genotyping assay module 120, the methylation assay module 130, or the identification system 140.
A storage system 180 is configured to store the output 150. For example, the information included in the output 150 can be stored temporarily, for a predetermined period of time, or substantially permanently in a database, in a file, or as any other appropriate collection of data. In some embodiments, the storage system 180 can store the output 150 in a non-transitory storage medium (e.g., a hard drive, a FLASH memory). For example, the storage system 180 may store some or all of the collection of genotype data 122, the collection of methylation data 132, and/or the output 150 in a personal health record that the subject 101 can store or carry with them. In some embodiments, the storage system 180 can store the output 150 as a physical medium, for example, the storage system 180 can include a printer that can generate a paper report based on the output 150, and/or store the report as a hard copy that can be physically filed away for later retrieval.
An input/output device 182 is a physical device configured to display or otherwise present an output that is perceptible to humans (e.g., the subject 101, the medical personnel 103). For example, the input/output device 182 may be an electronic display device in a doctor's office. The system 100 may process the subject sample 102, and then alter the configuration of pixels onscreen to modify the information displayed by the input/output device 182 based on the output 150 (e.g., a screen can be updated to display an identified diagnosis and/or prognosis for the subject 101 to the medical personnel 103). In another example, the input/output device 182 can be configured to provide audible (e.g., spoken output) and/or tactile (e.g., braille, haptic, vibratory) output that modifies or otherwise transforms the output 150 into a physical and/or tangible output (e.g., to convey the diagnostic and/or prognostic indicators in a manner that is perceptible to a user who is sight-challenged). In another example, the input/output device 182 can be configured to alter, transform, or modify a physical characteristic of a physical structure or medium based on the output 150.
A user device 184 (e.g., a computer, a smartphone, a tablet computer, a computerized terminal) is configured to display, emit, or otherwise present one or more outputs that are perceptible to a human user, such as the subject 101 and/or the medical personnel 103. For example, the user device 184 can receive the output 150 (e.g., as data, as the message 160) and provide an alert to the user and/or provide an output (e.g., display a report, read a report aloud) based on the output 150. In some embodiments, the user device 184 can include one or more of the storage system 180 or the input/output device 182. In some embodiments, the user device 184 can be part of the treatment device 170. In some embodiments, the user device 184 can be configured to include one or more of the nucleic acid isolation module 110, the genotyping assay module 120, the methylation assay module 130, or the identification system 140.
In some implementations, some or all of the system 100 may be reused to provide additional information. For example, the system 100 may be used to gather an initial set of health information for the subject 101 and/or identify information that can assist the medical personnel 103 with an initial diagnosis/prognosis. Later, the subject 101 may be re-examined using the system 100, for example, to determine the effectiveness of prescribed medical and/or lifestyle strategies over time. Since the collection of genotype data 122 does not change over time for an individual person, the system 100 may refrain from performing the functions of the genotyping assay module 120 again. In such examples, the methylation assay module 130 may be used to generate an updated version of the collection of methylation data 132, and the updated collection of methylation data 132 can be provided to the identification system 140 for processing along with the collection of genotype data 122 that was previously generated. In some implementations, the subject sample 102 can be collected on a periodic basis and processed based on the existing collection of genotype data 122 and updated collections of methylation data 132 to produce updated outputs 150 that can be used to provide ongoing monitoring of one or more conditions identified for the subject 101.
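The reuse of previously generated genotype data alongside updated methylation data can be sketched as follows. The `SubjectRecord` class, its field names, and the example values are hypothetical and serve only to illustrate the data flow.

```python
from dataclasses import dataclass, field

@dataclass
class SubjectRecord:
    """Genotype data is stored once; methylation data is appended per visit."""
    genotype: dict
    methylation_history: list = field(default_factory=list)

    def add_visit(self, methylation):
        # Only the methylation assay is repeated at follow-up.
        self.methylation_history.append(dict(methylation))

    def latest_inputs(self):
        # Combine the stored genotype with the most recent methylation data.
        return {**self.genotype, **self.methylation_history[-1]}

record = SubjectRecord(genotype={"rs1333048": 2})
record.add_visit({"cg04988978": 0.62})   # initial assessment
record.add_visit({"cg04988978": 0.55})   # later re-examination
inputs = record.latest_inputs()
```

Each call to `latest_inputs` pairs the unchanged genotype with the newest methylation values, which mirrors the abbreviated follow-up process described above.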
At 210, a nucleic acid sample is isolated from a subject sample. For example, the example nucleic acid isolation module 110 can be configured to isolate and/or substantially purify nucleic acid compositions from the example subject sample 102 to produce the example nucleic acid sample 112.
At 220, a genotyping assay is performed on a first portion of the nucleic acid sample to detect the presence of at least one SNP, wherein the at least one SNP is a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C and/or is a second SNP in linkage disequilibrium (R>0.3) with a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C to obtain genotype data. For example, the example genotyping assay module 120 could be used to analyze the example portion 114a of the nucleic acid sample 112 to produce the example collection of genotype data 122.
At 230, a second portion of the nucleic acid sample is bisulfite converted, and a methylation assessment is performed on the second portion of the nucleic acid sample to detect methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data. For example, the example methylation assay module 130 can be used to process the portion 114b of the nucleic acid sample 112 to produce the example collection of methylation data 132.
At 240, the genotype data from step 220 and/or methylation data from step 230 is input into an algorithm. For example, the example collection of genotype data 122 and the example collection of methylation data 132 are input into the example identification system 140 and processed using the example diagnostic classifier algorithm module 142.
At 250, at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect are accounted for. For example, the example diagnostic classifier algorithm module 142 can be configured to account for at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect. In some implementations, the diagnostic classifier algorithm module 142 can be a machine learning algorithm capable of accounting for linear and non-linear effects.
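One simple way to account for main effects and an interaction effect is a logistic model with a product term, sketched below. The logistic form is used here only as a stand-in for whichever algorithm the module actually implements, and the coefficients are hypothetical rather than fitted values.

```python
import math

def risk_probability(cpg, snp, coef):
    """Logistic model with CpG and SNP main effects and a CpG x SNP interaction."""
    linear = (
        coef["intercept"]
        + coef["cpg"] * cpg            # CpG main effect
        + coef["snp"] * snp            # SNP main effect
        + coef["cpg_x_snp"] * cpg * snp  # interaction effect
    )
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for illustration only.
coef = {"intercept": -2.0, "cpg": 1.5, "snp": 0.4, "cpg_x_snp": 0.8}
p = risk_probability(cpg=0.6, snp=1, coef=coef)
p_more_alleles = risk_probability(cpg=0.6, snp=2, coef=coef)
```

Because of the product term, the change in predicted probability per additional allele depends on the methylation level, which is the behavior an interaction effect is intended to capture.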
At 260, an output is provided. For example, the example identification system 140 can provide the example output 150.
At 270, another nucleic acid sample is isolated from another sample from the subject. For example, the example nucleic acid isolation module 110 can be configured to isolate and/or substantially purify nucleic acid compositions from another sample to produce another example nucleic acid sample. Since the collection of genotype data 122 from a subject does not change over time, the newly produced nucleic acid sample can be used to obtain methylation data 132, which is used along with the existing collection of genotype data 122 to provide an updated output (e.g., to perform a checkup on the subject 101 at a later point in time). In some implementations, this abbreviated process can be performed on a periodic or semi-periodic basis to provide ongoing monitoring of one or more medical conditions identified for the subject 101.
Computing device 300 includes a processor 302, a memory 304, a storage device 306, a high-speed interface 308 connecting to memory 304 and high-speed expansion ports 310, and a low-speed interface 312 connecting to a low-speed bus 314 and storage device 306. Each of the components 302, 304, 306, 308, 310, and 312, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 302 can process instructions for execution within the computing device 300, including instructions stored in the memory 304 or on the storage device 306 to display graphical information for a GUI on an external input/output device, such as display 316 coupled to high-speed interface 308. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 300 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 304 stores information within the computing device 300. In one implementation, the memory 304 is a computer-readable medium. In one implementation, the memory 304 is a volatile memory unit or units. In another implementation, the memory 304 is a non-volatile memory unit or units.
The storage device 306 can provide mass storage for the computing device 300. In one implementation, the storage device 306 is a computer-readable medium. In various implementations, the storage device 306 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 304, the storage device 306, or memory on processor 302.
The high-speed controller 308 manages bandwidth-intensive operations for the computing device 300, while the low-speed controller 312 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In one implementation, the high-speed controller 308 is coupled to memory 304, display 316 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 310, which may accept various expansion cards (not shown). In the implementation, low-speed controller 312 is coupled to storage device 306 and low-speed expansion port 317 through the low-speed bus 314. The low-speed expansion port, which may include various communication ports (e.g., Universal Serial Bus (USB), BLUETOOTH, BLUETOOTH Low Energy (BLE), Ethernet, wireless Ethernet (WiFi), High-Definition Multimedia Interface (HDMI), ZIGBEE, visible or infrared transceivers, Infrared Data Association (IrDA), fiber optic, laser, sonic, ultrasonic) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, a networking device such as a gateway, modem, switch, or router, e.g., through a network adapter 313.
Peripheral devices can communicate with the high-speed controller 308 through one or more peripheral interfaces of the low-speed controller 312, including but not limited to a USB stack, an Ethernet stack, a WiFi radio, a BLUETOOTH Low Energy (BLE) radio, a ZIGBEE radio, an HDMI stack, and a BLUETOOTH radio, as is appropriate for the configuration of a sensor. For example, a sensor that outputs a reading over a USB cable can communicate through a USB stack.
The network adapter 313 can communicate with a network 315. Computer networks typically have one or more gateways, modems, routers, media interfaces, media bridges, repeaters, switches, hubs, Domain Name Servers (DNS), and Dynamic Host Configuration Protocol (DHCP) servers that allow communication between devices on the network and devices on other networks (e.g., the Internet). One such gateway can be a network gateway that routes network communication traffic among devices within the network and devices outside of the network. One common type of network communication traffic that is routed through a network gateway is a Domain Name Server (DNS) request, which is a request to the DNS to resolve a uniform resource locator (URL) or uniform resource identifier (URI) to an associated Internet Protocol (IP) address.
The network 315 can include one or more networks. The network(s) may provide for communications under various modes or protocols, such as Global System for Mobile communication (GSM) voice calls, Short Message Service (SMS), Enhanced Messaging Service (EMS), or Multimedia Messaging Service (MMS) messaging, Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Personal Digital Cellular (PDC), Wideband Code Division Multiple Access (WCDMA), CDMA2000, General Packet Radio Service (GPRS), or one or more television or cable networks, among others. For example, the communication may occur through a radio-frequency transceiver. In addition, short-range communication may occur, such as using a BLUETOOTH, BLE, ZIGBEE, WiFi, IrDA, or other such transceiver.
In some embodiments, the network 315 can have a hub-and-spoke network configuration. A hub-and-spoke network configuration can allow for an extensible network that can accommodate components being added, removed, failing, and replaced. This can allow, for example, more, fewer, or different devices on the network 315. For example, if a device fails or is deprecated by a newer version of the device, the network 315 can be configured such that network adapter 313 can be updated about the replacement device.
In some embodiments, the network 315 can have a mesh network configuration (e.g., ZIGBEE). Mesh configurations may be contrasted with conventional star/tree network configurations in which the networked devices are directly linked to only a small subset of other network devices (e.g., bridges/switches), and the links between these devices are hierarchical. A mesh network configuration can allow infrastructure nodes (e.g., bridges, switches, and other infrastructure devices) to connect directly and non-hierarchically to other nodes. The connections can dynamically self-organize and self-configure to route data. By not relying on a central coordinator, multiple nodes can participate in the relay of information. In the event of a failure of one or more of the nodes or the communication links between them, the mesh network can self-configure to dynamically redistribute workloads and provide fault-tolerance and network robustness.
The computing device 300 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 320, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 324. It may also be implemented as part of a network device such as a modem, gateway, router, access point, repeater, mesh node, switch, hub, or security device (e.g., camera server). In addition, it may be implemented in a personal computer such as a laptop computer 322. Alternatively, components from computing device 300 may be combined with other components in a mobile device (not shown), such as device 350. In some embodiments, the device 350 can be a mobile telephone (e.g., a smartphone), a handheld computer, a tablet computer, a network appliance, a camera, an enhanced general packet radio service (EGPRS) mobile phone, a media player, a navigation device, an email device, a game console, an interactive or smart television, a media streaming device, or a combination of any two or more of these data processing devices or other data processing devices. In some implementations, the device 350 can be included as part of a motor vehicle (e.g., an automobile, an emergency vehicle (e.g., fire truck, ambulance), a bus). Each of such devices may contain one or more of computing device 300, 350, and an entire system may be made up of multiple computing devices 300, 350 communicating with each other through a low-speed bus or a wired or wireless network.
Computing device 350 includes a processor 352, memory 364, an input/output device such as a display 354, a communication interface 366, and a transceiver 368, among other components. The device 350 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 350, 352, 364, 354, 366, and 368, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
The processor 352 can process instructions for execution within the computing device 350, including instructions stored in the memory 364. The processor may also include separate analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 350, such as control of user interfaces, applications run by device 350, and wireless communication by device 350.
Processor 352 may communicate with a user through control interface 358 and display interface 356 coupled to a display 354. The display 354 may be, for example, a TFT LCD display or an OLED display, or other appropriate display technology. The display interface 356 may comprise appropriate circuitry for driving the display 354 to present graphical and other information to a user. The control interface 358 may receive commands from a user and convert them for submission to the processor 352. In addition, an external interface 362 may be provided in communication with processor 352, so as to enable near area communication of device 350 with other devices. External interface 362 may provide, for example, for wired communication (e.g., via a docking procedure) or for wireless communication (e.g., via Bluetooth or other such technologies).
The memory 364 stores information within the computing device 350. In one implementation, the memory 364 is a computer-readable medium. In one implementation, the memory 364 is a volatile memory unit or units. In another implementation, the memory 364 is a non-volatile memory unit or units. Expansion memory 374 may also be provided and connected to device 350 through expansion interface 372, which may include, for example, a SIMM card interface. Such expansion memory 374 may provide extra storage space for device 350 or may also store applications or other information for device 350. Specifically, expansion memory 374 may include instructions to carry out or supplement the processes described above and may include secure information also. Thus, for example, expansion memory 374 may be provided as a security module for device 350 and may be programmed with instructions that permit secure use of device 350. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
The memory may include for example, flash memory and/or MRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 364, expansion memory 374, or memory on processor 352.
Device 350 may communicate wirelessly through communication interface 366, which may include digital signal processing circuitry where necessary. Communication interface 366 may provide for communications under various modes or protocols, such as GSM voice calls, Voice Over LTE (VOLTE) calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, GPRS, WiMAX, LTE, 5G, among others. Such communication may occur, for example, through radio-frequency transceiver 368. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown) configured to provide uplink and/or downlink portions of data communication. In addition, GPS receiver module 370 may provide additional wireless data to device 350, which may be used as appropriate by applications running on device 350.
Device 350 may also communicate audibly using audio codec 360, which may receive spoken information from a user and convert it to usable digital information. Audio codec 360 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 350. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 350.
The computing device 350 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 380. It may also be implemented as part of a smartphone 382, personal digital assistant, or other similar mobile device.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
Some communication networks can be configured to carry power as well as information on the same physical media. This allows a single cable to provide both data connection and electric power to devices. Examples of such shared media include power over network configurations in which power is provided over media that is primarily or previously used for communications. One specific embodiment of power over network is Power Over Ethernet (POE), which passes electric power along with data on twisted pair Ethernet cabling. Examples of such shared media also include network over power configurations in which communication is performed over media that is primarily or previously used for providing power. One specific embodiment of network over power is Power Line Communication (PLC) (also known as power-line carrier, power-line digital subscriber line (PDSL), mains communication, power-line telecommunications, power-line networking (PLN), or Ethernet-Over-Power (EOP)), in which data is carried on a conductor that is also used simultaneously for AC electric power transmission.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The computing system can include routers, gateways, modems, switches, hubs, bridges, and repeaters. A router is a networking device that forwards data packets between computer networks and performs traffic directing functions. A network switch is a networking device that connects networked devices together by performing packet switching to receive, process, and forward data to destination devices. A gateway is a network device that allows data to flow from one discrete network to another. Some gateways can be distinct from routers or switches in that they can communicate using more than one protocol and can operate at one or more of the seven layers of the open systems interconnection model (OSI). A media bridge is a network device that converts data between transmission media so that it can be transmitted from computer to computer. A modem is a type of media bridge, typically used to connect a local area network to a wide area network such as a telecommunications network. A network repeater is a network device that receives a signal and retransmits it to extend transmissions and allow the signal to cover longer distances or overcome a communications obstruction.
It will be apparent that the present disclosure provides a skilled artisan the ability to construct a matrix in which the methylation status of one or more CpG dinucleotides and/or one or more genotypes (e.g., SNPs; e.g., at one or more alleles) can be evaluated as described herein, typically using a computer, to identify interactions and allow for prediction of the presence or incidence of CVD. Although such an analysis is complex, no undue experimentation is required as all necessary information is either readily available to the skilled artisan or can be acquired by experimentation as described herein.
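By way of a non-limiting illustration, the matrix described above can be sketched in Python as one row per subject that holds CpG methylation values, SNP genotypes, and pairwise CpG-by-SNP terms. The genotype coding (minor-allele counts of 0, 1, or 2), the use of a simple product as the interaction term, the function name, and all numerical values are assumptions made for illustration only and are not taken from this disclosure.

```python
from itertools import product

def build_feature_row(cpg_values, snp_genotypes):
    """Return (names, values): main effects followed by CpG x SNP products.

    cpg_values: dict mapping CpG identifier -> methylation value
    snp_genotypes: dict mapping SNP identifier -> allele count (assumed 0/1/2)
    """
    names = list(cpg_values) + list(snp_genotypes)
    values = [float(v) for v in cpg_values.values()]
    values += [float(g) for g in snp_genotypes.values()]
    # Hypothetical interaction encoding: product of methylation and genotype.
    for cpg, snp in product(cpg_values, snp_genotypes):
        names.append(f"{cpg}:{snp}")
        values.append(float(cpg_values[cpg]) * float(snp_genotypes[snp]))
    return names, values

# Made-up subject with two CpG loci and two SNPs.
cpgs = {"cg04988978": 0.5, "cg21161138": -1.0}
snps = {"rs1333048": 2, "rs1441433": 0}
names, row = build_feature_row(cpgs, snps)
```

Stacking such rows for all subjects yields a matrix that a downstream model can evaluate; other interaction encodings are equally possible.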
Methods of Treating, Managing and/or Monitoring Cardiovascular Diseases
The present disclosure provides methods for determining the likelihood that a subject has CVD, methods for monitoring a subject for CVD (e.g., progression of disease), methods of determining the severity of the CVD (e.g., degree of obstruction), and/or estimating the survival of a subject at risk for or who has CVD. As used herein, CVD includes, without limitation, CHD, stroke, arrhythmia, cardiac arrest, and congestive heart failure. The methods and compositions described herein provide a better ability to assess a subject's risk for or monitor the presence of cardiovascular disease, which is the first step toward more effective prevention. In addition, the methods and compositions described herein provide the ability to estimate the survival of a subject at risk for or who has been determined to have CVD, which can allow for therapies and/or lifestyle changes that may prolong or extend the survival of the subject.
Upon making a positive prognosis of a cardiac outcome (e.g., a prognosis of cardiovascular death, myocardial infarct (MI), stroke, all cause death, or a composite thereof), a medical practitioner can advantageously use the prognostic information thereby obtained to identify the need for, or to customize or optimize, an intervention in the subject, such as, for example, stress testing with ECG response or myocardial perfusion imaging, coronary computed tomography angiogram, diagnostic cardiac catheterization, percutaneous coronary intervention (e.g., balloon angioplasty with or without stent placement), coronary artery bypass graft (CABG), enrollment in a clinical trial, and administration or monitoring of effects of agents selected from, but not limited to, nitrates, beta blockers, ACE inhibitors, antiplatelet agents and lipid-lowering agents. In addition, a medical practitioner can advantageously use the prognostic information thereby obtained to make recommendations for lifestyle changes including, without limitation, diet modification, exercise regimens, smoking and/or drinking cessation, and combinations thereof. Using the information provided by the methods described herein, a medical practitioner can manage the interventions and monitor an individual to observe, e.g., an improvement.
Those identified as being at higher risk (e.g., PPV of 69% for CVD) or as having the disease can be followed up promptly for further testing or more aggressive interventions. Conversely, those at lower risk can be re-tested periodically and monitored to ensure continued prevention due to the dynamic nature of DNA methylation.
Interventions for cardiovascular disease can depend on the type of cardiovascular disease and the symptoms the individual is experiencing. Interventions for cardiovascular disease can be preventative, therapeutic or palliative. Treatments for cardiovascular diseases can include, for example, lifestyle changes (e.g., diet (e.g., low fat diet), weight loss, exercise, reduction or cessation in smoking and/or drinking), therapeutics (e.g., beta blockers, statins, calcium channel blocker, ACE inhibitors, vasodilator, alteplase, small molecule modulators, pre-/pro-/syn-/post-biotics), medical interventions (e.g., angioplasty, bypass surgery, implantable device, endarterectomy), gene therapy, gene editing, base editing, epigenetic therapy, epigenetic silencing, and/or epigenetic editing.
In accordance with the present disclosure, there may be employed conventional molecular biology, microbiology, biochemical, and recombinant DNA techniques within the skill of the art. Such techniques are explained fully in the literature. The invention will be further described in the following examples, which do not limit the scope of the methods and compositions of matter described in the claims.
EXAMPLES
Example 1—Materials and Methods
This study features data and/or biomaterial from three sources. The first set of anonymized genome-wide genetic, genome-wide DNA methylation and clinical data are from the Framingham Heart Study (FHS) Offspring cohort, the second set of anonymized clinical data and DNA are from the Intermountain Healthcare (IM) biorepository, and the third set is from an Iowa cohort as described in more detail below. The procedures and protocols used for the analysis of the FHS data and the Iowa cohort were approved by the University of Iowa Institutional Review Board (IRB #201503802 and IRB #201910834), and the procedures and protocols used for the analyses of the IM materials were approved by the Intermountain Healthcare Institutional Review Board (IRB #1024811).
Example 2—Framingham Heart Study (FHS) Offspring Cohort
The details on the collection and preparation of clinical and biological data of the FHS cohort have been described previously (dbGAP study accession: phs000007). In brief, the demographics, risk factors and clinical information were derived from the Offspring cohort, including coronary heart disease (CHD) status. CHD was considered present if an individual was diagnosed with CHD. Conversely, CHD was considered absent if an individual was not diagnosed with CHD. Sources of clinical data in determining CHD events included subject report, review of medical records, and death certificates. The designations and dates of CHD onset used in this study are as determined by a panel of three investigators on the Framingham Endpoint Review Committee, but could similarly be applied to other CVDs.
Genome-wide DNA methylation data profiled using the Illumina Infinium HumanMethylation450 BeadChip array (San Diego, CA, USA) was available from 2,567 subjects who were phlebotomized. Standard sample and probe level quality control were performed as described in previous studies, which resulted in retaining 2,560 samples and DNA methylation data from 403,192 loci (see, e.g., Dogan et al., 2018, Genes, 9:641; Pidsley et al., 2013, BMC Genomics, 14:1-10; Triche, 2014, FDb.InfiniumMethylation.hg19: Annotation package for Illumina Infinium DNA methylation probes. Vol. R package version 2.2.0; Davis et al., 2018, Handle Illumina methylation data., Vol. R package version 2.22.0; and Dogan et al., 2018, PLoS One, 13:e0190549). Genome-wide genotype data obtained using the Affymetrix GeneChip HumanMapping 500K array (Santa Clara, CA, USA) was available for 2,406 of the remaining samples. After standard sample and probe level quality control procedures were performed in PLINK on the array data as described previously, the total number of samples and SNPs remaining were 2,295 and 472,822, respectively (Dogan et al., 2018, Genes, 9:641; Dogan et al., 2018, PLoS One, 13:e0190549; and Purcell et al., 2007, Am. J. Hum. Genet., 81:559-75). Based on the number of those diagnosed with CHD (cases) and those that were not diagnosed with or did not have a CHD event within four years of examination (controls), the total number of subjects was 2,111. The demographics and conventional risk factors of these individuals are summarized in Table 1.
The first independent validation cohort consisted of 252 subjects from the Intermountain Healthcare (IM) Heart Institute INSPIRE registry who underwent coronary angiography. A CHD case subject was defined as an adult >18 years old who did not have a history of CHD or myocardial infarction (MI) prior to the index coronary angiogram but had a clinical diagnosis of CHD (>70% stenosis) on angiography. A control subject was defined as an adult >18 years old who did not have a history of CHD or myocardial infarction (MI) prior to the index coronary angiogram, had no clinical diagnosis of CHD (<50% stenosis) at the index coronary angiography and no clinical diagnosis of CHD (>70% stenosis) on angiography, MI, revascularization, or death due to CHD within four years of index coronary angiography.
The demographics of these individuals are summarized in Table 2.
Genome-wide DNA methylation and genetic assessments for each of these 253 subjects were conducted by the University of Minnesota Genome Center using the Illumina Infinium MethylationEPIC BeadChip array and the Illumina Infinium Multi-Ethnic Global BeadChip array (San Diego, CA, USA), respectively. These data were then subjected to the same quality control procedure described above for the FHS samples.
Example 4—Iowa Cohort
The second independent validation cohort consisted of 167 subjects. The demographics are shown in Table 3. The presence or absence of a clinical diagnosis of CHD was determined from medical records.
Because one of the aims of this study is to translate array-based methylation loci into clinically implementable digital PCR (dPCR) assays, which have fixed constraints on precision, the methylation variables were first reduced to loci selected on the basis of delta beta (Δβ), the absolute difference in beta value between cases and controls. This reduction was carried out before data mining, which used data from the FHS training set exclusively. All methylation beta values were then converted into M-values and scaled to have zero mean and unit variance.
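The preprocessing described above can be sketched as follows. The conversion uses the conventional definition of the M-value as the base-2 log odds of the beta value, M = log2(β/(1−β)). The clipping constant `eps`, the use of group means to compute Δβ, and the function names are assumptions for illustration; the disclosure does not specify them.

```python
import math
from statistics import mean, pstdev

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value (log2 odds)."""
    b = min(max(beta, eps), 1.0 - eps)  # clip so the logarithm stays finite
    return math.log2(b / (1.0 - b))

def standardize(values):
    """Scale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]

def delta_beta(case_betas, control_betas):
    """Absolute difference between mean case and mean control beta values
    (taking the group mean is an assumption)."""
    return abs(mean(case_betas) - mean(control_betas))

# Made-up beta values for one locus across three samples.
betas = [0.2, 0.5, 0.8]
m_values = [beta_to_m(b) for b in betas]
scaled = standardize(m_values)
```

In practice, the scaling parameters would be estimated on the training set and reused unchanged on the test and validation sets.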
All data mining, feature selection, model development and model tuning were performed exclusively on the FHS training set. Our data mining approach has been outlined in previous publications (Dogan et al., 2018, Genes, 9:641; Dogan et al., 2018, PLoS One, 13:e0190549). All analyses were performed in Python. Briefly, an undersampling-based approach was implemented to account for the high class imbalance and coupled to an ensemble of machine learning algorithms that incorporated cross-validation to uncover non-linear methylation-SNP interactions and highly predictive biosignatures in the FHS training set (Han et al., 2011, Data Mining: Concepts and Techniques, Elsevier). As a result, a marker set was selected consisting of six DNA methylation loci and ten SNPs that had the best combined performance with respect to area under the receiver operating characteristic curve (AUC), sensitivity and specificity. The ensemble model consisting of these 16 biomarkers underwent hyperparameter tuning and was finalized for testing.
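One way to picture the undersampling-based ensemble is sketched below. Each balanced subset keeps all minority-class subjects and a random sample of equal size from the majority class; a base learner would be trained on each subset and the resulting predictions combined. The subset construction, the majority-vote rule, and the function names are assumptions for illustration only, since the specific algorithms and combination rule are not detailed here.

```python
import random

def balanced_subsets(labels, n_subsets, seed=0):
    """Return index lists in which the majority class is randomly undersampled
    to the size of the minority class. labels is a sequence of 0/1 values."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))
        subsets.append(sorted(minority + sampled))
    return subsets

def majority_vote(predictions):
    """Combine 0/1 predictions from several base models for one subject.
    Ties resolve to 0 (an arbitrary choice for this sketch)."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0

# Made-up imbalanced labels: 3 cases and 17 controls.
labels = [1] * 3 + [0] * 17
subsets = balanced_subsets(labels, n_subsets=5)
```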
Example 6—Survival Analysis and Prognostic Scores
Using data from the FHS, a Kaplan-Meier survival curve and a Cox proportional hazards model can be fitted to display CHD as a function of risk group (high vs. low) as predicted by the integrated genetic-epigenetic model. The y-axis represents the probability of not having CHD. The 95% confidence interval (CI) for each distribution was calculated, and the distributions of the high and low risk groups were compared using the log-rank test.
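For orientation, a minimal Kaplan-Meier (product-limit) estimator is sketched below. At each distinct event time, the running probability of remaining free of CHD is multiplied by (1 − d/n), where d is the number of events at that time and n is the number of subjects still at risk. The function name and the sample data are hypothetical, and the sketch omits the confidence intervals and log-rank comparison, which would ordinarily be obtained from a statistics package.

```python
def kaplan_meier(times, events):
    """Return [(time, probability of remaining event-free)] at each event time.

    times: follow-up time per subject
    events: 1 if the event (e.g., CHD) occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0        # events observed at time t
        leaving = 0  # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

# Made-up follow-up data for five subjects in one risk group.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Computing one such curve for the high risk group and one for the low risk group gives the two distributions that are compared.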
Example 7—Results
The clinical and demographic characteristics of the FHS, IM and Iowa cohorts are outlined in Tables 1, 2, and 3, respectively. All of the subjects from the FHS cohort were of European ancestry, but non-European ancestry was represented in the IM and Iowa cohorts. The most notable difference was with respect to sex distribution. Compared to the FHS and IM cohorts, the CHD controls (those not diagnosed with CHD) were younger in the Iowa cohort.
Example 8—Integrated Genetic-Epigenetic Coronary Heart Disease Risk Prediction Model
Using integrated genome-wide SNP and methylation data from the training sets, a CHD prediction model was built to identify those that have CHD. All subjects had genetic (SNPs) and epigenetic (DNA methylation) molecular data. All data mining, variable selection, and model development work were performed on the FHS training set. The data from the FHS test set, and IM and Iowa independent external validation sets were used to validate the performance of the final model developed using the FHS training set. Using the data from the FHS training set, machine learning (a subset of artificial intelligence) procedures were used to develop a model for the detection of CHD. The final model was built using data from ten SNPs and six DNA methylation loci, for a total of 16 biomarkers. The performance of that model was then tuned and upon finalization, was independently examined in the FHS test and IM and Iowa independent external validation sets to better understand the generalizability of this biomarker panel.
This final ensemble model consisted of a total of 16 biomarkers, six of which were DNA methylation biomarkers and the remaining ten were SNPs. The six methylation loci are cg04988978 (5′ promoter region of MPO), cg21161138 (gene body of AHRR), cg12655112 (gene body of EHD4), cg03725309 (body of SARS1), cg12586707 (3′ intergenic region of CXCL1), and cg17901584 (gene body of DHCR24), while the ten SNPs are rs2869675 (gene body of PREX1), rs4376434 (intergenic region near LINC00972), rs12129789 (gene body of KCND3), rs7585056 (intergenic region near TMEM18), rs710987 (gene body of LINC010019), rs4639796 (gene body of ZBTB41), rs1333048 (3′ intergenic region of CDKN2B), rs12714414 (intergenic region near TMEM18), rs942317 (gene body of KTN1-AS1), and rs1441433 (gene body of PPP3CA).
The overall and sex-specific performances of this model for the detection of CHD across all three cohorts are shown in Table 4. As expected, PrecisionCHD had the best performance in the FHS training data set, which was used to develop the model. More importantly, PrecisionCHD demonstrated robust generalizability. It had 75% or better sensitivity across all cohorts, with the highest validation sensitivity and specificity being 88% and 77% in the external Iowa validation cohort. Across the three sets (i.e., FHS test, IM and Iowa) not used in the training of the model, overall, the model performed with an average AUC, sensitivity and specificity of 81%, 80% and 75%, respectively. An 80% sensitivity (true positive rate) indicates that, of 100 individuals with CHD, 80 are identified correctly by PrecisionCHD. Similarly, a 75% specificity (true negative rate) indicates that, of 100 without CHD, 75 are identified correctly. The average sensitivity and specificity for men were 81% and 73%, respectively. For women, the average sensitivity and specificity were 76% and 75%, respectively. Overall, the model performed similarly for both men and women and across cohorts, indicating minimal to no gender bias and robust generalizability.
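The interpretation of sensitivity and specificity given above can be reproduced with a short calculation. The sketch below counts true and false calls and recovers the 80 of 100 and 75 of 100 example; the label vectors are constructed for illustration and are not study data, and the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative only: 100 individuals with CHD (80 called positive) and
# 100 without CHD (75 called negative).
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 80 + [0] * 20 + [0] * 75 + [1] * 25
sens, spec = sensitivity_specificity(y_true, y_pred)  # 0.8 and 0.75
```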
Appendix A shows a list of CpGs whose methylation is associated with CVD. Appendix B shows a list of genes whose methylation is associated with CVD. Appendix C shows a list of SNPs associated with CVD. The numerical values provided in Appendix A, B, and C are the mean of 10-fold cross validation scores, AUC ROC (Area Under The Receiver Operating Characteristic Curve), sensitivity and specificity, which were computed by logistic regression. Sensitivity is the true positive rate and specificity is the true negative rate.
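To make the reported quantities concrete, the sketch below computes the area under the ROC curve as the fraction of case-control pairs in which the case receives the higher score (ties count one half), and splits subjects into ten folds whose per-fold scores would then be averaged. The fold assignment scheme and function names are assumptions; the logistic regression fit itself is not shown.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        (1.0 if p > n else (0.5 if p == n else 0.0))
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=10):
    """Assign indices 0..n-1 to k folds in round-robin order (an assumption)."""
    return [list(range(n))[i::k] for i in range(k)]

# Made-up scores for two controls and two cases.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
folds = k_fold_indices(25, k=10)
```

The mean of the per-fold AUC, sensitivity, and specificity values corresponds to the kind of summary listed in the appendices.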
Example 10—PrecisionCHD
PrecisionCHD is a quantitative test to aid in the early detection of coronary heart disease (CHD). See Table 5. This non-invasive test evaluates ten single nucleotide polymorphisms (SNPs) and six DNA methylation markers in genomic DNA isolated from human peripheral whole blood. Coronary heart disease results from heritable (genetic) and acquired, potentially modifiable lifestyle and environmental (epigenetic) factors. The PrecisionCHD early detection test measures certain complex genetic and epigenetic relationships associated with CHD and uses a machine learning model to predict CHD status.
The PrecisionCHD test is initially intended for adults between the ages of 35-80 presenting to be evaluated for coronary heart disease. The results of this test are intended to be interpreted by a healthcare provider in conjunction with a comprehensive medical evaluation. This test is not indicated for stand-alone coronary heart disease diagnostic purposes and is not intended to replace a healthcare professional's diagnosis and treatment of coronary heart disease.
As described herein, PrecisionCHD evaluates a total of 16 biomarkers, including ten SNP genotypes and six DNA methylation biomarkers. The PrecisionCHD test uses standard Taqman assays to profile genotypes and proprietary methylation sensitive digital PCR assays to profile DNA methylation markers. The biomarkers captured by PrecisionCHD map to several complex pathways associated with the biology and pathogenesis of CHD such as serine metabolism, cholesterol biosynthesis, smoking and inflammation. Cardio Diagnostics' Actionable Clinical Intelligence™ (ACI™) platform (see, for example, U.S. Application No. 63/488,463, incorporated herein by reference) maps biomarker information to modifiable risk factors for CHD to help guide personalized interventions.
Example 11—PrecisionCHD Workflow
Undergoing testing with PrecisionCHD is simple, fast, and convenient. The steps include:
- 1. Eligibility criteria:
- a. 35-80 years old
- b. Presenting to be evaluated for coronary heart disease
- i. Exclusion: patients that have undergone a bone marrow transplant are not eligible
- 2. Sample collection:
- a. Option 1: at-home lancet-based sample collection kit mailed directly to the patient upon receiving test order from a clinician.
- b. Option 2: blood draw in provider setting.
- c. Option 3: blood draw in a non-provider setting such as a mobile clinic, community center, phlebotomy center.
- 3. Sample processed at high-complexity CLIA lab to profile genotype and methylation biomarkers.
- 4. Analytics of biomarkers are performed and a clinical report is generated and shared with the ordering clinician.
- 5. Clinicians will also receive a login to the Actionable Clinical Intelligence platform (see, for example, U.S. Application No. 63/488,463, incorporated herein by reference) with supplementary mapping of molecular information to modifiable risk factors associated with the patient's status.
- 6. Clinicians discuss results with the patient, and a prevention/management plan is outlined.
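The eligibility criteria in step 1 can be expressed as a simple check, sketched below. The record type, its field names, and the treatment of the age bounds as inclusive are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    # Hypothetical record; field names are illustrative only.
    age: int
    presenting_for_chd_evaluation: bool
    had_bone_marrow_transplant: bool

def is_eligible(patient):
    """Apply the stated criteria: age 35-80 (bounds assumed inclusive),
    presenting for CHD evaluation, and no prior bone marrow transplant."""
    return (
        35 <= patient.age <= 80
        and patient.presenting_for_chd_evaluation
        and not patient.had_bone_marrow_transplant
    )

eligible = is_eligible(Patient(52, True, False))
excluded = is_eligible(Patient(52, True, True))
```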
FIG. 4 outlines several potential action items that can be implemented based on a positive CHD signal using the PrecisionCHD test.
The methods described herein can assist in selecting and/or customizing an appropriate intervention. The effectiveness of an intervention such as lifestyle or pharmaceutical can be evaluated by administering the test before and after said intervention to quantify changes, if any, in the status or risk to inform other future interventions. This process can be repeated to identify optimal interventions for the individual patient. Therefore, the methods described herein can aid in managing the interventions for an individual patient and then subsequently monitoring the patient to evaluate the efficacy of such interventions.
Example 13 PrecisionCHD-Epi for Assessing Mortality Risk in Those with Coronary Heart Disease
PrecisionCHD™ is a powerful integrated genetic-epigenetic diagnostic tool for detecting the presence of coronary heart disease (CHD) in clinical populations or for life insurance post-issue health initiatives. Its genetics-free version, PrecisionCHD-Epi, is a sensitive screening tool for use by life insurance carriers to screen individuals for the presence of CHD during the underwriting phase. For individuals in whom the presence of CHD has already been established, we performed experiments to determine whether PrecisionCHD-Epi can further stratify the severity of CHD as it relates to mortality risk.
This is an important question for life insurers, as stratifying mortal risk in those already diagnosed with CHD is one of the highest-risk endeavors for medical underwriters. Seeking to put these assessments on a firm medical basis, in an influential article published twenty years ago, Dr. Anthony Milano reviewed the extant literature and recommended using CHD class and left ventricular function as the linchpins of the mortal risk assessment (Milano, 2000, J. Insur. Med. New York, 32(3):167-85). Since that time, other markers of CHD mortality, such as N-terminal pro-brain natriuretic peptide (NT-proBNP), have been added to the underwriting armamentarium to aid prediction (Bibbins-Domingo et al., 2007, JAMA, 297(2):169-76). Even so, prediction of mortality in those with CHD is still not optimal. In 2009, Sijbrands and colleagues used the underwriting procedures of Nationale-Nederlanden to assess mortal risk in 62,334 Dutch male insurance applicants, including 3,963 subjects with current cardiovascular disease (CVD) (Sijbrands et al., 2009, PLoS One, 4(5):e5457). Despite the complete availability of medical information, and the clear relationship between CVD and mortal risk, Sijbrands concluded that “decedents could not have been identified individually by the medical evaluation employed.”
PrecisionCHD-Epi may be able to change this dynamic and help underwriters stratify mortal risk from CHD. PrecisionCHD-Epi uses the same six methylation sensitive digital PCR assays that power PrecisionCHD™ to sensitively screen for the molecular signature of CHD. Inherent in this approach is the understanding that not all forms of CHD are alike, and that both the treatment and prognoses of patients determined to have CHD will differ.
In order to demonstrate that point and show the power of this technology to predict CHD severity, we analyzed the clinical and epigenetic data from 244 subjects in the Framingham Heart Study (FHS) who were diagnosed by the FHS Endpoint Committee as having CHD at Wave 8 of the FHS Offspring Cohort Study (Cupples et al., 1988, “The Framingham Heart Study, Section 35. An Epidemiological Investigation of Cardiovascular Disease Survival following Cardiovascular Events: 30 Year Follow-up,” Lung and Blood Institute). Table 6 delineates the characteristics of the subjects. Underscoring the lack of recognition of CHD in women by current methods, two-thirds of those diagnosed with CHD were male. The rate of self-reported smoking was higher in the males, but low overall. The values for other parameters such as blood pressure and lipid values, were clinically unremarkable.
Table 7 shows the non-digitally transformed methylation values for each of the six CpG sites targeted by the MSdPCR battery. Males had significantly lower methylation values at cg12655112 and cg12586707, with demethylation at each site being associated with CHD status.
We then examined the relationship of the methylation markers to survival. A total of 71 FHS subjects (48 males and 23 females) with CHD died during the follow-up period. Using their survival times and standard proportional hazards modeling, we analyzed the relationship between methylation values at the six sites and survival using stepwise regression. A simple proportional hazards model constructed using all six MSdPCR markers predicted survival (p<0.006). The addition of age, sex, or any of the traditional predictors (lipids, HbA1c and BP) given in Table 1 did not improve prediction. Stepwise removal of non-significant markers resulted in a model of three markers, cg04988978, cg21161138 and cg12655112, that predicted survival (chi-square p<0.002), with logWorth values (−log of the p-value) of 2.00, 1.74 and 3.07, respectively.
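The logWorth values quoted above can be converted back to p-values with the short sketch below. It assumes a base-10 logarithm, which is the usual convention for logWorth in statistical software; the text itself says only "−log of the p-value", so the base is an assumption, as are the function names.

```python
import math

def p_to_logworth(p_value):
    """logWorth = -log10(p); the base-10 convention is assumed."""
    return -math.log10(p_value)

def logworth_to_p(logworth):
    """Invert the transform: p = 10 ** (-logWorth)."""
    return 10.0 ** (-logworth)

# logWorth values reported for cg04988978, cg21161138 and cg12655112.
reported = {"cg04988978": 2.00, "cg21161138": 1.74, "cg12655112": 3.07}
p_values = {cpg: logworth_to_p(lw) for cpg, lw in reported.items()}
```

Under this assumption, a logWorth of 2.00 corresponds to p = 0.01, and larger logWorth values indicate smaller p-values.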
Visualizing the effect of markers inside an interactive multivariate model of survival can be challenging. An alternative, yet less powerful method of understanding the relationship of methylation to survival can be had by simply plotting the relationship of whether or not the subject died during follow-up to methylation.
In summary, using either PrecisionCHD or PrecisionCHD-Epi, we can now provide survival estimates for those at risk of developing CHD or those determined to have CHD.
It is to be understood that, while the methods and compositions of matter have been described herein in conjunction with a number of different aspects, the foregoing description of the various aspects is intended to illustrate and not limit the scope of the methods and compositions of matter. Other aspects, advantages, and modifications are within the scope of the following claims.
Disclosed are methods and compositions that can be used for, can be used in conjunction with, can be used in preparation for, or are products of the disclosed methods and compositions. These and other materials are disclosed herein, and it is understood that combinations, subsets, interactions, groups, etc. of these methods and compositions are disclosed. That is, while specific reference to each various individual and collective combinations and permutations of these compositions and methods may not be explicitly disclosed, each is specifically contemplated and described herein. For example, if a particular composition of matter or a particular method is disclosed and discussed and a number of compositions or methods are discussed, each and every combination and permutation of the compositions and the methods are specifically contemplated unless specifically indicated to the contrary. Likewise, any subset or combination of these is also specifically contemplated and disclosed.
Claims
1. A kit for determining methylation status of at least one CpG dinucleotide and a genotype of at least one single-nucleotide polymorphism (SNP), the kit comprising:
- at least one first nucleic acid primer at least 8 nucleotides in length that is complementary to a bisulfite-converted nucleic acid sequence comprising a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or at a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the at least one first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide, and
- at least one second nucleic acid primer at least 8 nucleotides in length that is complementary to a DNA sequence or a bisulfite-converted DNA sequence of a first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or a second SNP in linkage disequilibrium with the first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433, wherein the linkage disequilibrium has a value of R>0.3.
2. The kit of claim 1, wherein the at least one first nucleic acid primer detects the unmethylated CpG dinucleotide.
3. The kit of claim 1, wherein the at least one first nucleic acid primer detects the methylated CpG dinucleotide.
4. The kit of claim 1, wherein the at least one first nucleic acid primer comprises one or more nucleotide analogs.
5. The kit of claim 1, wherein the at least one first nucleic acid primer comprises one or more synthetic or non-natural nucleotides.
6. The kit of claim 1, further comprising a solid substrate to which the at least one first nucleic acid primer is bound.
7. The kit of claim 6, wherein the substrate is a polymer, glass, semiconductor, paper, metal, gel or hydrogel.
8. The kit of claim 6, wherein the solid substrate is a microarray or microfluidics card.
9. The kit of claim 1, further comprising a detectable label.
10. The kit of claim 1, further comprising at least a third nucleic acid primer at least 8 nucleotides in length that is complementary to a nucleic acid sequence upstream of the CpG dinucleotide.
11. The kit of claim 1, further comprising at least a third nucleic acid primer at least 8 nucleotides in length that is complementary to a nucleic acid sequence downstream of the CpG dinucleotide.
12. A method of determining the presence of biomarkers in a biological sample from a subject, wherein the biomarkers are associated with detecting CVD, determining severity of CVD, estimating survival from CVD, identifying, customizing, and/or optimizing intervention(s) for CVD, managing CVD and/or monitoring CVD, the method comprising:
- (a) providing a first portion of the biological sample and a second portion of the biological sample, wherein the nucleic acid from at least the first portion is bisulfite converted;
- (b) contacting the first portion of the biological sample with a first oligonucleotide primer at least 8 nucleotides in length that is complementary to a sequence that comprises a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, or a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide; and
- (c) contacting the second portion of the biological sample with a nucleic acid primer at least 8 nucleotides in length that is complementary to a DNA sequence or a bisulfite-converted DNA sequence of a first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or a second SNP in linkage disequilibrium with a first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433, wherein the linkage disequilibrium has a value of R>0.3,
- wherein the percentage of methylation of the CpG dinucleotide at the GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, and the identity of the nucleotide at the first SNP selected from the group consisting of rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or the second SNP in linkage disequilibrium with the first SNP are biomarkers associated with detecting CVD or estimating survival from CVD.
13. The method of claim 12, wherein the biological sample is selected from the group consisting of blood and saliva.
14. The method of claim 12, wherein the window of incidence is three years.
15. A method of determining the presence of biomarkers in a biological sample from a subject, wherein the biomarkers are associated with detecting CVD, determining severity of CVD, estimating survival from CVD, identifying, customizing, and/or optimizing intervention(s) for CVD, managing CVD and/or monitoring CVD, the method comprising:
- (a) obtaining a nucleic acid sample from the subject sample;
- (b) performing a genotyping assay on a first portion of the nucleic acid sample to detect the presence of at least one SNP, wherein the at least one SNP is a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C and/or is a second SNP in linkage disequilibrium (R>0.3) with a first SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C to obtain genotype data; and/or
- (c) bisulfite converting the nucleic acid in a second portion of the nucleic acid and performing methylation assessment on a second portion of the nucleic acid sample to detect methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data; and
- (d) entering the genotype data from step (b) and/or methylation data from step (c) into an algorithm that accounts for at least one SNP main effect and/or at least one CpG main effect and/or at least one interaction effect, wherein the algorithm is a machine learning algorithm capable of accounting for linear and non-linear effects.
16. The method of claim 15, wherein the at least one interaction effect is selected from the group consisting of a gene-environment interaction (SNP×CpG) effect, a gene-gene interaction (SNP×SNP) effect, and an environment-environment interaction (CpG×CpG) effect.
17. The method of claim 15, wherein the at least one interaction effect is a gene-environment interaction effect (SNP×CpG) between a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A or a CpG site that is collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and a SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C or a SNP within moderate linkage disequilibrium (R>0.3) from a SNP selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C.
18. The method of claim 15, wherein the at least one interaction effect is an environment-environment interaction effect (CpG×CpG) between at least two CpG sites selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A.
19. The method of claim 18, wherein one or both of the at least two CpG sites are collinear (R>0.3) with one or both of the at least two CpG sites selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A.
20. The method of claim 15, wherein the at least one interaction effect is a gene-gene interaction effect (SNP×SNP) between at least two SNPs selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C.
21. The method of claim 20, wherein one or both of the at least two SNPs are collinear (R>0.3) with one or both of the at least two SNPs selected from rs2869675, rs4376434, rs12129789, rs7585056, rs710987, rs4639796, rs1333048, rs12714414, rs942317, and rs1441433 or from Appendix C.
22. The method of claim 15, wherein the biological sample is a saliva sample.
23. A kit for determining methylation status of at least one CpG dinucleotide, the kit comprising:
- at least one first nucleic acid primer at least 8 nucleotides in length that is complementary to a bisulfite-converted nucleic acid sequence comprising a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or at a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the at least one first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide.
24. A method of determining the presence of biomarkers in a biological sample from a subject, wherein the biomarkers are associated with detecting CVD, determining severity of CVD, estimating survival from CVD, identifying, customizing, and/or optimizing intervention(s) for CVD, managing CVD and/or monitoring CVD, the method comprising:
- (a) providing a biological sample from the subject at risk for or having CVD, wherein nucleic acids from at least a portion of the biological sample are bisulfite converted; and
- (b) contacting the bisulfite converted nucleic acids with a first oligonucleotide primer at least 8 nucleotides in length that is complementary to a sequence that comprises a first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, or a second CpG dinucleotide in linkage disequilibrium with the first CpG dinucleotide at a GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584, wherein the linkage disequilibrium has a value of R>0.3, wherein the first nucleic acid primer detects a methylated or unmethylated CpG dinucleotide,
- wherein the percentage of methylation of the CpG dinucleotide at the GC locus selected from the group consisting of cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 is associated with estimating survival of the subject.
25. A method of determining the presence of biomarkers in a biological sample from a subject, wherein the biomarkers are associated with detecting CVD, determining severity of CVD, estimating survival from CVD, identifying, customizing, and/or optimizing intervention(s) for CVD, managing CVD and/or monitoring CVD, the method comprising:
- (a) isolating nucleic acid sample from the subject sample;
- (b) bisulfite converting at least a portion of the nucleic acid and performing methylation assessment on the bisulfite converted nucleic acid to determine the methylation status of at least one CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A and/or a CpG site collinear (R>0.3) with a CpG site selected from cg04988978, cg21161138, cg12655112, cg03725309, cg12586707, and cg17901584 or from Appendix A to obtain methylation data; and
- (c) entering the methylation data from step (b) into an algorithm that accounts for at least one CpG main effect, wherein the algorithm is a machine learning algorithm capable of accounting for linear and non-linear effects.
Type: Application
Filed: Mar 29, 2024
Publication Date: Oct 3, 2024
Inventors: Meeshanthini V. Dogan (Chicago, IL), Timur K. Dogan (Chicago, IL), Robert A. Philibert (Chicago, IL)
Application Number: 18/621,902