PINS: A Perturbation Clustering Approach for Data Integration and Disease Subtyping
Disease subtyping is accomplished by a computer-implemented algorithm that manipulates a first genetic dataset to construct a set of first connectivity matrices. Gaussian noise is then introduced into the data to generate a perturbed dataset. The computer-implemented algorithm assesses which of the set of first connectivity matrices was least affected by the introduction of noise, and that matrix is used to define the optimal clustering. Once the optimal clustering is determined, computer-implemented supervised classification is performed to determine, for a particular patient, with which disease subtype cluster that person's genetic data most closely aligns. Armed with this knowledge, the treatment regimen can be specified with a much higher likelihood of success.
This application is a divisional of U.S. application Ser. No. 15/068,048 filed on Mar. 11, 2016, which claims the benefit of U.S. Provisional Application No. 62/132,263 filed on Mar. 12, 2015 and U.S. Provisional Application No. 62/221,727 filed on Sep. 22, 2015. The entire disclosure of each of the above applications is incorporated herein by reference.
FIELD
The present disclosure relates to disease diagnosis using genetic analysis.
BACKGROUND
The advent of high-throughput genomics technologies has resulted in massive amounts of diverse genome-scale data. Gene expression data, measured by microarrays or next-generation sequencing platforms, are the most prevalent data type available for biological data analysis. Gene Expression Omnibus stores thousands of datasets comprising independent experimental series with similar patient cohorts and experimental designs. As technologies advance, other data types become available, and together they offer complementary information on the same disease or biological phenomenon. The Cancer Genome Atlas (TCGA) has already gathered genome, transcriptome, and epigenome information for over 20 cancers for thousands of patients. The challenge is to interpret these massive amounts of high-dimensional and heterogeneous data to gain insights into biological processes.
Disease subtyping is often the first step to better understand a disease or biological phenomenon. The goal is to detect unknown groups of patients based on intrinsic features, without external information. The disease subtyping problem includes the following fundamental issues: 1) how to determine the number of clusters and assign patients to each group, and 2) how to combine complementary information to determine the final partitioning. The former problem often involves clustering mRNA expression, where the data has a small sample size but very high dimension. This remains an important problem since gene expression is one of the most prevalent data types available. The latter problem involves the integration of multi-omics data, such as mRNA expression, DNA methylation, and miRNA, for class discovery. With rapidly advancing technologies, more and more data types are available for the same set of patients, increasing the need to combine multi-omics data.
In functional genomics, agglomerative hierarchical clustering (HC) is a frequently used approach for clustering genes or samples that show similar expression patterns. HC provides a structural view of the data that makes it appealing in exploratory data analysis. However, classical HC imposes a tree structure on the data that might not reflect the underlying structure, and it is highly sensitive to the metric used to assess similarity among elements. Divisive clustering methods, such as k-means, global k-means, and fuzzy modifications of k-means, have been applied to the same problem. These methods provide clear cluster boundaries and tighter clusters, but they lack the visual appeal of HC. Another group of methods is neural network clustering, such as self-organizing maps (SOM), the Self-Organizing Tree Algorithm (SOTA), and the Dynamically Growing Self-Organizing Tree (DGSOT). Neural networks can be modeled as a collection of nodes with weighted interconnections, which can be adaptively learned. The common drawback of both k-means-based and neural-network-based methods is the need to specify the number of clusters beforehand.
Resampling-based methods have been proposed to determine the number of clusters. They assess the stability of the clustering results with respect to resampling variability. Arguably the state-of-the-art approach in this area is Consensus Clustering (CC), which develops a general, model-independent, resampling-based methodology for class discovery, cluster validation, and visualization. CC calculates the pair-wise similarities (the frequency with which elements are grouped together) and their empirical cumulative distribution function (CDF) using sub-sampling. The pair-wise similarities are then used for visualization and for estimating the number of clusters. This approach has been widely used and has produced laudable results. The main assumption of CC is that if the samples were drawn from K truly distinct sub-populations, different sub-samples would show the greatest level of stability at the true K. Unfortunately, this can make CC claim apparent structure when there is none, or declare cluster stability when the stability is subtle.
The goal of an integrative analysis is to identify subgroups of samples that are similar not only at one level (e.g., mRNA), but from a holistic perspective that takes into consideration phenomena at various other levels (e.g., DNA methylation, miRNA, etc.). One strategy is to analyze each data type independently before combining them. A drawback of this approach is that it might lead to inconsistent conclusions that are hard to integrate. Another approach is to use machine learning techniques. However, these methods are not scalable to the full spectrum of measurements, making them sensitive to the gene selection step. One recent approach, Similarity Network Fusion (SNF), creates a network of patients for each data type before fusing the networks using a metric fusion technique developed for image processing applications. The fused network is then partitioned using spectral clustering. The unstable nature of spectral clustering and of the metric fusion technique makes the method sensitive to its parameters. In addition, this method is not designed to perform clustering when only one data type is available.
SUMMARY
In this disclosure, we present a new technological approach to address both of the aforementioned issues. We refer to this approach by the acronym PINS (Perturbation clustering approach for data INtegration and disease Subtyping). Our technology is divided into two stages. In the first stage, we solve the classical clustering problem given a single data type in a new way. While particularly well suited to analyzing genetic data, our approach is general enough to be applicable to any high-dimensional data. The second stage combines the partitionings of individual data types to determine the final partitioning.
In the experimental study described here, we evaluate the first stage by clustering 8 gene expression datasets of different diseases. For all 8 datasets, PINS outperforms its competitors in recovering the true classes. To evaluate the second stage, we downloaded mRNA, DNA methylation, and miRNA data of 6 different cancers from TCGA: kidney renal clear cell carcinoma (KIRC), glioblastoma (GBM), lung squamous cell carcinoma (LUSC), breast invasive carcinoma (BRCA), acute myeloid leukemia (LAML), and colon adenocarcinoma (COAD), together with survival and clinical data. PINS substantially outperforms other methods in identifying subtypes and in predicting survival using the multi-omics data.
II. METHODS
Here we describe the perturbation clustering for a single data type and the data integration for multiple data types. This section is organized as follows. Section II-A describes the perturbation clustering (stage I), in which the patients are partitioned using one data type. This first stage outputs the clustering and the pair-wise connectivity of the patients. Section II-B describes the data integration and disease subtyping (stage II) using multi-omics data. This second stage consists of 2 phases. In phase 1, the pair-wise connectivities (between patients) for multiple data types are combined to form a network between patients. This network is then partitioned to determine the grouping using the integrated data. In phase 2, we further split each group into sub-groups if possible. The output of phase 2 is then reported as the output of PINS for the multi-omics data.
A. Perturbation Clustering (Stage I)
In stage I of PINS, we solve the classical clustering problem, i.e., we focus on clustering samples (patients) using one data type. The approach is based on the observation that small changes in any kind of quantitative assay will be inherently present between individuals, even in a truly homogeneous population in the absence of any subtypes. Here, the hypothesis is that if well-defined subtypes of disease do exist, these have to be stable with respect to small changes in the measured values. Hence, we are not interested in any clusters that form or disappear due to small changes in the data, but rather we are looking for those groupings that remain stable across many clusterings built in the presence of small changes. In order to find such clusters, we add Gaussian noise to the data and reconstruct the clustering many times. The stability is assessed by the discrepancies in the clustering results between the original and the perturbed data. Based on this, we extract the “true” number of clusters as being the one that is least affected by such perturbations. In the absence of any true subtypes, the repeated clusterings will show lack of stability thus allowing us to avoid the discovery of false subtypes.
The framework will be described here using k-means clustering as the basic building block of our subtype discovery approach, but a number of other classical clustering approaches could be used instead. It is well-known that the k-means algorithm may converge to a local minimum depending on the initialization. To overcome this problem, we use the “modified version” of k-means, i.e., we run k-means many times with different random initialization and then choose the result that gives the least residual sum of squares (RSS). In the rest of this manuscript, the term k-means refers to the “modified version” of k-means.
The high-level algorithm can be briefly described as follows: i) For a given number of clusters k, we cluster the original data using k-means and then construct the connectivity between patients (original connectivity). ii) We add noise to the data and re-cluster the perturbed data many times to determine the average connectivity between patients when the data are perturbed (perturbed connectivity). iii) We calculate the discrepancy between the original clustering and the perturbed clustering for each k. iv) We repeat the above steps for all values of k in a range of interest (e.g., 2..10). v) We choose the k which gives the least discrepancy between the original and perturbed connectivity. The corresponding clustering is then returned as the most stable one.
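The high-level steps above can be sketched in Python. This is a minimal illustration, not the patented implementation: the tiny `kmeans` helper (random restarts, keeping the least-RSS run, mirroring the "modified" k-means described above), the grid-based trapezoid approximation of the AUC, and the default values of `K`, `H`, and `restarts` are all assumptions made to keep the sketch self-contained.

```python
import numpy as np

def kmeans(X, k, restarts=10, iters=50, rng=None):
    """Restarted k-means; keep the run with the least residual sum of squares."""
    rng = np.random.default_rng(rng)
    best_labels, best_rss = None, np.inf
    for _ in range(restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(iters):
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(0)
        rss = ((X - centers[labels]) ** 2).sum()
        if rss < best_rss:
            best_labels, best_rss = labels, rss
    return best_labels

def connectivity(labels):
    """Step ii) building block: C(i, j) = 1 iff samples i and j share a cluster."""
    return (labels[:, None] == labels[None, :]).astype(float)

def perturbation_clustering(X, K=10, H=20, seed=0):
    """Steps i)-v): pick the k whose clustering is least affected by noise."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(np.median(X.var(axis=0)))   # noise level from median column variance
    aucs, parts = {}, {}
    for k in range(2, K + 1):
        labels = kmeans(X, k, rng=rng)          # i) original clustering
        C = connectivity(labels)
        A = np.zeros_like(C)
        for _ in range(H):                      # ii) perturb and re-cluster H times
            A += connectivity(kmeans(X + rng.normal(0, sigma, X.shape), k, rng=rng))
        A /= H
        D = np.abs(C - A)                       # iii) discrepancy for this k
        cs = np.linspace(0.0, 1.0, 101)
        F = np.array([(D <= c).mean() for c in cs])          # empirical CDF of D
        aucs[k] = (((F[1:] + F[:-1]) / 2) * np.diff(cs)).sum()  # trapezoid AUC
        parts[k] = labels
    k_hat = max(aucs, key=aucs.get)             # v) most stable k (largest AUC)
    return k_hat, parts[k_hat], aucs
```

On data with two well-separated groups and mostly uninformative columns, the median-variance noise is small, the k=2 partitioning is unchanged by perturbation (AUC near 1), and larger k values split a homogeneous group unstably, so k=2 is selected.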
Our approach differs from the existing methods in the following aspects: i) we accept the noisy nature of the biological measurement and use the data as they are (without data pre-processing) and therefore do not suffer from information loss and do not require a preliminary feature selection, nor a dimensionality reduction; ii) our stability metric is expected to deterministically and reliably identify the number of clusters present in the data.
1. Construction of Original Connectivity Matrices (Steps 1-2)
In step (1), we partition the patients using all possible numbers of clusters k∈[2..K]. Formally, the input I can be presented as a set of N patients I={e1, e2, . . . , eN} where each element ei lies in an M-dimensional space and represents the molecular profile of the ith patient (i∈[1..N]). A partitioning Pk (k clusters) of I can be written in the form Pk={P1, P2, . . . , Pk} where Pi is a set of patients, such that ∪i=1k Pi=I and Pi∩Pj=Ø, ∀i, j∈[1..k], i≠j. After step (1), we have (K−1) partitionings: {P2, . . . , PK}, one for each value of k∈[2..K].
In step (2), we build the pair-wise connectivity between the patients using the partitionings obtained from step (1). For a partitioning Pk, two patients are connected if they are clustered together. We build the connectivity matrix Ck∈{0,1}N×N from the partitioning Pk={P1, P2, . . . , Pk} as follows:

Ck(i,j)=1 if there exists l∈[1..k] such that both ei and ej belong to Pl, and Ck(i,j)=0 otherwise  (1)
In other words, the connectivity between two patients is 1 if and only if they belong to the same cluster. Let us consider one example. We cluster a set of 5 elements into 2 clusters with the resulting partitioning P2={{1, 2},{3, 4, 5}}. In this case, element 1 is connected to element 2 and is not connected to the other elements {3, 4, 5}. Similarly, elements {3, 4, 5} are all connected to each other, but not to elements {1, 2}. Using equation (1), we construct the connectivity matrix for P2 as follows:

C2 =
1 1 0 0 0
1 1 0 0 0
0 0 1 1 1
0 0 1 1 1
0 0 1 1 1
Intuitively, a partitioning can be presented as a graph in which each patient is a node and the connectivity between two patients is an edge, such that the edge exists if and only if the two patients have similar gene expression profiles and thus are clustered together. Any two patients of a cluster are connected by an edge, and any two patients of different clusters are not connected. The connectivity matrix is exactly the adjacency matrix of this graph.
We construct one connectivity matrix for each value of k∈[2..K]. After step (2), we have (K−1) connectivity matrices C2, . . . , CK. We refer to these matrices as original connectivity matrices because they are constructed from the original data without perturbation or resampling.
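As an illustration, the connectivity matrix for the worked example above (P2={{1, 2},{3, 4, 5}}) can be built with one vectorized comparison; the 0-based label encoding is an implementation choice, not part of the method.

```python
import numpy as np

# Cluster labels for the 5 elements of P2 = {{1, 2}, {3, 4, 5}}:
# elements 1 and 2 get label 0, elements 3, 4, 5 get label 1.
labels = np.array([0, 0, 1, 1, 1])

# Equation (1): C(i, j) = 1 iff elements i and j share a cluster.
C = (labels[:, None] == labels[None, :]).astype(int)
print(C)  # prints the 5x5 block-structured 0/1 matrix
```

The same one-liner produces the original connectivity matrix Ck for any partitioning, given its label vector.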
2. Generating Perturbed Datasets (Steps 3-4)
In order to assess the stability of the partitionings obtained in steps (1-2), we generate H new datasets by adding Gaussian noise to the original data I. In step (3), we calculate the noise variance from the input. We first calculate the variances of the M components (columns). For example, for a gene expression assay of 20,000 genes, we will get 20,000 variances, where the variance of each gene represents the variability of that gene among the individuals. We then choose the median of the M variances to be the noise variance. Our reasoning is that the majority of the genes should have similar expression across individuals; for those genes, the difference between individuals is due to technical variability and individual heterogeneity. By choosing the median variance, we hope that our noise setting is automatically adjusted to the noise of the system. Formally, the noise variance is calculated as follows:

σ2=medianj∈[1..M](Var(I·j))  (2)

where I·j denotes the jth column of the data matrix I.
In step (4), we generate new datasets J(h)∈RN×M(h∈[1..H]) by adding Gaussian noise to the original data as follows:
J(h)=I+N(0,σ2) (3)
where σ2 is calculated in equation (2). After this step, we have H perturbed datasets J(1), J(2), . . . , J(H). We refer to these datasets as perturbed datasets because they are generated by perturbing the original data. The perturbed datasets will be used to compute the perturbed connectivity matrices in the following section.
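Steps (3)-(4) can be sketched as below; the use of `numpy.random.default_rng` and the `seed` parameter are implementation choices for reproducibility.

```python
import numpy as np

def perturbed_datasets(I, H=100, seed=0):
    """Steps (3)-(4): noise variance from the median column variance, eq. (2),
    then H perturbed copies of the input, eq. (3)."""
    rng = np.random.default_rng(seed)
    sigma2 = np.median(I.var(axis=0))  # eq. (2): median of the M column variances
    return [I + rng.normal(0.0, np.sqrt(sigma2), I.shape) for _ in range(H)]
```

Each perturbed dataset J(h) has the same shape as I; only small Gaussian noise, scaled to the data's own typical variability, distinguishes it from the original.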
3. Construction of Perturbed Connectivity Matrices (Steps 5-7)
In step (5), we cluster each of the H perturbed datasets using k-means with varying values of k∈[2..K]. For example, for k=2, we partition the dataset J(1) into 2 clusters and get the partitioning Q2(1). We perform k-means with k=2 for all the H perturbed datasets and get H different partitionings Q2(1), Q2(2), . . . , Q2(H) for k=2. Please note that all these perturbed datasets were generated by adding small noise to the same input I. In the ideal case, Q2(1), Q2(2), . . . , Q2(H) are all identical to P2. The more they differ, the less reliable the partitioning P2.
After step (5), we have H different partitionings Qk(1), Qk(2), . . . , Qk(H) for each value of k∈[2..K]. In step (6), we construct a connectivity matrix for each partitioning created in step (5). Specifically, for the partitioning Qk(h)(h∈[1..H], k∈[2..K]), we construct the connectivity matrix Gk(h)∈{0,1}N×N as follows:

Gk(h)(i,j)=1 if ei and ej belong to the same cluster of Qk(h), and Gk(h)(i,j)=0 otherwise  (4)
After this step, we have H connectivity matrices Gk(1), Gk(2), . . . , Gk(H) for each value of k. In graph terms, each connectivity matrix can be considered a representation of the network between patients. For a given value of k, Ck represents the network for the original data while Gk(h) represents the network between patients each time we perturb the data. The stability of the clustering is assessed based on the discrepancy between these altered networks and the original network. We first combine the altered networks before comparing the combined network to the original network. In step (7), we calculate the perturbed connectivity matrix by averaging the connectivity matrices Gk(1), Gk(2), . . . , Gk(H) as follows:

Ak=(Σh=1H Gk(h))/H  (5)
where Ak∈[0,1]N×N, k∈[2..K]. We refer to these matrices as perturbed connectivity matrices. For each value of k∈[2..K], we have one original connectivity matrix and one perturbed connectivity matrix. The discrepancy between the two matrices reflects the stability of the partitioning Pk.
4. Stability Assessment (Steps 8-10)
Given the number of clusters k, we would like to quantify the discrepancy between Ck and Ak. In step (8), we calculate the difference matrix Dk∈[0,1]N×N as follows:
Dk=|Ck−Ak| (6)
Dk(i,j) represents the change in connectivity between i and j when the data are perturbed. Dk consists of numbers falling into the interval [0,1]. The distribution of the entries of Dk reflects the stability of the clustering. The more this distribution shifts towards 1, the less robust the clustering. To quantify the discrepancy, we compute the empirical cumulative distribution function (CDF) of the entries of Dk. In step (9), we compute the function Fk as follows:

Fk(c)=|{(i,j):Dk(i,j)≤c}|/N2, c∈[0,1]  (7)
where the numerator represents the number of elements in Dk that are smaller than or equal to c while the denominator represents the total number of elements in the matrix Dk.
In step (10), we calculate the area under the curve AUCk of the CDF. When Ck and Ak are identical (i.e., data perturbation does not change the clustering result), the difference matrix Dk consists of only 0's. In this case, Fk(0)=1, and thus the area under the curve AUCk is maximized, i.e., AUCk=1. If Ck and Ak differ from each other, then the entries of Dk shift towards 1, making AUCk smaller than 1. The more Ck and Ak differ, the smaller the AUCk; conversely, the larger the AUCk, the more stable the partitioning. Therefore, we choose the optimal {circumflex over (k)} as the one for which the area under the curve (AUC) is maximized.
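Steps (8)-(10) can be sketched as follows; the 201-point grid and the trapezoid rule used to integrate the empirical CDF over [0,1] are approximation choices, not prescribed by the disclosure.

```python
import numpy as np

def stability_auc(C, A):
    """Area under the empirical CDF of the entries of D = |C - A| over [0, 1].
    Identical matrices give AUC = 1; the more the connectivity shifts under
    perturbation, the smaller the AUC."""
    D = np.abs(C - A)                            # eq. (6): difference matrix
    cs = np.linspace(0.0, 1.0, 201)
    F = np.array([(D <= c).mean() for c in cs])  # empirical CDF of the entries of D
    return (((F[1:] + F[:-1]) / 2) * np.diff(cs)).sum()  # trapezoid rule
```

In the degenerate case D = 0 the CDF is identically 1 and the AUC is exactly 1, matching the discussion above; any fractional entries in the perturbed connectivity pull the AUC below 1.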
At the end of stage I, we return the optimal value of {circumflex over (k)}, the partitioning P{circumflex over (k)}, the original connectivity matrix C{circumflex over (k)}, and the perturbed connectivity matrix A{circumflex over (k)}. The connectivity matrices C{circumflex over (k)}, A{circumflex over (k)} represent the network between patients for one data type. These matrices will be used to combine multi-omics data for the final clustering in stage II of PINS.
To illustrate the workflow of the algorithm, we generate 10 simulated datasets named Gaussian1, Gaussian2, . . . , Gaussian10. The number in each name is the number of classes of the dataset. Each dataset has 100 samples and 1,000 genes. The samples are divided as equally as possible among the classes. For example, Gaussian2 has 2 classes of size 50 and Gaussian3 has 3 classes of sizes 33, 33, and 34. We will show that the AUC values are notably low for the Gaussian1 dataset, which suggests that any partitioning of this dataset is very unstable against data perturbation.
Visually, the perturbed connectivity matrix A2 in panel (B) shows that data points are randomly connected. This is also true for any other value of k∈[2..10]. In summary, the original connectivity greatly disagrees with the perturbed connectivity, which correctly reflects the absence of real structure in the data.
Now that we understand the behavior of PINS for random data, we would like to know how PINS works on datasets that have separable classes.
As a summary, we display the AUC values of all 10 simulated datasets, from Gaussian1 to Gaussian10, in the accompanying figure.
Before moving to a detailed discussion of how subtyping of multi-omics data is processed, the overall workflow of stage I is summarized in the accompanying flowchart.
Whether a single dataset 10 or a multi-omics dataset 12 is being used, the dataset is processed to construct the original connectivity matrices 14, as described in detail above. In parallel with the construction of the original connectivity matrices, perturbed datasets are generated as at 16. This is done, as described above, by injecting suitably configured Gaussian noise into the data. The perturbed datasets are then used to construct perturbed connectivity matrices 18.
With the original and perturbed connectivity matrices now both constructed, a stability assessment is performed at 20. As a result of this assessment, the computer-implemented algorithm identifies several important data values that describe the optimal clusters for the given datasets. These data values include the optimal value of k, designating the optimal number of clusters, and the optimal partitioning of the dataset, indicating to which cluster each person's data belongs. In addition, the algorithm stores the original connectivity matrix and the perturbed connectivity matrix for use in subsequent processing steps. Note that the stability assessment step defines these data values for each dataset supplied. Thus, if the dataset 10 was used, a single optimal k value and a single optimal partition would be generated. If dataset 12 were instead used, the stability assessment step would generate a separate optimal k value for each data type (e.g., mRNA, DNA methylation, and miRNA) and a separate optimal partition for each data type as well.
Note that the process described above is the one illustrated in the accompanying flowchart.
B. Subtyping Multi-Omics Data
In this section, we describe the workflow of PINS for multi-omics data. Let us denote by T the number of data types. The input of PINS is a set of T matrices I={I1, I2, . . . , IT}, where Ii∈RN×Mi is the data matrix of the ith data type, N is the number of patients, and Mi is the number of measurements in the ith data type.
Our disclosed workflow consists of two stages: i) integrate the data and cluster the patients, ii) further split each group into subgroups if possible. In stage I, we construct the combined similarity matrix between patients using the connectivity information from individual data types. We then combine 3 similarity-based algorithms to determine the final partitioning of the multi-omics data. In stage II, we further split each discovered subtype if possible.
1) Stage I—Data Integration and Subtyping:
The algorithm starts by clustering each data type using the perturbation clustering (as described above). Consider the ith data type with the data matrix Ii. The perturbation clustering estimates {circumflex over (k)}i as the number of clusters and then partitions the data into {circumflex over (k)}i clusters. The algorithm returns the original connectivity matrix Ci, in which the connectivity between elements of the same cluster is 1 and the connectivity between elements of different clusters is 0. Please note that the index i here denotes the index of the data type. For T data types, we have T original connectivity matrices C1, C2, . . . , CT. We combine the connectivity matrices for the original data as follows:

SC=(Σi=1T Ci)/T  (9)
We refer to SC as the original similarity matrix because it is constructed from the original connectivity matrices. If we consider each patient as a node, and the connectivity between two patients is an edge, then each connectivity matrix for each data type represents a graph. Our goal is to identify subgraphs that are strongly connected across all data types.
We then measure the agreement between the T data types using a concept similar to the pair-wise agreement of the Rand index (RI). Given two partitionings of the same set of items, the RI is calculated as the number of pairs that “agree”, divided by the total number of possible pairs. A pair “agrees” if the two samples are either grouped together in both partitionings or separated in both partitionings. We extend this concept to the T partitionings of the T data types. First, we define the connectivity between two patients to be consistent if it does not change across data types. We then define the agreement of the T data types as the number of pairs having consistent connectivity, divided by the total number of possible pairs. In other words, the agreement between the data types can be calculated as follows:

agree(SC)=|{(i,j):i&lt;j, SC(i,j)∈{0,1}}|/(N(N−1)/2)  (10)
If the majority of pairs are consistent, i.e., agree(SC)>50%, we say that the T data types have strong agreement. In this case, we define a strong similarity matrix ŜC as follows:

ŜC(i,j)=1 if SC(i,j)=1, and ŜC(i,j)=0 otherwise  (11)
where ŜC(i, j)=1 if and only if i and j are clustered together in all data types and 0 otherwise. A hierarchical clustering is then applied on this matrix and the resulting tree is cut at the height that provides maximum cluster separation.
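The agreement measure and the strong similarity matrix can be sketched directly from the per-data-type connectivity matrices; the function names are illustrative only.

```python
import numpy as np

def agreement(conn_mats):
    """Fraction of patient pairs whose connectivity is identical across all
    T data types (the multi-way analogue of the Rand index agreement)."""
    S = np.mean(conn_mats, axis=0)          # original similarity matrix S_C
    consistent = (S == 0.0) | (S == 1.0)    # pair is 0 in every type, or 1 in every type
    iu = np.triu_indices(S.shape[0], k=1)   # count each distinct pair once
    return consistent[iu].mean()

def strong_similarity(conn_mats):
    """S_hat(i, j) = 1 iff i and j are clustered together in every data type."""
    return (np.mean(conn_mats, axis=0) == 1.0).astype(float)
```

An averaged entry of SC equals 0 or 1 exactly when the pair's connectivity is unanimous across data types, which is why the consistency test reduces to comparing against 0 and 1.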
When the data types do not have strong agreement, we perform a cluster ensemble of 3 different methods, as explained below. The matrix {SC(i, j)} represents the similarity between patients, and therefore {1−SC(i, j)} represents the pair-wise distance between patients, which can be directly used by similarity-based clustering algorithms, such as hierarchical clustering, partitioning around medoids, or dynamic tree cut. Here we use all 3 algorithms to partition the patients and then choose the partitioning that agrees the most with the partitionings of the individual data types. The dynamic tree cut algorithm can automatically determine the number of clusters, but the other two algorithms, hierarchical clustering (HC) and partitioning around medoids (PAM), require the number of clusters to be provided.
To determine the number of clusters for HC and PAM, we introduce the perturbed similarity matrix, which can be calculated as follows:

SA=(Σi=1T Ai)/T  (12)
where Ai is the perturbed connectivity matrix of the ith data type. Please note that SC is constructed by averaging the original connectivity matrices of the T data types while SA is constructed by averaging the perturbed connectivity matrices of the T data types. We use both matrices to assess the stability of HC and PAM.
For hierarchical clustering, we first build the tree H1 using the original similarity matrix SC, and then we build the tree H2 using the perturbed similarity matrix SA. For each value of k, we cut H1 to get k clusters and then build the connectivity matrix. We do the same for H2 and then calculate the instability dk as the sum of absolute differences between the two connectivity matrices. We choose the k for which dk is the smallest, i.e., {circumflex over (k)}=argmink(dk), k∈[2..K].
Similarly for PAM, we partition the patients using both the original and perturbed similarity matrices. For each value of k, we have one partitioning using the original similarity matrix SC and one partitioning using the perturbed similarity matrix SA. We build the connectivity matrices from the two partitionings and then calculate the instability dk as the sum of absolute differences between the two connectivity matrices. We choose the {circumflex over (k)} for which dk is the smallest, i.e., {circumflex over (k)}=argmink(dk), k∈[2..K].
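The tree-based selection of {circumflex over (k)} for HC can be sketched with SciPy's hierarchical clustering; the choice of average linkage is an assumption (the text does not fix the linkage method), and the function name is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def choose_k_hc(S_C, S_A, K=10):
    """Pick k minimizing the instability d_k between cuts of the tree built on
    the original similarities (H1) and the tree built on the perturbed ones (H2)."""
    # Distances are 1 - similarity; condensed form feeds the linkage routine.
    Z1 = linkage(squareform(1.0 - S_C, checks=False), method='average')
    Z2 = linkage(squareform(1.0 - S_A, checks=False), method='average')
    best_k, best_d = None, np.inf
    for k in range(2, K + 1):
        c1 = fcluster(Z1, t=k, criterion='maxclust')   # cut H1 into k clusters
        c2 = fcluster(Z2, t=k, criterion='maxclust')   # cut H2 into k clusters
        M1 = (c1[:, None] == c1[None, :]).astype(float)
        M2 = (c2[:, None] == c2[None, :]).astype(float)
        d_k = np.abs(M1 - M2).sum()                    # instability d_k
        if d_k < best_d:
            best_k, best_d = k, d_k
    return best_k
```

The analogous PAM selection differs only in how the two partitionings for each k are produced (medoid-based partitioning of the two distance matrices instead of tree cuts).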
After obtaining the 3 partitionings from the 3 similarity-based clustering algorithms, we calculate the agreement between each partitioning and the T data types. Again, we use the agreement concept introduced in Equation (10). For each algorithm, we calculate the agreement between its partitioning and the T partitionings of the T data types. We then choose the result of the algorithm that has the highest agreement with the T data types.
2) Stage II—Splitting Groups into Subgroups
In stage II, we further split the discovered groups of patients, one group at a time, if possible. We check each group independently. If a group has more than ⅔ of the total samples, we run the procedure described in stage I again, but this time the input consists of only the patients belonging to the group under consideration. The goal is to separate samples of this group in a way that would not be possible in the presence of samples from other groups.
If a group has less than ⅔ but more than ⅓ of the total samples, we need to check the agreement between the T data types. We take into consideration only the samples belonging to this group. We cluster each data type and build the T connectivity matrices. Here we calculate the agreement between the data types using Equation (10). If the agreement is more than 50% (i.e., the majority of pairs agree across all data types), we further split the group. Otherwise, the group is not split.
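The size thresholds above can be captured in a small decision helper; the function name and the `agreement_fn` callback (which would return the Equation (10) agreement computed on the group's samples only) are illustrative, not part of the disclosure.

```python
def should_resplit(group_size, total, agreement_fn):
    """Decide how to treat one discovered group in stage II.

    agreement_fn: callable returning the fraction of consistent pairs
    across the T data types, restricted to this group's samples."""
    if group_size > 2 * total / 3:
        return "recluster"   # large group: rerun stage I on this group alone
    if group_size > total / 3:
        # medium group: split only if the data types mostly agree on it
        return "split" if agreement_fn() > 0.5 else "keep"
    return "keep"            # small group: retained as-is
```

The callback is only invoked for medium-sized groups, matching the text: large groups are always re-clustered and small groups are never split.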
The Kaplan-Meier survival curves for these groups are shown in the accompanying figure.
The survival curves of the final partitioning are displayed in the accompanying figure.
Remarkably, the significantly different groups can be obtained only when the 3 types of data are integrated and analyzed together. PINS cannot find subgroups with significantly different survival for any one of the single data types: mRNA, methylation, and miRNA (more details in Table IV in section III-B). However, when all types of data are integrated by our approach, the p-value of the obtained subtypes becomes 4 orders of magnitude more significant.
The above-described computer-implemented algorithm for subtyping multi-omics data will now be summarized with reference to the accompanying flowchart.
If at step 26 there is not strong agreement, the algorithm applies plural different clustering algorithms, as indicated diagrammatically at 29. These plural algorithms are effectively run in parallel. While a number of different clustering algorithms may be utilized, for purposes of explaining the technique, the three similarity-based algorithms described above are used: hierarchical clustering, partitioning around medoids, and dynamic tree cut.
Referring now to the stage II portion of the flowchart, each discovered group is examined in turn.
If the group is deemed to be of medium size, at step 37, some additional processing is performed. If the group is already below medium size (e.g., less than one-third of the total samples), it will be retained as-is, without further subdividing, as indicated at 41. However, if the group is of medium size, the algorithm checks at step 38 for agreement among the data types. If agreement among the data types is high (e.g., greater than 50% agreement), as detected at 39, the group is eligible to be further subdivided. Thus, at step 40, the group is sent back to be further subdivided. On the other hand, if agreement among the data types is not high, the group is left un-split, as at 41.
III. EXPERIMENTAL STUDIES
Our experimental studies include a wide range of cancers, using a single data type as well as multi-omics data. To evaluate the perturbation clustering (stage I) using a single data type, we downloaded 8 gene expression datasets with known classes (subtypes) from the Gene Expression Omnibus and Broad Institute websites. The performance of each clustering method is assessed by comparing its partitionings against the true classes of each dataset. To evaluate the data integration (stage II) of multi-omics data, we downloaded mRNA, methylation, and miRNA data of 6 different cancers from The Cancer Genome Atlas (TCGA) website. The performance of the clustering methods is assessed by comparing the survival of the patients.
A. Experimental Studies Using Gene Expression Data
In this section we assess the performance of PINS in clustering a single data type (stage I). Details of the 8 gene expression datasets are described in Table I. The 5 datasets GSE10245, GSE19188, GSE43580, GSE15061, and GSE14924 were downloaded from Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). The 3 datasets Lung2001 (http://www.broadinstitute.org/mpr/lung/), AML2004 (http://www.broadinstitute.org/cancer/pub/nmf), and Brain2002 (http://www.broadinstitute.org/MPR/CNS/) were downloaded from the corresponding websites of Broad Institute.
We compare the performance of PINS with that of 2 other state-of-the-art clustering algorithms, Consensus Clustering (CC) and Similarity Network Fusion (SNF). The range of the number of clusters k is set to [2..10] for all 3 clustering algorithms. A suitable computer implementation of Consensus Clustering is found in the R statistical software package (ConsensusClusterPlus version 1.18.0). The code for CC is run to obtain the change in area under the curve Δ(k) as the number of clusters k increases. We choose the number of clusters {circumflex over (k)} where the CDF levels off and the corresponding Δ({circumflex over (k)}) gets close to zero, according to the classical CC manuscript.
Regarding Similarity Network Fusion (SNF), although SNF focuses on data integration, it also provides an option to cluster a single data type. The R package of the SNF tool, version 2.1, was downloaded from the Bioconductor website. The code is run according to the description of the software. We calculate the number of clusters for SNF using the function estimateNumberOfClustersGivenGraph with the range set to [2..10]. This function returns 4 possible choices. We use the first one as the number of clusters in this study.
For all the 8 gene expression datasets, we know the true labels (subtypes) of each sample. Therefore, we use the Rand Index (RI) and adjusted Rand Index (ARI) as the metrics to assess the agreement between the clustering and the ground truth (true classes of the elements). Briefly, the Rand Index of 2 partitionings is the number of pairs that agree divided by the total number of pairs. In short,

RI = (a + b) / C(N, 2)

where a is the number of pairs that are clustered together in both partitionings, b is the number of pairs that are separated in both partitionings, and C(N, 2) = N(N−1)/2 is the total number of possible pairs from N elements. The adjusted Rand index (ARI) is the corrected-for-chance version of the Rand index (Appendix A). The clustering results are calculated using all genes without filtering. However, for illustration purposes, the clustering results are displayed only in the first 2 principal components.
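The Rand Index described here can be computed by direct pair counting. This is a minimal pure-Python sketch; the function name is our own choice.

```python
from itertools import combinations

def rand_index(labels1, labels2):
    """Rand Index of two partitionings of the same N elements:
    (a + b) / C(N, 2), where a counts pairs placed together in both
    partitionings and b counts pairs separated in both."""
    a = b = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1          # pair clustered together in both
        elif not same1 and not same2:
            b += 1          # pair separated in both
    n = len(labels1)
    return (a + b) / (n * (n - 1) / 2)
```

Note that the index is invariant to relabeling: comparing [0,0,1,1] against [1,1,0,0] gives RI = 1, since every pair agrees.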
The summary of all results is shown in Table II. The first 3 columns in the table show the names, sample numbers, and true class numbers of the data sets. The next 3 columns show the number of clusters, RI, and ARI for the clustering results of PINS. The last 6 columns show those of CC and SNF. For each dataset (row), cells highlighted in green have the highest RI and ARI. We put NA in the result of SNF for dataset GSE14924 because it returns an error message without a result. For all the 8 datasets, PINS achieves higher clustering performance than SNF and CC.
B. Experimental Studies Using Multi-Omics Data
1) Analysis Across a Wide Spectrum of Cancer:
We downloaded 6 different cancer datasets from The Cancer Genome Atlas (TCGA): glioblastoma multiform (GBM), lung squamous cell carcinoma (LUSC), breast invasive carcinoma (BRCA), acute myeloid leukemia (LAML), kidney renal clear cell carcinoma (KIRC), and colon adenocarcinoma (COAD). For each cancer dataset, we downloaded TCGA-curated level 3 data of mRNA expression, DNA methylation, and miRNA expression. We analyze the set of patients that have measurements across all 3 data types. TCGA contains different platforms for each data type. We choose the platform for each data type so that the data types have the largest set of common patients.
Table III displays details of the data for the 6 cancer datasets. The number of samples is the number of patients that have measurements across all the 3 data types. The number of components for a data type is the number of measurements per patient for that data type. The DNA methylation values fall between 0 and 1 and the microarray measurements (gene expression) fall between 2 and 14. We use these data as they are, without processing. For sequencing data, since the values are too large (up to millions), we use their log transformation (base 2).
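The per-data-type preprocessing just described can be sketched as follows. This is an illustrative assumption-laden sketch: the data-type tags are our own convention, and the pseudo-count of 1 for zero counts is our assumption, since the text does not say how zeros are handled.

```python
import math

def preprocess(values, data_type):
    """Apply the per-data-type preprocessing described above.

    Methylation values (already in [0, 1]) and microarray values
    (roughly [2, 14]) are used as-is; raw sequencing counts, which can
    reach into the millions, are log2-transformed.  The +1 pseudo-count
    guarding against log2(0) is an assumption, not from the source.
    """
    if data_type == "sequencing":
        return [math.log2(v + 1) for v in values]
    return list(values)
```

For example, a sequencing count of 7 maps to log2(8) = 3.0, while a methylation beta value of 0.2 passes through unchanged.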
Since Consensus Clustering (CC) does not have the functionality to integrate multiple data types, we compare PINS against SNF in this section. Using the clinical data from TCGA, we calculate the Cox log-rank test p-values for each partitioning. We note that Cox p-values were also used to assess clustering performance for SNF. We report the number of clusters and Cox p-values for each data type as well as for the integrated data in Table IV. The first 3 columns describe the data while the next 4 columns show the number of clusters and Cox p-value for PINS and SNF. The results for the integrated data are displayed in bold. The cells highlighted in green have significant p-values (cutoff 0.05). SNF gives a significant p-value for only LAML while PINS gives significant p-values for KIRC, GBM, LUSC, BRCA, and LAML.
For the kidney renal clear cell carcinoma (KIRC) dataset, neither algorithm can find groups with significantly different survival using any single data type. SNF cannot find significantly different groups even after data integration. In contrast, when all data types are integrated by PINS, the p-value of the obtained subtypes becomes 4 orders of magnitude more significant.
For the glioblastoma multiform (GBM) dataset, SNF cannot find significantly different groups using mRNA or miRNA, but it finds 2 significantly different groups using methylation data (p=0.017). Similarly, PINS cannot find significant groups using mRNA or miRNA but finds 2 significantly different groups using methylation data (p=10^−4). SNF cannot find significantly different groups after data integration despite having significantly different groups using methylation data. In contrast, data integration by PINS finds 3 groups with an even more significant p-value than those of the individual data types (p=8.7×10^−5).
C. First Case Study—Glioblastoma Multiform (GBM)
We use PINS to subtype multi-omics data for 273 patients with glioblastoma multiform (GBM). The data types are mRNA expression (HT HG-U133A), DNA methylation (HumanMethylation27), and miRNA expression (Illumina HiSeq miRNASeq) as shown in Table III.
We downloaded the somatic mutation data for GBM from the TCGA website. Among the 273 samples, only 125 samples have somatic mutation information. Here we take a mutation (gene) into consideration if it is positive in at least 5 samples. We count the number of mutations in each subtype for each gene. We then calculate the enrichment p-value using Fisher's exact test and adjust for multiple comparisons using FDR correction. Table V displays the 3 mutations that are enriched after FDR correction (at cutoff 0.01). We can see that IDH1 and ATRX mutations appear only in subtype 2 and not in the other subtypes.
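The enrichment test just described can be sketched in pure Python: a one-sided Fisher's exact test on the 2×2 mutated/not-mutated × in-subtype/not-in-subtype table, computed from the hypergeometric tail, followed by Benjamini–Hochberg FDR adjustment. This is a simplified sketch under our own naming, not the authors' code.

```python
from math import comb

def fisher_one_sided(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): the chance of
    observing k or more mutated samples in a subtype of size n when
    K of the N samples overall carry the mutation (one-sided
    enrichment p-value of Fisher's exact test)."""
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / comb(N, n)

def fdr_bh(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values (q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end              # 1-based rank of p-value i
        prev = min(prev, pvals[i] * m / rank) # enforce monotonicity
        q[i] = prev
    return q
```

For instance, if all 5 mutated samples out of 10 land in a subtype of size 5, the one-sided p-value is 1/C(10,5) = 1/252 ≈ 0.004, and the q-values would then be compared against the 0.01 cutoff used above.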
We downloaded the somatic mutation data for the samples in the GBM dataset from TCGA and applied the same approach explained above. The genes which are mutationally enriched in each of the subtypes are shown in Table VI. Again, we observe that the subtypes found for the GBM data set are significantly over-represented by samples that have at least one mutation in the genes that are mutated frequently in the samples in the different survival groups.
D. Second Case Study—Kidney Renal Clear Cell Carcinoma (KIRC):
Multiple integrative approaches have been applied to identify the subtypes of kidney renal clear cell carcinoma (KIRC). Depending on the data types, these analyses can lead to different conclusions.
Here we integrate three data types (mRNA, DNA methylation, miRNA) of 131 patients (Table III).
We downloaded the somatic mutation data for the samples in KIRC dataset from TCGA. Using the mutation information for the samples, which have been subtyped using the approach described above, we identify genes that are mutated frequently in the samples in the different survival groups. The significance of mutations for each gene in each subtype is assessed by the number of samples with at least one mutation in that gene in that subtype and in the whole analysis using Fisher's exact test. We apply this approach to the KIRC datasets downloaded from TCGA. The genes which are mutationally enriched in each of the subtypes are shown in Table V. We observe that the subtypes found in the analysis are significantly over-represented by samples that have at least one mutation in these genes.
We use PINS to integrate multi-omics data for 124 patients with kidney renal clear cell carcinoma (KIRC). The data types are mRNA expression (Illumina HiSeq), DNA methylation (HumanMethylation27), and miRNA expression (Illumina GASeq) as shown in Table III.
In stage II, subtype 1 is equally divided into 2 subgroups with very different survival rates. Subtype 3 is not considered in stage II because it consists of too few samples. Subtype 2 is not divided in stage II because the 3 data types give very contradictory signals for this group. In summary, PINS discovers 4 different subtypes with very different survival profiles (p=1.3×10^−4). The significantly different subtypes can be obtained only when the 3 data types are integrated and analyzed together. Although the subtype discovery was done on molecular data alone, with no use of clinical information, the 4 groups identified have significantly different clinical profiles: group 1-1 contains short-term survival women, group 1-2 contains longer-term survival women, group 2 contains only men, and group 3 contains survivors (all patients were still alive at the end of the study).
1) KIRC Clinical Parameters:
Enrichment for different clinical characteristics was analyzed for each of the four survival clusters. Table VI shows the numbers and percentages of the 124 patients falling into each of the survival clusters and clinical categories. FDR-adjusted p-values were calculated for phenotype enrichment in each of the clusters versus the others, and for good versus poor survivors. Using an FDR cutoff of 5%, we find that group 3 (all survivors) is typically between ages 50 and 60, with normal calcium levels, while the medium survivors are typically younger, and poor survivors tend to be over 70. Group 2 (medium-survival males) tends to have low calcium and fall into grade III. Poor survivors (female group 1-1) have elevated platelets and elevated calcium, and fall into grade G4, stages III and IV. Group 1-2 (medium-to-good survival females) are predominantly stage I with normal hemoglobin. The two pure female clusters (1-1 and 1-2) include the majority of patients with elevated platelets. The low survivors (1-1 and 2), compared to the high survivors (groups 3 and 1-2), have low hemoglobin and high grade.
2) KIRC Female Subgroups—Clinical Parameters:
The low survival group includes 86% of the Stage IV cases, while the high survival group includes 77% of the Stage I cases, representing an FDR-corrected p-value of less than 0.5%. Other parameters that were significant at FDR<5% included tumor grade (poor survivors had a higher incidence of ‘G4’ tumors, FDR 2%), tumor status (poor survivors were ‘with tumor’, FDR 2%), hemoglobin level (poor survivors had low levels, FDR 2%), and metastasis (FDR 2%).
3) KIRC Female Subgroups—Differential Gene Expression:
Based on absolute log-fold-change of 1.5 and maximum FDR adjusted p-value of 1%, there were 165 differentially expressed genes between long- and short-term survivors. Eighty percent (132) of these were down-regulated in the poor survivors. Functional analysis of all 165 genes using iPathwayGuide points to damaged proximal tubules in the nephrons of women with poor outcome. Most common renal cell carcinomas occur in the proximal tubules. Two KEGG pathways had FDRs on the order of 10−4: ‘Metabolic Pathways’, and ‘Mineral Absorption’. Several differentially expressed solute carriers on the Mineral Absorption Pathway are located in ‘brush border membrane’, shown in
4) KIRC Female Subgroups—MiRNAs:
iPathwayGuide also outputs a ranked list of miRNAs that are up-regulated between short- and long-survival women. We selected the top 10 miRNAs and performed t-tests using the miRNA expression. Two microRNAs that were significantly up-regulated (after FDR correction) in the low survival group were among the 10 miRNAs identified by iPathwayGuide with significant enrichment of down-regulated target genes: hsa-mir-497 and hsa-mir-27a. Dysregulation of these 2 miRNAs has been observed in multiple cancers, and they may be up- or down-regulated depending on the context (mircancer.ecu.edu).
Hsa-mir-497 is a tumor suppressor reported to be involved in antiproliferation (cell cycle—G1 arrest, p53 correlation), increased apoptosis (through WEE1), suppression of angiogenesis (through VEGFA), suppression of migration and invasion (through SMURF1), and modulation of multidrug resistance (through BCL2). Hsa-mir-497 dysregulation has been found in carcinomas (prostate, bladder, colon, pancreas, breast, lung, gastric, liver, cervical, peritoneal, . . . ), and is thought to participate in the following biological processes/pathways: cadherin, WNT, T-cell activation, cell-cycle progression, apoptosis, PI3K/AKT, and MAPK/ERK.
Hsa-mir-27a is considered to be an ‘onco-miR’ with potential SNP inactivation. It is associated with numerous cancers: breast (familial), cervical (HeLa), glioma, AML, ALL, renal, colorectal, prostate, gastric, ovarian, pancreatic, lung, . . . and carcinomas: oral squamous cell, hepatocellular, esophageal squamous cell, gastric adenocarcinoma, . . . It has been implicated in promoting cellular proliferation, migration and invasion and inhibiting apoptosis (through MCPH1, FOXO1 and SPRY2), control of endothelial cell repulsion and angiogenesis (through SEMA6A and VEGF), promoting metastasis by inducing epithelial to mesenchymal transition, enhanced expression of proinflammatory cytokines, and impairment of adipocyte differentiation and mitochondrial function. It is thought to participate in the following biological processes and pathways: cell adhesion and cell-cell interactions, VEGF-mediated signaling, MAPK/ERK signaling, EGFR signaling, cell cycle, NF-κB signaling, proinflammatory cytokines, and basal transcription of the p53-273-H/mir-27a/EGFR pathway.
IV. EXEMPLARY EMBODIMENT
Referring now to
Thereafter, a DNA sample from an individual patient is obtained at 74 and analyzed with a benchtop analyzer 72. Essentially, the benchtop analyzer 72 obtains sample data from the individual patient that is identical to or of a similar character to the data collected from the plurality of human subjects as processed by the laboratory genetic analyzer 50. The output from the benchtop analyzer 72 is supplied to the data processor 66. The data processor has an associated non-transitory memory 64 into which has been loaded the cluster data from memory 62, representing the previously discovered unsupervised optimal clusters. Processor 66 uses the data from the individual patient to perform a supervised classification analysis which identifies which cluster most closely represents the data from the individual patient, as shown at 70.
With the individual patient now assigned to the optimal cluster that most closely corresponds to that patient's actual genetically analyzed data, the treating physician selects the treatment regimen that is best suited to that patient's needs. As noted above, without this knowledge the conventional treatment protocol might well dictate that the patient be given the “most popular” treatment, unaware that in this case that treatment will not work and thus valuable life-giving time is being wasted.
V. CONCLUSIONS
In this disclosure, we present a new approach for data integration and disease subtyping. Our contribution is twofold. First, we propose a novel method to efficiently cluster high-dimensional data. This approach adds noise to the input to learn the data's behavior. The algorithm then chooses the partitioning that is the most robust against data perturbation. Second, we integrate multi-omics data by combining the similarity matrices of the individual data types. Our framework looks for strong connections across all data types to determine the number of clusters for the final partitioning. This makes the partitioning more stable than looking at the partitioning of each data type alone.
The advantage of the new approach is demonstrated by extensive data analysis. We examine 8 gene expression datasets of different diseases: lung cancer, leukemia, and brain tumors. The Rand Index (RI) and Adjusted Rand Index (ARI) are used as metrics to compare the performance of PINS, Consensus Clustering (CC), and Similarity Network Fusion (SNF). For all the 8 datasets, Perturbation Clustering outperforms its competitors and also correctly identifies the number of subtypes for most of the datasets.
To evaluate the new approach's ability to combine multi-omics data, we examine 6 cancers available on The Cancer Genome Atlas (TCGA): glioblastoma multiform (GBM), lung squamous cell carcinoma (LUSC), breast invasive carcinoma (BRCA), acute myeloid leukemia (LAML), kidney renal clear cell carcinoma (KIRC), and colon adenocarcinoma (COAD). Using the Cox log-rank test, we show that our framework has a clear advantage over the competing methods.
APPENDIX
Adjusted Rand Index
We use the Rand index (RI) and adjusted Rand Index (ARI) as the metrics to assess the agreement between a clustering and the ground truth (true classes of the elements). The Rand Index of 2 partitionings is the number of pairs that agree divided by the total number of pairs. In short,

RI = (a + b) / C(N, 2)

where a is the number of pairs that are clustered together in both partitionings, b is the number of pairs that are separated in both partitionings, and C(N, 2) = N(N−1)/2 is the total number of possible pairs from N elements.

The adjusted Rand index (ARI) is the corrected-for-chance version of the Rand index. Let us consider two partitionings {C1, C2, . . . , Cr} and {G1, G2, . . . , Gs}, let nij be the number of elements common to Ci and Gj, and let ai and bj be the corresponding row and column sums. The agreement between these two partitionings is summarized by the contingency Table IX. ARI can be calculated as follows:

ARI = [Σij C(nij, 2) − Σi C(ai, 2) Σj C(bj, 2) / C(N, 2)] / [½(Σi C(ai, 2) + Σj C(bj, 2)) − Σi C(ai, 2) Σj C(bj, 2) / C(N, 2)]
While RI falls into the interval [0,1], ARI can be negative. It can be shown that ARI has an expected value of 0 for two random partitionings.
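The chance-corrected index can be computed directly from the contingency table. This is a minimal pure-Python sketch; the function name and the handling of the degenerate all-singletons case are our own choices.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels1, labels2):
    """ARI from the contingency table of two partitionings.
    Expected value 0 for two random partitionings; maximum 1."""
    n = len(labels1)
    contingency = Counter(zip(labels1, labels2))  # n_ij counts
    a = Counter(labels1)                          # row sums a_i
    b = Counter(labels2)                          # column sums b_j
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case (e.g., all singletons)
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Unlike RI, the result can indeed be negative: comparing [0,0,1,1] against [0,1,0,1] yields ARI = −0.5, an agreement worse than chance, while a pure relabeling still scores 1.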
Claims
1-18. (canceled)
19. A disease subtyping/patient subgrouping computerized method based on the molecular profile of a patient, the computerized method comprising:
- receiving assay results of samples collected from a population of human subjects;
- generating a population dataset from the assay results that defines a molecular profile for the population;
- receiving an assay result of a patient sample;
- generating a patient dataset from the assay result that defines a molecular profile of the patient; and
- analyzing the population dataset by: a) acquiring the population dataset from the population of human subjects; b) performing an unsupervised cluster analysis on the population dataset to define clusters of the population, wherein each cluster represents a disease subtype or a patient subgroup; c) using machine learning techniques to extract predetermined features of each disease subtype or patient subgroup; and d) finding a disease subtype or patient subgroup that represents the closest match to the molecular profile of the patient.
20. The computerized method according to claim 19, further comprising:
- analyzing the patient dataset and selecting a disease treatment regimen.
21. The computerized method according to claim 19, wherein the population of human subjects comprises a plurality of human subjects that are stricken with the same disease.
22. The computerized method according to claim 21, wherein the patient has the same disease as the plurality of human subjects and the method comprises finding the disease subtype of the patient's disease.
23. The computerized method according to claim 22, further comprising treating the patient based on the disease subtype.
24. The computerized method according to claim 19, wherein the population of human subjects comprises a plurality of human subjects that have been treated with the same drug.
25. The method according to claim 24, wherein the patient has been treated with the same drug as the plurality of human subjects and the method comprises subgrouping the plurality of human subjects based on responses to the drug.
26. The computerized method according to claim 25, further comprising finding the subgroup of human subjects to which the patient belongs based on the patient's response to the drug.
27. The computerized method according to claim 19,
- wherein the population of human subjects comprises a plurality of human subjects who have the same disease, wherein the disease manifests differently in groups of the plurality of subjects, and the method further comprises: storing the clusters of the population in a data store comprising a non-transitory computer-readable memory; processing the patient dataset using a computer-implemented classifier, the classifier ingesting the clusters in the data store and using the clusters to find which of the clusters represents a closest match to the patient dataset;
- storing the cluster found to represent the closest match to the patient dataset as patient classification data in a non-transitory computer-readable memory; and
- using the patient classification data in the selection of a disease treatment regimen,
- wherein the unsupervised cluster analysis comprises:
- (a) applying a computer-implemented algorithm to the population dataset to construct a set of first connectivity matrices which are then stored in non-transitory computer-readable memory;
- (b) using a computer-implemented algorithm to construct and store in non-transitory computer-readable memory a perturbed dataset by introducing noise into the population dataset;
- (c) applying a computer-implemented algorithm to the perturbed dataset to construct a set of perturbed connectivity matrices which are then stored in non-transitory computer-readable memory;
- (d) using a computer-implemented algorithm to perform a stability assessment that reads from memory and compares the first connectivity matrices with the perturbed connectivity matrices;
- (e) using a computer-implemented algorithm to select from among the first set of connectivity matrices the one matrix whose corresponding perturbed matrix was least affected by the introduction of noise; and
- (f) storing the selected one matrix in non-transitory computer-readable memory and using a computer-implemented algorithm to construct the clusters.
28. The computerized method of claim 27 wherein the assay results provide quantitative biological data selected from the group consisting of mRNA expression, DNA methylation, miRNA expression, protein abundance (proteomics), and metabolic concentrations (metabolomics) and genetic data.
29. The computerized method according to claim 19, further comprising:
- using the population dataset as input, wherein the population of human subjects have the same disease, the population dataset being developed by high-throughput assays selected from the group consisting of mRNA expression, DNA methylation, miRNA expression, proteomic expression, and metabolic concentration,
- forming the molecular profile for the population, and performing the following steps:
- a) applying k-means to the population dataset with different settings of numbers of clusters, each setting resulting in one clustering of patients;
- b) constructing an original connectivity matrix for each clustering of patients;
- c) introducing Gaussian noise to the dataset and constructing the perturbed connectivity matrices;
- d) performing stability assessment between the original and perturbed connectivity matrices;
- e) selecting one original connectivity matrix that is the least affected by noise; and
- f) choosing a cluster of patients that corresponds to the selected connectivity matrix as the optimal subtyping.
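Steps a)–f) above can be sketched end-to-end in pure Python. This is a simplified illustration rather than the patented implementation: it runs on 1-D values for brevity, uses a basic k-means, and scores each k by the mean absolute disagreement between the original and perturbed connectivity matrices, whereas the actual stability assessment in PINS is more elaborate. All names are our own.

```python
import random

def kmeans(points, k, rng, restarts=5, iters=50):
    """Basic k-means on 1-D points; returns one cluster label per point."""
    best_labels, best_cost = None, float("inf")
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(iters):
            labels = [min(range(k), key=lambda c: (p - centers[c]) ** 2)
                      for p in points]
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:
                    centers[c] = sum(members) / len(members)
        cost = sum((p - centers[l]) ** 2 for p, l in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def connectivity(labels):
    """Connectivity matrix: entry (i, j) is 1 iff i, j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def perturbation_clustering(points, ks, sigma=0.5, seed=0):
    """Steps a)-f): cluster at each candidate k, re-cluster a
    Gaussian-perturbed copy of the data, and choose the k whose
    connectivity matrix is least affected by the noise."""
    rng = random.Random(seed)
    perturbed = [p + rng.gauss(0.0, sigma) for p in points]   # step c)
    best = None  # (discrepancy, k, labels)
    for k in ks:
        labels = kmeans(points, k, rng)                       # step a)
        orig = connectivity(labels)                           # step b)
        pert = connectivity(kmeans(perturbed, k, rng))        # step c)
        n = len(points)
        # step d): simplified stability assessment
        disc = sum(abs(orig[i][j] - pert[i][j])
                   for i in range(n) for j in range(n)) / (n * n)
        if best is None or disc < best[0]:                    # step e)
            best = (disc, k, labels)
    return best[1], best[2]                                   # step f)
```

On two well-separated groups of samples, the connectivity matrix for k=2 is unchanged by small perturbations, while larger k splits nearly identical samples inconsistently, so k=2 is selected as the optimal subtyping.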
30. The computerized method according to claim 19, further comprising:
- generating the population dataset from a plurality of types of molecular data obtained from the population of human subjects each human subject in the population having the same disease,
- defining T molecular profiles for the population of human subjects and performing the following steps:
- a) applying the algorithm defined in (1)-(6) below to define subtypes of the disease for each data type:
- (1) applying a computer-implemented algorithm to the first quantitative biological dataset to construct a set of first connectivity matrices which are then stored in non-transitory computer-readable memory;
- (2) using a computer-implemented algorithm to construct and store in non-transitory computer-readable memory a perturbed dataset by introducing noise into the first quantitative biological dataset;
- (3) applying a computer-implemented algorithm to the perturbed dataset to construct a set of perturbed connectivity matrices which are then stored in non-transitory computer-readable memory;
- (4) using a computer-implemented algorithm to perform a stability assessment that reads from memory and compares the first connectivity matrices with the perturbed connectivity matrices;
- (5) using a computer-implemented algorithm to select from among the first set of connectivity matrices the one matrix whose corresponding perturbed matrix is least affected by the introduction of noise; and
- (6) storing the selected one matrix in non-transitory computer-readable memory and using a computer-implemented algorithm to construct the plurality of clusters,
- b) constructing the connectivity matrix for each data type;
- c) constructing the similarity matrix from T connectivity matrices to gain a holistic view of the integrated data; and
- d) partitioning the integrated similarity matrix to define subtypes of the disease using the integrated data.
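Steps b)–d) above can be sketched as follows. The element-wise average of the T per-data-type connectivity matrices yields the integrated similarity matrix; for the final partitioning we illustrate "strong connections across all data types" with a simple connected-components pass over the thresholded matrix, which is our simplification — the actual method partitions the integrated matrix with a proper clustering algorithm. All names are ours.

```python
def integrate_connectivity(matrices):
    """Similarity matrix from T connectivity matrices: entry (i, j) is
    the fraction of data types in which samples i and j co-cluster."""
    T, n = len(matrices), len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / T for j in range(n)]
            for i in range(n)]

def partition_by_strong_connections(sim, threshold=1.0):
    """Group samples whose pairwise similarity meets the threshold
    (threshold=1.0 keeps only connections present in ALL data types);
    connected components of the thresholded graph become the clusters."""
    n = len(sim)
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = cluster
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels
```

With three data types where only two agree that samples 0 and 1 belong together, the pair survives a majority threshold (0.5) but not the all-types threshold (1.0), illustrating how the strictness of "strong connections" controls the final partitioning.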
31. The computerized method according to claim 19, comprising:
- generating the population dataset from the population of human subjects, each human subject of the population having the disease;
- processing the population dataset using the unsupervised cluster analysis to define the clusters;
- storing the clusters in a data store comprising a non-transitory computer-readable memory;
- processing the patient dataset using a computer-implemented classifier, the classifier ingesting the clusters in the data store and using the clusters to find which of the clusters represents a closest match to the patient dataset;
- storing the cluster found to represent the closest match as patient classification data in a non-transitory computer-readable memory; and
- using the patient classification data in the selection of a disease treatment regimen,
- wherein the unsupervised cluster analysis comprises:
- (a) applying a computer-implemented algorithm to the population dataset to construct a set of first connectivity matrices which are then stored in non-transitory computer-readable memory;
- (b) using a computer-implemented algorithm to construct and store in non-transitory computer-readable memory a perturbed dataset by introducing noise into the population dataset;
- (c) applying a computer-implemented algorithm to the perturbed dataset to construct a set of perturbed connectivity matrices which are then stored in non-transitory computer-readable memory;
- (d) using a computer-implemented algorithm to perform a stability assessment that reads from memory and compares the first connectivity matrices with the perturbed connectivity matrices;
- (e) using a computer-implemented algorithm to select from among the first set of connectivity matrices the one matrix whose corresponding perturbed matrix is least affected by the introduction of noise; and
- (f) storing the selected one matrix in non-transitory computer-readable memory and using a computer-implemented algorithm to construct the clusters.
32. A system comprising:
- at least one processor and
- a memory coupled to the at least one processor; wherein the memory stores instructions for execution by the at least one processor, and wherein the instructions include: receiving assay results of samples collected from a population of human subjects; generating a population dataset from the assay results that defines a molecular profile for the population; receiving an assay result of a patient sample; generating a patient dataset from the assay result that defines a molecular profile of the patient; and analyzing the population dataset by: a) acquiring the population dataset from the population of human subjects; b) performing an unsupervised cluster analysis on the population dataset to define clusters of the population, wherein each cluster represents a disease subtype or a patient subgroup; c) using machine learning techniques to extract predetermined features of each disease subtype or patient subgroup; and d) finding a disease subtype or patient subgroup that represents the closest match to the molecular profile of the patient.
Type: Application
Filed: Jan 3, 2020
Publication Date: Nov 5, 2020
Inventors: Sorin DRAGHICI (Detroit, MI), Tin Chi NGUYEN (Detroit, MI)
Application Number: 16/733,660