SYSTEMS AND METHODS FOR MISSING DATA IMPUTATION

Info

Publication number: 20230121411
Type: Application
Filed: Sep 20, 2022
Publication Date: Apr 20, 2023
Applicant: The Regents of the University of California (Oakland, CA)
Inventors: Majid Sarrafzadeh (Los Angeles, CA), Myung-Kyung Suh (Los Angeles, CA)
Application Number: 17/949,076

Abstract

Congestive heart failure (CHF) is a leading cause of death in the United States. WANDA is a wireless health project that leverages sensor technology and wireless communication to monitor the health status of patients with CHF. The first pilot study of WANDA showed the system’s effectiveness for patients with CHF. However, WANDA experienced a considerable amount of missing data due to system misuse, nonuse, and failure. Missing data is highly undesirable as automated alarms may fail to notify healthcare professionals of potentially dangerous patient conditions. Embodiments of the present disclosure may utilize machine learning techniques including projection adjustment by contribution estimation regression (PACE), Bayesian methods, and voting feature interval (VFI) algorithms to predict both non-binomial and binomial data. The experimental results show that the aforementioned algorithms are superior to other methods with high accuracy and recall.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of United States Patent Application No. 14/241,431 filed Feb. 26, 2014, now United States Patent No. 11,450,413, which application is a 371 United States national phase application of PCT/US2012/052544, filed Aug. 27, 2012, which claims priority to U.S. Provisional Patent Application No. 61/528,065 filed Aug. 26, 2011, the contents of which are incorporated herein by reference in their entirety.

STATEMENT OF GOVERNMENT SPONSORED RESEARCH

This invention was made with Government support under Grant No. LM007356, awarded by the National Institutes of Health. The Government has certain rights in this invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of a system 100 for collecting and imputing patient health data.

FIG. 2 illustrates one example of an embodiment of an HFSAS questionnaire.

FIG. 3 is a table showing the correlation coefficient values of each method.

FIG. 4 illustrates the accuracies of the obtained values, which range between 83.2% and 98.5%.

FIG. 5 is a table that illustrates the experimental result, and shows that naive Bayes, Bayesian network, and VFI have recall values of up to 0.7 for weight, 0.714 for systolic blood pressure, 0.889 for diastolic blood pressure, and 0.906 for heart rate values.

FIG. 6 is a table that illustrates the recall values.

DETAILED DESCRIPTION

Congestive heart failure (CHF) is a leading cause of death in the United States with approximately 670,000 individuals diagnosed every year. The sequelae of CHF are well known, with frequent decompensation of the chronic state resulting in recurrent hospitalizations. Experts believe that constant monitoring of patients with CHF is important to the health of such patients. Remote patient monitoring is a promising solution for an expanding population of CHF patients who are unable to access clinics due to insufficient resources, inconvenient location, or advanced infirmity. Medical care facilitated by remote technology has the potential to enable early detection of key clinical symptoms indicative of CHF-related decompensation. Such remote technologies can also enable health professionals to offer surveillance, advice, and continuity of care to trigger early implementation of strategies that enhance adherence behaviors.

The WANDA (Weight and Activity) project is one example of a wireless health project that leverages sensor technologies and remote communication to monitor the health status of patients with CHF. WANDA monitors health-related measurements and other information deemed relevant to CHF assessment, including weight, blood pressure, heart rate, activity, and daily somatic awareness scale questionnaires. Detailed descriptions of the WANDA system and its use for monitoring CHF patients can be found in Suh, M. et al., “WANDA B.: Weight and activity with blood pressure monitoring system for heart failure patients,” in 2010 IEEE International Symposium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2010, pp. 1-6; Suh, M. et al., “An automated vital sign monitoring system for congestive heart failure patients,” Proceedings of the 1st ACM International Health Informatics Symposium, 2010; and Suh, M. et al., “A remote patient monitoring system for congestive heart failure,” Journal of Medical Systems, 2011, all of which are incorporated herein by reference in their entirety for all purposes.

It is desired for a remote monitoring system such as WANDA to collect and store all monitored vital signs. Any unhealthy changes in a patient’s vital signs should be addressed promptly in order to prevent further degradation of a patient’s health. Unfortunately, the first randomized trial of WANDA experienced a considerable amount of missing data. Only 33% of the somatic questionnaires were completed, and 55.7% of data had missing values for weight, blood pressure, and heart rate. Moreover, 22.2% of patients experienced system misuse and requested help to accustom themselves to WANDA’s technologies. Missing data was further caused by system nonuse and service disorder (such as a network failure, resulting in as much as 6.3% of all of the missing data).

Notably, other studies have experienced similar data loss. Missing data is especially common in randomized controlled trials. Wood’s study showed that 89% of 71 trials published in 2001 in well-known journals (the British Medical Journal, the Journal of the American Medical Association, the Lancet, and the New England Journal of Medicine) reported having partly missing outcome values. Many studies applied last observation carried forward, worst case imputation, and complete case analysis techniques. However, such techniques may lead to biased results. To date, there has been no study on missing data imputation in CHF randomized trials.

One objective of embodiments of the present disclosure is to enhance the accuracy of CHF missing data imputation using data mining techniques. Data imputation may allow a patient monitoring system to detect an unhealthy change in patient vital signs even when portions of that data are not collected by the system. Embodiments of the present disclosure exploit the projection adjustment by contribution estimation (PACE) regression method for predicting and imputing non-binomial data such as questionnaire responses. Bayesian methods and voting feature interval (VFI) are used to impute binomial data. Results of these methods may be compared using accuracy and correlation coefficient values for non-binomial cases, and recall values for binomial cases. The foregoing methods may be compared with several other popular data mining methods. The experimental results show that PACE regression, Bayesian methods, and voting feature interval are superior to other methods for CHF patient data imputation.

FIG. 1 illustrates a block diagram of a system 100 for collecting and imputing patient health data. Patient data is collected from a patient 90 by at least one data collection device 102. As described above with respect to WANDA, the at least one data collection device may include a scale, a heart rate monitor, a blood pressure monitor, a motion-sensing activity monitor, and/or a computing device configured to collect questionnaire answers. In one embodiment, the data collection device 102 may be a separate device that collects data values from such devices at the location of the patient 90.

The data collection device 102 transmits the data to a patient data computing device 104, where the patient data is stored in a raw data store 106. In one embodiment, the data collection device 102 transmits the data to the patient data computing device 104 over a network such as a public switched telephone network; a wide area network; a local area network; the Internet; a wireless network such as 3G, 4G, LTE, GSM, Bluetooth, WiFi, WiMax; and/or via any other suitable networking technology. In another embodiment, the data collection device 102 may be transported to the location of the patient data computing device 104, and may transmit the data to the patient data computing device 104 via a direct data connection between the devices, such as a USB connection, a Firewire connection, and/or the like.

A prediction engine 108 may then impute missing patient data values as discussed further below, and may store the imputed patient data values in a predicted data store 110. In some embodiments, the prediction engine 108 may search for missing values, and then perform the calculations described below to predict the missing values. If the predicted values are beyond threshold limits, such as a threshold limit specified by a caregiver, the patient data computing device 104 may generate an alert to be presented to the caregiver. The alert may include one or more predicted or measured values, which may then prompt the caregiver to check the status of the patient or to ask the patient to verify the predicted values. In cases where the predicted values do not match the actual status of the patient, the prediction engine 108 may use the actual status as training data for a subsequent prediction.
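By way of non-limiting illustration, the flow described above (search for missing values, predict them, compare the result against caregiver-specified limits, and generate an alert) might be sketched as follows. The names `Limits`, `impute_and_check`, and `mean_predictor` are hypothetical and do not appear in the disclosure, and the stand-in predictor simply returns the mean of earlier readings rather than any of the imputation methods discussed below.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Limits:
    """Caregiver-specified threshold limits for one vital sign (hypothetical)."""
    low: float
    high: float


def impute_and_check(
    readings: Dict[str, Optional[float]],
    limits: Dict[str, Limits],
    predict: Callable[[str], float],
) -> List[str]:
    """Fill missing readings in place and return one alert message per violation."""
    alerts = []
    for name, value in readings.items():
        was_missing = value is None
        if was_missing:
            value = predict(name)          # stand-in for the prediction engine
            readings[name] = value         # overwriting an existing key is safe here
        lim = limits.get(name)
        if lim is not None and not (lim.low <= value <= lim.high):
            kind = "predicted" if was_missing else "measured"
            alerts.append(f"{name}: {kind} value {value} outside [{lim.low}, {lim.high}]")
    return alerts


# Made-up history and limits, for illustration only.
history = {"weight": [80.0, 81.0, 86.0], "heart_rate": [70.0, 72.0, 74.0]}


def mean_predictor(name: str) -> float:
    return sum(history[name]) / len(history[name])


today = {"weight": None, "heart_rate": 71.0}
limits = {"weight": Limits(75.0, 82.0), "heart_rate": Limits(50.0, 100.0)}
alerts = impute_and_check(today, limits, mean_predictor)
```

In this sketch the alert text records whether the offending value was predicted or measured, so that a caregiver could ask the patient to verify a predicted value.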

In some embodiments, the prediction engine 108 may include one or more computer-executable components stored on a computer-readable medium that, if executed by a processor of a computing device, cause the computing device to perform the actions described below. In some embodiments, the prediction engine 108 may include one or more computing devices specially configured to perform the described actions. In some embodiments, the raw data store 106 and the predicted data store 110 may be databases managed by a conventional relational database management system (RDBMS). One of ordinary skill in the art will recognize that the raw data store 106 and the predicted data store 110 may be separate databases, or may be stored in a single database. In other embodiments, the raw data store 106 and/or the predicted data store 110 may use any other suitable storage method, such as a structured query language (SQL) file, a spreadsheet, a text document, and/or the like.

In some embodiments, the patient data computing device 104 may include at least one processor, an interface for coupling the computing device to the data collection device 102, and a nontransitory computer-readable medium. The computer-readable medium may have computer-executable instructions stored thereon that, in response to execution by the processor, cause the patient data computing device 104 to perform the calculations described further below. One example of a suitable computing device is a personal computer specifically programmed to perform the actions described herein. This example should not be taken as limiting, as any suitable computing device, such as a laptop computer, a smartphone, a tablet computer, a cloud computing platform, an embedded device, and/or the like, may be used in various embodiments of the present disclosure. One of ordinary skill in the art will recognize that the components illustrated as part of the patient data computing device 104 may be combined into a single component, or may each be split apart into multiple components. Further, the patient data computing device 104 may be a single computing device that stores and/or executes each of the illustrated components, or may include multiple computing devices communicatively coupled to each other that each store and/or execute part or all of the illustrated components.

Non-Binomial Case Imputation

In one embodiment, WANDA may employ the Heart Failure Somatic Awareness Scale (HFSAS), which is a 12-item Likert-type scale that measures awareness of signs and symptoms specific to CHF. A 4-point Likert-type scale is used to ascertain how much a patient is bothered by a symptom (0: not at all, 1: a little, 2: a great deal, 3: extremely). FIG. 2 illustrates one example of an embodiment of an HFSAS questionnaire. In order to predict missing answers to such a questionnaire, embodiments of the present disclosure may use the projection adjustment by contribution estimation regression algorithm (PACE) (rounding any non-integer value returned by PACE). This method is based on maximum likelihood estimation (MLE) and an empirical Bayes framework to minimize the Kullback-Leibler (KL) distance between the original and the estimation function.

First, the PACE algorithm transforms parameters using MLE’s asymptotic normality property to convert the original parameters. The algorithm utilizes the empirical Bayes estimator in (1):

$θ_{i}^{m} = \frac{\int θ f (x_{i} | θ) \, d G_{k} (θ)}{\int f (x_{i} | θ) \, d G_{k} (θ)}$

where θ(x) is the estimator, f(x_i | θ_i) is a probability density function (PDF), and G_k is a consistent estimator of G, which is the mixing distribution of the mixture f_G = ∫f(x|θ)dG. The developed algorithm minimizes the KL distance between f and $\tilde{f}$ in (2):

$Δ_{K L} (f, \tilde{f}) = E_{f} \log (\frac{f}{\tilde{f}}) = \int \log (\frac{f}{\tilde{f}}) f d x$

This method may show better results in high-dimensional data spaces, and was applied to complete cases (those in which all 12 questions were answered) to evaluate the accuracy.
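As a hedged numerical illustration of expressions (1) and (2), and not of the PACE algorithm as a whole, the sketch below assumes a unit-variance normal PDF for f(x | θ) and a discrete mixing distribution G_k, so that the integrals in (1) reduce to weighted sums. The support points, weights, and integration range are invented for the example, and the function names are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float, sd: float = 1.0) -> float:
    """Density of a normal distribution; unit variance is an assumption."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def empirical_bayes_estimate(x_i, support, weights):
    """Expression (1) with a discrete G_k: both integrals become weighted sums."""
    num = sum(w * t * normal_pdf(x_i, t) for t, w in zip(support, weights))
    den = sum(w * normal_pdf(x_i, t) for t, w in zip(support, weights))
    return num / den


def kl_distance(f, f_tilde, lo=-10.0, hi=10.0, steps=20000):
    """Expression (2), approximated by a midpoint Riemann sum on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        fx = f(x)
        if fx > 0.0:
            total += math.log(fx / f_tilde(x)) * fx * dx
    return total


# G_k places equal mass on theta = 0 and theta = 3 (made-up values).
support = [0.0, 3.0]
weights = [0.5, 0.5]
theta_hat = empirical_bayes_estimate(1.5, support, weights)

# KL distance of a density from itself, and from a copy shifted by one unit.
kl_same = kl_distance(lambda x: normal_pdf(x, 0.0), lambda x: normal_pdf(x, 0.0))
kl_shift = kl_distance(lambda x: normal_pdf(x, 0.0), lambda x: normal_pdf(x, 1.0))
```

Because the observation 1.5 lies midway between the two support points, the estimate is their average; the KL distance is zero when the two densities coincide and is approximately 0.5 for two unit-variance normal densities whose means differ by one.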

Binomial Case Imputation

A binomial approach may be used to predict alarms normally triggered by abnormal data values (e.g., drastic weight changes, unhealthy blood pressure, etc.) given missing data. For example, the system may be configured to trigger an alarm if a patient has an extreme change in weight, even when the extreme weight value is missing from the data collected by WANDA. Embodiments of the present disclosure may use naive Bayes, a Bayesian network, and VFI to detect such changes in order to alert caregivers. Naive Bayes and Bayesian network classifiers are algorithms that approach the classification problem using the conditional probabilities of the features. A Bayesian network is a directed acyclic graph (DAG) over a set of variables X, where the outgoing edges of a variable x_i specify all variables that depend on x_i. The probability of an outcome is determined as:

$P (X) = \prod_{x \in X} p (x | par (x))$

where X = {x_1, x_2, ..., x_k} is a set of variables, and par(x) is the set of parents of x in a Bayesian network. The probability of the instance belonging to a single class may be calculated by using the prior probabilities of classes and the feature values for an instance. The naive Bayes method assumes that features are independent and that there are no hidden or latent attributes in the prediction process. As such, the experimental results for naive Bayes and Bayesian network can be slightly different, as

$p (\text{class}) = \frac{1 + N (\text{class})}{N (\text{class}) + N (\text{instances})}$

for naive Bayes and

$p (\text{class}) = \frac{\frac{1}{2} + N (\text{class})}{N (\text{class}) \times \frac{1}{2} + N (\text{instances})}$

for Bayesian network, where N(x) is the number of sets or instances.
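By way of illustration only, the factorization P(X) = Π p(x | par(x)) may be evaluated on a toy two-node network as sketched below. The variables `weight_gain` and `alarm`, the edge between them, and every probability in the tables are invented for the example and are not taken from the study data.

```python
from typing import Dict, Tuple

# Parents of each variable in a toy DAG: weight_gain -> alarm (hypothetical).
parents: Dict[str, Tuple[str, ...]] = {
    "weight_gain": (),
    "alarm": ("weight_gain",),
}

# Conditional probability tables keyed by (value, parent values); numbers are made up.
cpt = {
    "weight_gain": {(True, ()): 0.2, (False, ()): 0.8},
    "alarm": {
        (True, (True,)): 0.9, (False, (True,)): 0.1,
        (True, (False,)): 0.05, (False, (False,)): 0.95,
    },
}


def joint_probability(assignment: Dict[str, bool]) -> float:
    """P(X) as the product, over all variables x, of p(x | par(x))."""
    p = 1.0
    for var, pars in parents.items():
        parent_values = tuple(assignment[q] for q in pars)
        p *= cpt[var][(assignment[var], parent_values)]
    return p


p_both = joint_probability({"weight_gain": True, "alarm": True})

# Summing the joint probability over every assignment should give one.
total = sum(
    joint_probability({"weight_gain": w, "alarm": a})
    for w in (True, False)
    for a in (True, False)
)
```

A naive Bayes classifier corresponds to the special case of such a network in which the class variable is the only parent of every feature.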

VFI is a categorical classification algorithm and, like the Bayes methods, considers each feature independently. The classification of a new instance may be based on a vote among the classifications built by the value of each feature. While training, the VFI algorithm constructs intervals for each feature. For the classification, a single value and the votes of each class in that interval are calculated for each interval. For each class c, feature f gives a vote value:

$\text{feature\_vote} [f,c] = \frac{\text{interval\_class\_count} [f,i,c]}{\text{class\_count} [c]}$

where interval_class_count[f,i,c] is the number of instances of class c that are members of interval i of feature f. The class with the highest total vote is predicted to be the class of the test instance.
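The voting rule above may be sketched as follows. This is a simplified illustration rather than a full VFI implementation: the intervals are defined by fixed cut points supplied by the caller (an assumption; VFI constructs its intervals during training), each feature's votes are normalized so that the feature distributes a total of one vote among the classes, and the training data and function names are invented for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def train(instances, labels, cut_points):
    """Count training instances per (feature, interval, class) and per class."""
    interval_class_count = defaultdict(int)
    class_count = defaultdict(int)
    for features, c in zip(instances, labels):
        class_count[c] += 1
        for f, value in enumerate(features):
            i = bisect_right(cut_points[f], value)    # interval index of this value
            interval_class_count[(f, i, c)] += 1
    return interval_class_count, class_count


def classify(features, interval_class_count, class_count, cut_points):
    """Sum the normalized per-feature votes and return the winning class."""
    totals = {c: 0.0 for c in class_count}
    for f, value in enumerate(features):
        i = bisect_right(cut_points[f], value)
        votes = {
            c: interval_class_count[(f, i, c)] / class_count[c] for c in class_count
        }
        norm = sum(votes.values())
        if norm > 0:                                  # skip features with no evidence
            for c in votes:
                totals[c] += votes[c] / norm
    return max(totals, key=totals.get), totals


# Features: (daily weight change in kg, heart rate in bpm); values are made up.
instances = [(0.2, 70), (0.5, 75), (0.1, 95), (2.5, 98), (3.0, 85)]
labels = ["normal", "normal", "normal", "abnormal", "abnormal"]
cut_points = [[1.0], [90.0]]                          # one cut point per feature

counts, class_totals = train(instances, labels, cut_points)
predicted_class, totals = classify((2.0, 96), counts, class_totals, cut_points)
```

For the test instance, the weight feature votes entirely for the abnormal class and the heart rate feature splits its vote 0.6 to 0.4 in favor of the abnormal class, so the abnormal class receives the highest total vote.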

In the Bayes methods, each feature participates in the classification by assigning a probability to each class, and the final probability of a class is the product of the probabilities measured on each feature. In VFI, each feature distributes its vote among the classes, and the final vote of a class is the sum of the votes given by the features.

Subjects and Datasets

The WANDA system was used for health data collection on 26 different subjects. The population of the participants was approximately 68% male; 40% White, 13% Black, 32% Latino, and 15% Asian/Pacific Islander; with a mean age of approximately 68.7 ± 12.1 years. Study participants were all provided with Bluetooth weight scales, blood pressure monitors, land line gateways, and personal activity monitor devices. Each captured data instance for the study comprises 37 different attributes including, but not limited to: timestamps; weight; diastolic/systolic blood pressure; heart rate; metabolic equivalents (METs); calorie expenditure; and numeric responses to twelve somatic awareness questions. Each data instance was gathered from each subject once a day. One thousand and ninety instances were gathered.

The study used the missing at random (MAR) hypothesis. MAR assumes that missing data is dependent on observed data. Hence, missing data can be predicted by resident data. All 1090 instances of data are complete (i.e., contain all 37 data values). Instances were divided into two groups: training and testing. Values from the testing set predicted by the data imputation techniques were compared to their actual values to evaluate the effectiveness of each system.

Example Results

For non-binomial data, PACE, linear, simple linear, and isotonic regression methods were applied. FIG. 3 is a table showing the correlation coefficient values of each method. The correlation coefficient is a measure of how well a least-squares fit matches the original data. For N given data points (X, Y), the correlation coefficient ρ_{X,Y} is given by equation (5), where COV(X,Y) is the covariance between X and Y, and σ_X and σ_Y are the standard deviations of X and Y. The experimental results show that the PACE regression method works better on average than the other regression methods tested.

$ρ_{X, Y} = \frac{\mathrm{COV} (X, Y)}{σ_{X} \times σ_{Y}}$
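Equation (5) may be computed directly from paired actual and predicted values, as in the following minimal sketch. The function name and the sample values are hypothetical, and population (divide-by-N) covariance and standard deviation are assumed; the choice of divisor cancels in the ratio.

```python
import math
from typing import Sequence


def correlation_coefficient(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (5): COV(X, Y) divided by the product of the standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)


# Made-up questionnaire answers and regression outputs, for illustration only.
actual = [0, 1, 2, 3, 2, 1]
predicted = [0.2, 0.9, 2.1, 2.8, 1.7, 1.2]
rho = correlation_coefficient(actual, predicted)

# An exact linear relationship yields a coefficient of one.
perfect = correlation_coefficient(actual, [2 * a + 1 for a in actual])
```

A coefficient near one indicates that the predicted values track the actual values closely, which is the sense in which the regression methods are compared in FIG. 3.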

After calculating the coefficient and constant variables, the developed algorithm determines missing values using PACE regression (rounding any non-integer value returned by PACE). The accuracies of the obtained values range between 83.2% and 98.5%, as shown in FIG. 4.
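The rounding step may be illustrated as follows. The sketch assumes that accuracy is the fraction of imputed answers that exactly match the actual answers (the disclosure does not define the term further), rounds halves upward, and clamps the result to the 0 to 3 range of the 4-point scale; the clamping, the function names, and the sample values are assumptions made for the example.

```python
import math
from typing import Sequence


def to_likert(value: float, low: int = 0, high: int = 3) -> int:
    """Round half up, then clamp to the 4-point scale (clamping is an assumption)."""
    return max(low, min(high, math.floor(value + 0.5)))


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predictions that exactly match the actual answers."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up regression outputs and the answers they are meant to reproduce.
raw_predictions = [0.3, 1.6, 2.5, 3.4, -0.2, 1.4]
actual_answers = [0, 2, 3, 3, 0, 2]

imputed = [to_likert(v) for v in raw_predictions]
acc = accuracy(imputed, actual_answers)
```

Here `math.floor(value + 0.5)` is used instead of Python's built-in `round`, which rounds halves to the nearest even integer and would map 2.5 to 2.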

The binomial case predicts a potential abnormal vital sign when missing data exist within WANDA’s database. C4.5, random tree, naive Bayes, Bayesian network, VFI, nearest neighbor, PART, DTNB, decision table, and rotation table algorithms were applied and their recall values were compared. For each method, ten-fold cross validation was applied. In ten-fold validation, the original sample is randomly partitioned into ten subsets, and a single subset is held out as testing data while the remaining nine subsets are used as training data. This cross-validation process is then repeated ten times, using a new subset as testing data for each repetition. Recall values are given as:

$\text{recall} = \frac{T_{p}}{T_{p} + F_{n}}$

where T_p is the number of true positives and F_n is the number of false negatives. FIG. 5 is a table that illustrates the experimental result, and shows that naive Bayes, Bayesian network, and VFI have recall values of up to 0.7 for weight, 0.714 for systolic blood pressure, 0.889 for diastolic blood pressure, and 0.906 for heart rate values.
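The recall formula and the ten-fold procedure may be sketched together as follows. The threshold classifier `fit_threshold` is a hypothetical stand-in for the classifiers compared in FIG. 5, the data are synthetic, and pooling the predictions of all ten folds before computing a single recall value is an assumption (averaging per-fold recall values is an equally plausible reading).

```python
import random
from typing import List, Sequence


def recall(actual: Sequence[bool], predicted: Sequence[bool]) -> float:
    """True positives divided by true positives plus false negatives."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    return tp / (tp + fn) if (tp + fn) else 0.0


def ten_fold_indices(n: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition the indices 0..n-1 into ten disjoint subsets."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[k::10] for k in range(10)]


def fit_threshold(xs, ys):
    """Stand-in classifier: cut halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: x > cut


def cross_validated_recall(features, labels, fit) -> float:
    """Hold out each subset once, train on the other nine, pool the predictions."""
    actual, predicted = [], []
    for test_idx in ten_fold_indices(len(labels)):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            actual.append(labels[i])
            predicted.append(model(features[i]))
    return recall(actual, predicted)


# Synthetic data: an alarm is warranted when the weight change exceeds 2.0.
weight_change = [0.1 * k for k in range(40)]
alarm = [w > 2.0 for w in weight_change]

small_example = recall([True, True, True, False], [True, False, True, True])
folds = ten_fold_indices(len(alarm))
cv_recall = cross_validated_recall(weight_change, alarm, fit_threshold)
```

In the small example, two of the three actual positives are detected, giving a recall of two thirds; the false positive on the fourth instance does not affect recall.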

Classifiers were trained in two ways. First, unique classifiers were created for each individual, where only data collected from an individual was used to predict values for the same individual. Second, a grouped classifier was created using data from the entire population. Both the individual and grouped classifiers were compared using ten-fold validation to test data from 16 patients. The recall values of weight, blood pressure, and heart rate are improved when training on the entire group’s data as compared with training on each individual’s data separately. FIG. 6 is a table that illustrates the recall values. For questionnaire data, the accuracies of results were also better when training on all patients’ data. When training individually, 75% of patients’ data showed 0% accuracy. This is because the entire group has a larger amount of data and many individuals share similarities in monitored attributes, such as age, symptoms of CHF, etc.

The accuracy of CHF missing data imputation was enhanced using the PACE regression method for predicting and imputing non-binomial data, and Bayesian methods and voting feature interval for binomial data. The experimental results show that PACE regression works better than linear regression, simple linear regression, and isotonic regression methods, with accuracy values of more than 83.2%. The experiment comparing Bayes and VFI methods with other algorithms shows that Bayes and VFI algorithms work better (FIG. 5), with recall values of up to 0.7 for weight, 0.714 for systolic blood pressure, 0.889 for diastolic blood pressure, and 0.906 for heart rate values. This study also showed that increased accuracy is obtained by training on a large population as opposed to training the classifiers for each individual independently.

While a preferred embodiment of the invention has been illustrated and described, it will be appreciated that various changes can be made therein without departing from the spirit and scope of the invention.

Claims

1. A system configured to impute missing patient data for health care monitoring.