EFFICIENT REAL-TIME DATA QUALITY ANALYSIS
Embodiments of the invention are directed to a computer-implemented method for efficiently assessing data quality metrics. A non-limiting example of the computer-implemented method includes receiving, using a processor, a plurality of updates to data points in a data stream. The processor is further used to provide a plurality of data quality metrics (DQMs), and to maintain information on how much the plurality of DQMs are changing over time. The processor also maintains information on computational overhead for the plurality of DQMs, and also updates data quality information based on the maintained information.
The present invention relates generally to programmable computers, and more specifically to programmable computers, computer-implemented methods, and computer program products that implement new data quality metrics and related data quality analysis techniques. In accordance with aspects of the invention, the new data quality metrics and related data quality analysis techniques are configured to efficiently allocate and utilize the computing resources required to perform incremental data quality analysis on data sets having new or updated data that changes over time.
In computer processor applications, the phrase “big data” refers to extremely large data sets that can be analyzed computationally to reveal patterns, trends, and associations, especially those relating to human behavior and interactions. Big data can be leveraged by sophisticated computational analysis systems such as machine learning (ML) algorithms, artificial intelligence (AI) algorithms, deep learning algorithms, internet of things (IoT) systems, and the like. Because the demand for ML experts has outpaced the supply, user-friendly automated ML/AI computer systems have been developed. Known automated AI/ML systems can automate a variety of AI/ML development tasks such as model selection, hyper-parameter optimization, automated stacking (ensembles), neural architecture searching, pipeline optimization, and feature engineering.
The performance of automated ML/AI systems depends heavily on how well the system's training data has been qualitatively cleaned and analyzed to make the data suitable for effective consumption by ML models. If undetected, poor data quality can cause large deviations in the analysis performed by sophisticated AI/ML systems, thereby generating inaccurate and misleading results. Because the process of identifying and addressing data quality issues can be labor intensive, a scalable, automated, and interactive data quality tool/system known as the Data Quality Advisor system or the DQLearn system has been developed by IBM®. For ease of description, the term data quality analysis (DQA) system is used herein to refer, collectively, to the Data Quality Advisor (or DQLearn) system, as well as to other known data quality analysis systems having features similar to those of the DQA/DQLearn system. The framework of the DQA system performs a variety of data quality analysis tasks including automatically generating dynamic executable graphs for performing data validations fine-tuned for a given dataset; building a library of validation checks common to many applications; and applying different tools to address the persistence and usability issues that make data cleaning a laborious task for data scientists.
DQA systems perform data quality checks on data sets that are constantly being streamed through a data quality analysis pipeline. The data quality checks involve measuring and/or analyzing different features or characteristics of the data sets to generate data quality metrics (DQMs) that provide a user with feedback on data quality. For example, a known data quality check is to identify/measure NULL values in a data set, and a known corresponding DQM is the percentage of NULL values in the data set. In many instances, the data-under-analysis includes data that has been previously checked for data quality, along with data that has not been previously checked for data quality (i.e., new/updated data).
DQA systems continuously update DQM measurements as new/updated data is received, and recalculating these DQMs can consume a significant amount of computational resources. To address this problem, known DQA systems can compute DQMs incrementally, which means that, instead of taking the computationally expensive approach of re-computing DQMs on the entire initial data set each time the data set comes up for a data quality evaluation, DQMs are computed “incrementally” by computing the DQMs for the new/updated data and combining them with DQM information that was previously computed for the unchanged portion(s) of the initial data set.
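As a concrete illustration, the following is a minimal sketch, in Python with pandas, of one common DQM (the percentage of NULL values) computed incrementally: counts for previously checked data are retained, and each new/updated batch only requires counting its own values. The class and method names are hypothetical and are not part of the original disclosure.

    import pandas as pd

    class IncrementalNullDQM:
        # Running percentage of NULL values, updated incrementally.

        def __init__(self):
            self.null_count = 0   # NULLs counted in previously checked data
            self.total_count = 0  # data points counted in previously checked data

        def update(self, new_batch: pd.DataFrame) -> float:
            # Compute counts for the new/updated data only...
            self.null_count += int(new_batch.isna().sum().sum())
            self.total_count += int(new_batch.size)
            # ...and combine them with the previously computed information.
            if self.total_count == 0:
                return 0.0
            return 100.0 * self.null_count / self.total_count

Under these assumptions, a user would call update once per incoming batch and receive the DQM for the data set as a whole without rescanning previously checked data.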
Data quality analysis techniques that incrementally compute DQMs for new/updated data provide a level of computing resource efficiency over data quality analysis techniques that do not compute DQMs incrementally for new/updated data. However, in known data quality analysis techniques, the DQMs themselves are static and do not take into account the fact that, when viewed over selected time windows, the new/updated data from which the DQMs are derived is non-static and constantly changing. Accordingly, known data quality analysis techniques that incrementally determine DQMs for new/updated data still lack efficiency in their allocation and use of computing resources because known data quality analysis techniques are not well matched to new/updated data that is constantly changing over time.
SUMMARY
Embodiments of the invention are directed to a computer-implemented method for efficiently assessing data quality metrics. A non-limiting example of the computer-implemented method includes receiving, using a processor, a plurality of updates to data points in a data stream. The processor is further used to provide a plurality of data quality metrics (DQMs), and to maintain information on how much the plurality of DQMs are changing over time. The processor also maintains information on computational overhead for the plurality of DQMs, and also updates data quality information based on the maintained information.
The above-described computer-implemented method provides improvements over known methods of assessing data quality by maintaining information on computational overhead for its DQMs computed over time, and also by updating data quality information based on the maintained information. Accordingly, the above-described computer-implemented method efficiently allocates computing resources because it computes DQMs and allocates DQM computational resources in a manner that is well matched to incoming data points that are constantly changing over time.
The above-described computer-implemented method can further include receiving at least one importance value; assigning the at least one importance value to one of the plurality of DQMs; assigning a rating to each of the plurality of DQMs based on an overhead for computing the data quality metric and a rate of change of the data quality metric as new data points of the data stream are received; and incorporating the at least one importance value assigned to the one of the plurality of DQMs into the rating.
The above-described computer-implemented method provides improvements over known methods of assessing data quality by taking into account an importance value of the DQMs when maintaining information on computational overhead for its DQMs computed over time. Accordingly, the above-described computer-implemented method efficiently allocates computing resources because it computes DQMs and allocates DQM computational resources in a manner that is well matched to incoming data points that are constantly changing over time.
The above-described computer-implemented method can further include specifying a maximum time (tmax) for updating the plurality of DQMs; and ceasing to update the plurality of DQMs after tmax has been exceeded.
The above-described computer-implemented method provides improvements over known methods of assessing data quality by taking into account a maximum time for updating DQMs and ceasing the updates when the maximum time has been reached. Accordingly, the above-described computer-implemented method efficiently allocates computing resources because it computes DQMs and allocates DQM computational resources in a manner that limits the time that can be devoted to DQM updates, and in a manner that is well matched to incoming data points that are constantly changing over time.
The above-described computer-implemented method can further include analyzing instances of a DQM computed over a plurality of time intervals; determining that at least one of the plurality of time intervals has an anomalous value for an instance of the DQM computed during the at least one of the plurality of time intervals; computing a weighted aggregate value of an instance of the DQM across the plurality of time intervals; and assigning lower values of the weighted aggregate value to instances of the DQM in time intervals of the plurality of time intervals with the anomalous value.
The above-described computer-implemented method provides improvements over known methods of assessing data quality by identifying anomalous data points over multiple time intervals and assigning a lower weight to data points that are identified as anomalous. Accordingly, the above-described computer-implemented method efficiently allocates computing resources because it identifies anomalous behavior among the data points in a manner that is well matched to incoming data points that are constantly changing over time.
Embodiments of the invention are also directed to computer systems and computer program products having substantially the same features and functionality of the above-described computer-implemented methods.
Additional features and advantages are realized through the techniques described herein. Other embodiments and aspects are described in detail herein. For a better understanding, refer to the description and to the drawings.
The subject matter which is regarded as embodiments is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
In the accompanying figures and following detailed description of the disclosed embodiments, the various elements illustrated in the figures are provided with three digit reference numbers, where appropriate. The leftmost digit of each reference number corresponds to the figure in which its element is first illustrated.
DETAILED DESCRIPTION
For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.
Many of the functional units described in this specification have been labeled as modules. Embodiments of the present invention apply to a wide variety of module implementations. For example, a module can be implemented as a hardware circuit configured to include custom VLSI circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module can also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like. A module can also be implemented in software for execution by various types of processors. An identified module of executable code can, for instance, include one or more physical or logical blocks of computer instructions which can, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together but can include disparate instructions stored in different locations which, when joined logically together, include the module and achieve the stated purpose for the module.
As previously noted herein, configuring data quality analysis systems to incorporate known techniques for incrementally computing DQMs for new/updated data provides greater computing resource efficiency over data quality analysis system that do not compute DQMs incrementally for new/updated data. However, in known techniques for computing DQMs incrementally, the DQMs themselves are static and do not take into account the fact that, when viewed over selected time windows, the new/updated data from which the DQM is derived is non-static and constantly changing. Accordingly, known techniques for incrementally determining DQMs for new/updated data still lack efficiency in their allocation and use of computing resources because known techniques for incrementally determining DQMs are not well matched to new/updated data that is constantly changing over time.
Embodiments of the invention address and overcome the shortcomings of known incremental DQM computation techniques by providing computer-implemented methods, programmable computing systems, and computer program products configured and arranged to efficiently allocate and utilize the computational resources that are required to perform data quality analysis techniques that compute DQMs incrementally for new/updated data. More specifically, embodiments of the invention improve the efficiency of computational resource allocation and use by providing DQMs and related data quality analysis techniques that are well matched to new/updated data that is constantly changing over time because the DQMs and related data quality analysis techniques in accordance with aspects of the invention are both incremental and non-static. More specifically, the DQMs and related data quality analysis techniques in accordance with aspects of the invention are non-static in that they take into account the fact that, when viewed over selected time windows, the new/updated data from which the DQMs and related data quality analysis techniques are derived is non-static and constantly changing. Accordingly, DQA systems having incremental and non-static data quality analysis features in accordance with aspects of the invention improve the efficiency of computer resource allocation and use over DQA systems that only provide incremental data quality analysis features.
In aspects of the invention, a DQA system having incremental and non-static data quality analysis features in accordance with embodiments of the invention is configured to include a real-time DQA module having multiple computer-implemented sub-modules that perform real-time data quality analysis of changing data by defining multiple incremental DQMs that are changing over time. The multiple sub-modules in the real-time DQA module include a sub-module for computing DQMs for changing data; a sub-module for computing DQMs for constraints that change over time; a sub-module for computing DQMs for anomalous data regions; a sub-module for performing general case incremental computations of DQMs; a sub-module for managing state for incremental DQM computations; and a sub-module for efficiently prioritizing DQM computations.
In embodiments of the invention, the sub-module for computing DQMs for changing data is configured to track changes in data quality that occur over time while also detecting changes in DQMs over multiple time windows. For example, where the DQM is a measure of missing values in a data set (e.g., a data-frame in the Python programming language), the detected changes in the DQM can include missing values that occur over an entire predetermined time interval (or time window); missing values that occurred most recently; and a weighting applied to missing values such that missing values from more recent data points are weighted more heavily. As another example, a DQM d1 represents any of a number of possible data quality checks, including checks for missing data, low-variance variables, averages, standard deviations, medians, constant values, non-repeating values, repeating values, most occurring values, duplicate values across columns, duplicate rows, and the like. In embodiments of the invention, these data quality checks can not only be applied monolithically across all of the incoming new/updated data values but can also be applied to certain time intervals or time windows of the new/updated data. Accordingly, in embodiments of the invention, d1 results from data quality checks defined or applied across specific windows of a data set. In some embodiments of the invention, d1 can be applied during a time interval/window that extends from a start_time to an end_time. In some embodiments of the invention, d1 can be applied to each batch of new/updated data to arrive at a per-batch value, d1(batch1) through d1(batchN), and these values can be plotted on the y-axis against time values on the x-axis. In some embodiments of the invention, data points falling in different time intervals/windows can be given different weights for calculating DQMs. In some embodiments of the invention, the more recent data points (e.g., within a predetermined time interval/window defined as a recent time interval/window) can be assigned higher weights than less recent data points (e.g., within a predetermined time interval/window defined as a less recent time interval/window) for assessing DQMs. In embodiments of the invention, a number of suitable weighting techniques can be used, including but not limited to exponential weighting (including, but not limited to, exponentially weighted moving averages). In some embodiments of the invention, each data point can be assigned a different weight based on its time. In some embodiments of the invention, data points are grouped by time intervals/windows and the same weight can be assigned to a set of data points belonging to the same group. In some embodiments of the invention, older data points can be ignored entirely in the DQM computations. In some embodiments of the invention, a wide variety of known data quality analysis algorithms can be applied to assist with determining the older data points that will be ignored in the DQM computations. Accordingly, as described above, embodiments of the invention provide new and non-static DQMs that are parameterized by time. The DQA system operating in accordance with aspects of the invention is configured and arranged to compute and visualize the new and non-static DQMs over any range of data points.
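A minimal sketch of one such time-weighted DQM (missing values with exponential weighting, so that more recent batches weigh more heavily) follows. The decay parameter and the (timestamp, DataFrame) batch representation are assumptions for illustration and are not prescribed by the disclosure.

    import pandas as pd

    def weighted_missing_value_dqm(batches, decay=0.5):
        # batches: list of (timestamp, DataFrame) pairs ordered oldest to
        # newest; batch i receives weight decay**(n - 1 - i), so the most
        # recent batch has weight 1 and older batches decay geometrically.
        n = len(batches)
        weighted_nulls = 0.0
        weighted_total = 0.0
        for i, (_, df) in enumerate(batches):
            w = decay ** (n - 1 - i)
            weighted_nulls += w * df.isna().sum().sum()
            weighted_total += w * df.size
        return 100.0 * weighted_nulls / weighted_total if weighted_total else 0.0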
In embodiments of the invention, the sub-module for computing DQMs for changing constraints is configured to compute DQMs that measure how well data points satisfy a particular constraint of the DQA system. Because such “constraint” (or constrained) DQMs can change over time, instead of computing one constraint DQM for an entire data set, embodiments of the invention compute a constraint DQM for new/updated data points as they come into the DQA pipeline. In embodiments of the invention, analyzing constraint DQMs that change over time enables more complete data quality analyses of data sets. Examples of constraints include null values not exceeding a threshold; statistics (e.g., averages, standard deviations, variances, medians, and the like) falling within certain ranges; two columns of a data-frame having a certain mathematical relationship or correlation; and the like. Embodiments of the invention recognize and leverage the observation that, with real-time streaming data, the applicability of a given constraint is not a static, fixed property. A constraint that is applicable at one time may or may not be applicable for new data that are being received. Accordingly, embodiments of the invention provide analysis of constraints across multiple different time intervals/windows, and the applicability of a constraint or set of constraints in accordance with aspects of the invention is thus dynamic and expected to vary over time. In some embodiments of the invention, if a constraint DQM is changing significantly (e.g., change levels that exceed a predetermined threshold) over time, information related to the changing constraint DQM can be propagated to a user of the DQA system.
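As an illustration, the sketch below evaluates one example constraint (the proportion of null values not exceeding a threshold) per time interval/window rather than once for the whole data set, so that changes in the constraint's applicability over time are visible. The threshold value and the (timestamp, DataFrame) batch representation are assumptions.

    def constraint_satisfaction_over_windows(batches, max_null_fraction=0.05):
        # Returns a list of (window_index, satisfied, null_fraction) tuples,
        # one per time interval/window, so that a constraint whose
        # applicability changes over time can be reported to the user.
        results = []
        for i, (_, df) in enumerate(batches):
            fraction = df.isna().sum().sum() / df.size if df.size else 0.0
            results.append((i, fraction <= max_null_fraction, fraction))
        return results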
In embodiments of the invention, the sub-module for computing DQMs for anomalous data regions is configured to compute DQMs that measure whether data points satisfy criteria for qualifying as “anomalous.” In embodiments of the invention, a weight can be assigned to DQMs computed for data points identified as anomalous. In some embodiments of the invention, the weight can be less than one (1) but more than zero (0) based on a computed severity level of the anomaly (e.g., mildly anomalous data can be weighted within about 20% of one (1)). In some embodiments of the invention, anomalous data having a computed severity level over a threshold can be identified as highly anomalous and assigned a weight of zero (0) (i.e., the anomalous data point is ignored when computing DQMs for anomalous data regions).
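The following is a minimal sketch of the severity-based weighting described above. The linear severity-to-weight mapping, the assumed [0, 1] severity scale, and the threshold value are assumptions for illustration; the disclosure does not prescribe a specific mapping.

    def anomaly_weight(severity, severe_threshold=0.8):
        # Severities above the threshold are treated as highly anomalous and
        # ignored (weight 0); milder severities are down-weighted linearly,
        # so mildly anomalous data keeps a weight near one.
        if severity >= severe_threshold:
            return 0.0
        return max(0.0, 1.0 - severity)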
As an example, in embodiments of the invention where the DQA system uses the Python programming language, a “related” function can be defined as related(feature1, feature2, start_time, end_time). This function returns a value between one (1) and negative one (−1), which indicates a level of correlation between feature1 and feature2 for the time interval/window specified. In embodiments of the invention, the DQA system is configured to maintain related values over several different time intervals. In embodiments of the invention, time intervals/windows are flagged where the related function indicates anomalous behavior in the time interval/window. An example of anomalous behavior for the related function would be the value of related(feature1, feature2, start_time, end_time) changing to values not previously detected by the DQA system. A time interval/window where anomalous data points are detected is flagged as ti and, in accordance with aspects of the invention, data points from the anomalous time interval/window ti can be left out of the DQM computations for ti, or data points from the anomalous time interval/window ti can be assigned a lower weight in computing the DQMs for ti. The DQM can also inform users about anomalous time intervals. Users can provide the DQA system with user-selected preferences about how to treat anomalous time intervals, such as assigning weights to them for computing DQMs.
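A minimal sketch of the related function appears below. The added df parameter (a pandas data-frame with a DatetimeIndex holding both features) and the use of Pearson correlation are assumptions for illustration; the disclosure does not prescribe a particular correlation measure.

    import pandas as pd

    def related(df, feature1, feature2, start_time, end_time):
        # Correlation between feature1 and feature2 over the specified time
        # interval/window; returns a value between -1 and 1.
        window = df.loc[start_time:end_time]
        return window[feature1].corr(window[feature2])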
As another example, in embodiments of the invention, a “missing value” function can be defined in the Python programming language as check_na_columns(df, start_time, end_time), which checks columns of parameter df for missing values over specific time ranges. In accordance with aspects of the invention, this function can be used to detect the parts of the data that should be assigned higher weights in calculating DQMs. For example, if check_na_columns indicates an abnormal proportion of NaN (not a number) values, it may be appropriate to assign less weight to the time interval/window in computing DQMs.
It should be noted that a higher proportion of NaN values does not necessarily mean that a time interval should be assigned a lower weight. In many cases, the DQA system is configured to search for an anomalous number of NaN values, which could indicate an abnormally high or low number of NaN values. If a particular data interval has an unusually low proportion of NaN values compared to other intervals, the DQA system can be configured to interpret this as an indication of an anomaly, which would mean that the interval should be assigned a lower weight than other intervals with a proportion of NaN values that is closer to the mean.
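A minimal sketch of check_na_columns, together with one possible test for the “unusually high or unusually low” behavior just described, is shown below. It assumes df has a DatetimeIndex; the z-score rule and its threshold are assumptions for illustration, not the disclosed method.

    import statistics
    import pandas as pd

    def check_na_columns(df, start_time, end_time):
        # Per-column proportion of missing values over a specific time range.
        window = df.loc[start_time:end_time]
        return window.isna().mean()

    def na_proportion_is_anomalous(current, history, z_threshold=3.0):
        # Flags proportions that are unusually HIGH or unusually LOW relative
        # to other intervals, matching the discussion above.
        if len(history) < 2:
            return False
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return current != mean
        return abs(current - mean) / stdev > z_threshold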
In embodiments of the invention, the sub-module for performing incremental computations of DQMs provides a general case approach to performing incremental computations of DQMs. It is a challenge to make data quality checks incremental in the general case. Embodiments of the invention address and overcome this challenge by providing three types of general case incremental data quality checks, which are defined and identified herein as Type I, Type II, and weighted Type I&II. In embodiments of the invention, Type I data quality checks are the data quality checks having corresponding DQMs that can be made incremental in the general case through a decomposition and summation process. For example, where a DQM is defined as q, data coming into the DQA pipeline is organized as data chunks represented as Dw={D1, D2, . . . , Dn} ordered over time, and the quality function is defined as Q(D). The general case incremental data quality check in accordance with aspects of the invention provides a mechanism to decompose the quality function Q by applying Q to each data chunk coming into the DQA pipeline and storing Q(Di) for each data chunk in the data set. When a data chunk Dn+1 of the data set D comes into the DQA pipeline, and the data chunk Dn+1 has new/updated data, Q(D) for the entire data set D is obtained incrementally by applying Q to the data chunk Dn+1 having new/updated data and combining that result with a summation of the historical results of applying Q to the data chunks {D1, D2, . . . , Dn} in the data set D that have not changed. The summation of the historical Q results is represented by Equation (1): Q(D)=Q(Dn+1)+Q(D1)+Q(D2)+ . . . +Q(Dn).
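A minimal sketch of a Type I incremental check appears below. The chunk representation (identifier, data, changed flag) and the cache structure are assumptions for illustration.

    def type1_incremental(q, chunks, cache):
        # q: per-chunk quality function; chunks: iterable of
        # (chunk_id, data, changed) triples; cache: dict mapping chunk ids to
        # previously computed q values. Only new/updated chunks are
        # recomputed; historical results are reused and summed (Equation (1)).
        total = 0
        for chunk_id, data, changed in chunks:
            if changed or chunk_id not in cache:
                cache[chunk_id] = q(data)
            total += cache[chunk_id]
        return total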
In embodiments of the invention, Type II data quality checks are the data quality checks having corresponding DQMs that can be made incremental in the general case through the decomposition/summation process used in Type I that has been modified to take into account a decomposable operation used in the associated DQM computation. For example, where the quality function Q of the DQM computation is a percentage of NULL values in a dataset D, the percentage computation Q can be decomposed into two functions P and C, where C counts the total data points in the dataset D, and where P counts the NULL values in the data set D. Accordingly, Q(D) can be computed as P(D) divided by C(D). In accordance with embodiments of the invention, data coming into the DQA pipeline is organized as data chunks represented as Dw={D1, D2, . . . , Dn} ordered over time, and the quality function is defined as Q(D). The general case incremental data quality check in accordance with aspects of the invention provides a mechanism to decompose the quality function Q by applying P to each data chunk coming into the DQA pipeline; applying C to each data chunk coming into the DQA pipeline; and storing P(Di) and C(Di) for each data chunk in the data set. When a data chunk Dn+1 of the data set D comes into the DQA pipeline, and the data chunk Dn+1 has new/updated data, Q(D) for the entire data set D is obtained incrementally by applying P to the data chunk Dn+1 having new/updated data; applying C to the data chunk Dn+1; and combining those results with a summation of the historical results of applying P and C to the data chunks {D1, D2, . . . , Dn} in the data set D that have not changed. The summation of the historical P and C results is represented by Equation (3): Q(D)=(P(Dn+1)+P(D1)+ . . . +P(Dn))/(C(Dn+1)+C(D1)+ . . . +C(Dn)).
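A minimal sketch of a Type II incremental check for the NULL-percentage example appears below, caching P and C per chunk; the chunk representation is the same assumed structure as in the Type I sketch.

    def type2_incremental_null_percentage(chunks, cache):
        # Q = P/C, where P counts NULL values and C counts total data points.
        # Both are cached per chunk so only new/updated chunks are rescanned;
        # the cached sums implement Equation (3).
        p_sum = 0
        c_sum = 0
        for chunk_id, df, changed in chunks:
            if changed or chunk_id not in cache:
                cache[chunk_id] = (int(df.isna().sum().sum()), int(df.size))
            p, c = cache[chunk_id]
            p_sum += p
            c_sum += c
        return 100.0 * p_sum / c_sum if c_sum else 0.0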
In embodiments of the invention, weighted Type I&II data quality checks are data quality checks where data chunks are weighted under the Type I and Type II incremental computation scenarios. In some embodiments of the invention, different weights can be applied to different data chunks based on any of the standards for applying weights to data checks previously described herein in connection with aspects of the invention. In some aspects of the invention, a weight w can be selected based on the time elapsed from a last timestamp of a given data chunk to a current time. In some aspects of the invention, uniform weights can be applied to previous data chunks in a dataset. An example of how weighting can be incorporated into Type I data quality checks in accordance with aspects of the invention is represented by Equation (5): Q(D)=w(n+1)Q(Dn+1)+w(1)Q(D1)+ . . . +w(n)Q(Dn), where each weight w(i) is selected according to one of the weighting standards described herein.
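A minimal sketch of a weighted Type I check appears below. The exponential decay based on each chunk's last timestamp and the half_life parameter are assumptions for illustration; as noted above, uniform weights are another option.

    import time

    def weighted_type1_incremental(q, chunks, cache, half_life=3600.0):
        # Weighted Type I (Equation (5)): each chunk's cached q value is
        # scaled by a weight derived from the time elapsed since the chunk's
        # last timestamp (half_life in seconds).
        now = time.time()
        total = 0.0
        for chunk_id, data, changed, last_timestamp in chunks:
            if changed or chunk_id not in cache:
                cache[chunk_id] = q(data)
            weight = 0.5 ** ((now - last_timestamp) / half_life)
            total += weight * cache[chunk_id]
        return total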
In embodiments of the invention, the sub-module for managing state for incremental DQM computations provides a mechanism for maintaining state, which is necessary for making incremental DQM computations because incremental computations require a way of keeping track of the previous computations that will be used in the incremental computations. Embodiments of the invention avoid the need to have the DQA system maintain its own database of historical data quality check information by providing the above-described state management sub-module configured and arranged to interface with user systems in a manner that involves exchanging state variable information back and forth between a user system or program and the DQA system. In embodiments of the invention, the DQA system can be implemented in a programming language, such as the Python programming language, and a Python application program interface (API) of the state management sub-module is configured to enable a user program to use the Python API to call various functions related to state management. A wide variety of other programming languages (e.g., Java, C++, C, and others) can also be used for implementing the DQA system. The state management sub-module computes DQMs and passes state information related thereto to the user program. In order to make use of the state information received from the state management sub-module, the user program calls functions of the state management sub-module that are configured and arranged to actually analyze the state information. Accordingly, the state management sub-module in accordance with aspects of the invention does not require that a user program have the capability of interpreting the format of the state variables. The user program need only use the programming language API (e.g., Python for a Python implementation of the DQA system) to invoke the state management sub-module of the DQA system.
In embodiments of the invention, the sub-module for efficiently prioritizing DQM computations is configured and arranged to efficiently manage the computational resources of the DQA system that are utilized to execute the various non-static and dynamic data quality checks and DQMs described herein. In embodiments of the invention, computer-implemented methods prioritize the execution of DQM computations based on a variety of factors including but not limited to the computational overhead required to compute a DQM; the rate of change for a given DQM; whether previously computed DQMs can be used without a loss in accuracy that exceeds a threshold; and constraint priorities set by a user. In some embodiments of the invention, machine learning models and/or general simulation algorithms can be utilized to predict the impact of DQMs on computational resources, and the DQA system can prioritize execution of DQMs based on the predictions. Example predictions include but are not limited to a prediction of how much a DQM would be expected to change in response to new/updated data; and, for a given data set and data set size, a prediction of the computational overheads for different DQMs performed on the given data set and data size.
Turning now to a more detailed description of embodiments of the invention, the accompanying figures depict aspects of a DQA system 100 configured in accordance with embodiments of the invention.
The DQA system 100 can be configured to include a validator 110, a remediator 120, a set of constraints 130, a DQA pipeline 140, and an update module 150, configured and arranged as shown. The validator 110 is configured to perform multiple types of data quality checks (pre-defined or customized) on multiple types of data. Examples of the types of data quality checks that can be performed by the validator 110 are listed in the accompanying figures.
The time series data 300 shown in the accompanying figures is an example of time series data that can be analyzed by the DQA system 100.
The DataFrame 400 shown in the accompanying figures is an example of tabular data that can be analyzed by the DQA system 100.
The remediator 120 includes the various types of logic that can be called and used to correct or remediate the data quality issues identified by the validator 110. The constraints 130 are customized rules that a user can select and have applied to customize the way data quality checks performed at the validator(s) 110 interpret the different statistical values obtained from the data. The pipeline 140 uses the validator(s) 110, the remediator 120, and the constraints 130 to automate their data quality analysis operations on a single data set.
The update module 150 is configured to update different DQMs after iterations of the data quality analysis processes performed by the DQA system 100. In accordance with embodiments of the invention, the update module 150 includes a real-time data quality analysis (DQA) module 160 configured to implement novel data quality analysis processes in accordance with aspects of the invention. In aspects of the invention, the real-time DQA module 160 is configured to perform real-time data quality analysis of changing data by defining multiple non-static and dynamic data quality metrics which are changing over time. Specific details of how the real-time DQA module 160 can be implemented are depicted in the accompanying figures and described in detail herein.
Workflow #2 is also represented diagrammatically in the accompanying figures.
In an example implementation of the sub-module 740, a dataset ds1 received between times t2 and t3 is determined by the sub-module 740 to have a higher or lower proportion of null values than a dataset ds2 received previously between times t1 and t2. Either the higher or lower null value proportion can be considered anomalous depending on the patterns that underlie the determination that a dataset is anomalous. A pattern of low null value proportions followed by a high null value proportion can result in the high null value proportion being flagged as anomalous. A pattern of high null value proportions followed by a low null value proportion can result in the low null value proportion being flagged as anomalous. Even where a pattern of low null value proportions is followed by another low null value proportion, the most recent null value proportion can still be flagged as anomalous if it satisfies another standard for being considered anomalous. The sub-module 740 can either assign a lower weight (i.e., between zero (0) and one (1)) to ds1 or even ignore ds1 in computing the DQMs 710 associated with ds1 because of an anomalous shift in the null values associated with ds1 and ds2 over time (from t1 to t3).
At block 1002, the sub-module 740 assigns weights to different data regions identified in block 1001. In some embodiments of the invention, block 1002 can assign lower weights to anomalous regions.
At block 1003, the sub-module 740 calculates one or more aggregate data quality metrics using the weights computed at block 1002.
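A minimal sketch of blocks 1002 and 1003 follows: given per-region DQM values and the weights assigned in block 1002 (lower for regions identified as anomalous in block 1001), the aggregate DQM of block 1003 can be computed as a weighted average. The weighted-average formulation is an assumption for illustration.

    def weighted_aggregate_dqm(region_metrics, region_weights):
        # Combine per-region DQM values using the weights assigned in
        # block 1002 into the single aggregate DQM of block 1003.
        total_weight = sum(region_weights)
        if total_weight == 0:
            return 0.0
        return sum(m * w for m, w in zip(region_metrics, region_weights)) / total_weight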
The DQA system 100 in which the sub-module 740 operates can handle different modalities of data. For example, one modality could be time series data (e.g., the time series data 300), and another modality could be tabular data (e.g., the DataFrame 400). In embodiments of the invention, the DQA system 100 can provide different DQMs for each type of modality it has been configured to process.
In embodiments of the invention, the non-static and dynamic DQMs can be tailored to the type of analytics tasks associated with the data-under-analysis. For example, classification tasks can have certain DQMs associated with them, while regression or clustering tasks can have other DQMs associated with them. More generally, a user (e.g., user 502 shown in the accompanying figures) may be performing a specific analytics task, perhaps involving some combination of regression, classification, and clustering, and the DQA system 100 can provide specific DQMs suited to such a task.
Returning to decision block 1308, if the answer to the inquiry at decision block 1308 is yes, the method 1300 proceeds to block 1314, applies the data quality function Q to the new/updated current data chunk, and proceeds to decision block 1316. At decision block 1316, an inquiry is made as to whether or not the current data chunk is the last data chunk in the data set D. If the answer to the inquiry at decision block 1316 is no, the method 1300 returns to block 1306 to process the next data chunk in the data set D. If the answer to the inquiry at decision block 1316 is yes, the method 1300 proceeds to block 1318 and computes the incremental Q(D), which is equal to Q(Previously Computed Data Chunks of D) plus Q (New/Updated Data Chunk(s)). From block 1318, the method 1300 returns to block 1304 to process the next data set.
In accordance with aspects of the invention, block 1318 incrementally applies Q to the entire data set D by applying Q to the data chunk(s) having new/updated data and combining that result with a summation of the historical results of applying Q to the data chunks {D1, D2, . . . , Dn} in the data set D that have not changed. The summation of the historical Q results is represented by Equation (1) described previously herein.
Returning to decision block 1408, if the answer to the inquiry at decision block 1408 is yes, the method 1400 proceeds to block 1414, applies the sub-functions P and C to the new/updated current data chunk, and proceeds to decision block 1416. At decision block 1416, an inquiry is made as to whether or not the current data chunk is the last data chunk in the data set D. If the answer to the inquiry at decision block 1416 is no, the method 1400 returns to block 1406 to process the next data chunk in the data set D. If the answer to the inquiry at decision block 1416 is yes, the method 1400 proceeds to block 1418 and computes the incremental Q(D), which is equal to (ΣP(Previously Computed Data Chunks of D) plus ΣP(New/Updated Data Chunk(s) of D)) divided by (ΣC(Previously Computed Data Chunks of D) plus ΣC(New/Updated Data Chunk(s) of D)). From block 1418, the method 1400 returns to block 1404 to process the next data set.
In accordance with aspects of the invention, block 1418 incrementally applies Q to the entire data set D by using the Type I (method 1300) decomposition/summation process modified to take into account a decomposable operation used in the associated DQM computation. In the example method 1400, where the quality function Q of the DQM computation is a percentage of NULL values in a dataset D, the percentage computation Q can be decomposed into two functions P and C, where C counts the total data points in the dataset D, and where P counts the NULL values in the data set D. Accordingly, Q(D) can be computed as P(D) divided by C(D). The general case incremental data quality check in the method 1400 in accordance with aspects of the invention provides a mechanism to decompose the quality function Q by applying P to each data chunk coming into the DQA pipeline; applying C to each data chunk coming into the DQA pipeline; and storing P(Di) and C(Di) for each data chunk in the data set. Q(D) for the entire data set D is obtained incrementally by applying P to the data chunk having new/updated data; applying C to the data chunk having new/updated data; and combining those results with a summation of the historical results of applying P and C to the data chunks {D1, D2, . . . , Dn} in the data set D that have not changed. The summation of the historical P and C results is represented by Equation (3) described previously herein.
In a specific example where the DQA system 100 is implemented in the Python programming language, data regions are passed to the DQA system 100 using Python (e.g., pandas) dataframes. A DQM implemented as a Python function or method can accept a parameter, interval_info_list, where each element in the list contains information about the data region, such as the relative position of the data region within the entire data set. Other fields corresponding to DQMs for the data region can be included, such as the proportion of missing values, the proportion of infinity values, and the proportion of zero values.
For example, a Python function (or method) to check for null values in a data set incrementally can be implemented in the following way. The function updates the null value metrics as new data regions are received. The function can have the following signature: check_na_columns_incremental(df, offset, interval_info_list), where df is the data-frame containing data for the data region; offset represents the relative position of the data region in the entire data set; and interval_info_list includes results from analyses of previous data regions. Accordingly, the function check_na_columns_incremental analyzes df, appends the results to interval_info_list, and returns the updated value of interval_info_list.
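The following is a minimal sketch, in Python with pandas, of how check_na_columns_incremental could be implemented under the signature given above. The exact fields stored for each data region are assumptions for illustration, drawn from the examples mentioned earlier (proportions of missing values, infinity values, and zero values).

    import numpy as np
    import pandas as pd

    def check_na_columns_incremental(df, offset, interval_info_list):
        # Analyze the current data region, append the results to
        # interval_info_list, and return the updated list.
        numeric = df.select_dtypes(include="number")
        info = {
            "offset": offset,  # relative position of region in the data set
            "rows": len(df),
            "na_proportion": df.isna().mean().to_dict(),
            "inf_proportion": numeric.apply(lambda c: np.isinf(c).mean()).to_dict(),
            "zero_proportion": numeric.eq(0).mean().to_dict(),
        }
        interval_info_list.append(info)
        return interval_info_list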
In embodiments of the invention, user programs (which can also be referred to as client programs) 766 do not have to deal with the internal structure of interval_info_list variables. User programs 766 can simply pass interval_info_list variables to the DQA system 100 via the API and rely on the DQA system 100 to interpret the interval_info_list variables.
In embodiments of the invention, the sub-module 770 is configured and arranged to maintain information on performance of different DQMs as a function of data size and possibly other characteristics of the data. The sub-module 770 is configured to maintain historical data on the performance of DQMs. As new data sets are analyzed, the sub-module 770 is configured to maintain persistent information on performance and other execution characteristics in a history recorder (HR) (e.g., history recorder 771 shown in the accompanying figures). The HR maintains information on the execution of DQMs and can be analyzed to better understand DQM performance. Thus, for a given data set and data set size, the sub-module 770 can estimate the overheads for different DQMs performed on that data set.
The algorithm of the sub-module 770B begins at block 2501 where the sub-module 770B maintains past statistics on computational overhead, o, and rate of change, f, for different DQMs. The variable f represents the magnitude with which a DQM changes as new data are received. Block 2501 is continuously executed over time. At block 2502, a user (e.g., user 502 shown in the accompanying figures) can optionally assign an importance value to one or more of the DQMs, where a higher importance value indicates that it is more important to keep the DQM up to date. The sub-module 770B then assigns a rating to each DQM based on its computational overhead o, its rate of change f, and any assigned importance value, and prioritizes DQM updates based on the ratings.
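A minimal sketch of how the rating and prioritization of the sub-module 770B could be implemented appears below. The rating formula (f multiplied by the importance value and divided by o) and the greedy selection under a fixed computation budget (in the spirit of the maximum time tmax described previously) are assumptions for illustration; the invention does not prescribe a specific formula.

    def prioritize_dqms(stats, importance, budget):
        # stats: dict mapping each DQM name to (overhead o, rate of change f);
        # importance: dict of user-assigned importance values (default 1.0);
        # budget: total computational overhead allowed for this update pass.
        rated = sorted(
            ((f * importance.get(name, 1.0) / o, name, o)
             for name, (o, f) in stats.items()),
            reverse=True,
        )
        selected = []
        spent = 0.0
        for rating, name, o in rated:
            if spent + o <= budget:  # recompute highest-rated DQMs first
                selected.append(name)
                spent += o
        return selected

Under these assumptions, DQMs that change quickly, matter more to the user, and cost little to recompute are updated first, while expensive, slowly changing DQMs are deferred.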
Computer system 2700 includes one or more processors, such as processor 2702. Processor 2702 is connected to a communication infrastructure 2704 (e.g., a communications bus, cross-over bar, or network). Computer system 2700 can include a display interface 2706 that forwards graphics, text, and other data from communication infrastructure 2704 (or from a frame buffer not shown) for display on a display unit 2708. Computer system 2700 also includes a main memory 2710, preferably random access memory (RAM), and can also include a secondary memory 2712. Secondary memory 2712 can include, for example, a hard disk drive 2714 and/or a removable storage drive 2716, representing, for example, a floppy disk drive, a magnetic tape drive, or an optical disk drive. Removable storage drive 2716 reads from and/or writes to a removable storage unit 2718 in a manner well known to those having ordinary skill in the art. Removable storage unit 2718 represents, for example, a floppy disk, a compact disc, a magnetic tape, or an optical disk, flash drive, solid state memory, etc. which is read by and written to by removable storage drive 2716. As will be appreciated, removable storage unit 2718 includes a computer readable medium having stored therein computer software and/or data.
In alternative embodiments, secondary memory 2712 can include other similar means for allowing computer programs or other instructions to be loaded into the computer system. Such means can include, for example, a removable storage unit 2720 and an interface 2722. Examples of such means can include a program package and package interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 2720 and interfaces 2722 which allow software and data to be transferred from the removable storage unit 2720 to computer system 2700.
Computer system 2700 can also include a communications interface 2724. Communications interface 2724 allows software and data to be transferred between the computer system and external devices. Examples of communications interface 2724 can include a modem, a network interface (such as an Ethernet card), a communications port, or a PCMCIA slot and card, etcetera. Software and data transferred via communications interface 2724 are in the form of signals which can be, for example, electronic, electromagnetic, optical, or other signals capable of being received by communications interface 2724. These signals are provided to communications interface 2724 via communication path (i.e., channel) 2726. Communication path 2726 carries signals and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, and/or other communications channels.
In the present description, the terms “computer program medium,” “computer usable medium,” “computer program product,” and “computer readable medium” are used to generally refer to media such as main memory 2710 and secondary memory 2712, removable storage drive 2716, and a hard disk installed in hard disk drive 2714. Computer programs (also called computer control logic) are stored in main memory 2710 and/or secondary memory 2712. Computer programs can also be received via communications interface 2724. Such computer programs, when run, enable the computer system to perform the features of the invention as discussed herein. In particular, the computer programs, when run, enable processor 2702 to perform the features of the computer system. Accordingly, such computer programs represent controllers of the computer system.
Technical effects and benefits of the disclosed DQA system for performing real-time data quality analysis include but are not limited to the following. Embodiments of the invention provide a DQA system that performs accurate data analytics checks in several problem domains, particularly in applications where new data are constantly being streamed in. The DQA system generates data quality metrics that are constantly updated as new data are received. The DQA system in accordance with aspects of the invention further provides new methods for performing data quality assessment when data are constantly being streamed in.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it defines multiple metrics which are changing over time. Although the disclosed DQA system provides incremental computations for a number of different data quality metrics, it goes significantly beyond past work in defining new metrics which improve upon existing metrics for static data sets. The disclosed DQA system is configured to, when considering new data which is streaming in, define new metrics which are applicable to constantly changing data rather than just using existing metrics. The disclosed DQA system allows data quality checks to be defined across specific windows of a data set.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that different time periods can be given different weights for calculating data quality metrics. In general, more recent data points can be assigned higher weights than less recent data points for assessing data quality metrics. Each sample or data point can be assigned a different weight based on its time. Samples can be grouped by time intervals and a same weight can be assigned to a set of samples belonging to a same group. In some cases, older values can be ignored entirely. Different algorithms can be applied to determine which older values should be ignored.
An additional technical benefit of the DQA system is that it can provide data quality metrics which are parameterized by time. The metrics can be calculated and visualized over any range of data points.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it is configured to provide complete analyses of data sets, including constraints which are applicable to a data set. A DQA system in accordance with aspects of the invention leverages an observation that, with real-time streaming data, applicability of constraints is not a static, fixed property. A constraint may be applicable at one particular time, but not for new data which are being received. The disclosed DQA system accordingly is configured to provide analysis of constraints across multiple time scales. The disclosed DQA system thus treats the applicability of a constraint or set of constraints as dynamic and expected to vary over time.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it maintains related values over several different time intervals, and time intervals where anomalous behavior is detected can be flagged. An example of anomalous behavior would be the value of related data points changing to values not seen before. The disclosed DQA system is configured to identify anomalous time intervals to the user. Anomalous time intervals can be left out or assigned a lower weight in calculating overall quality assessments.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it can handle different modalities of data (e.g., time series data and/or tabular data). The disclosed DQA system is configured to provide different data quality metrics for each type of modality it has been configured to process.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it tailors the data quality metrics to the type of analytics tasks associated with the data. For example, classification tasks have certain data quality metrics associated with them, while regression or clustering tasks have other data quality metrics associated with them. More generally, a user may be performing a specific analytics task, perhaps involving some combination of regression, classification, and clustering. The disclosed DQA system can provide specific data quality metrics suited to such a specific task.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it is configured to maintain state information about previous computations. For example, the state information corresponding to computed data quality metrics can be maintained as the system computes data quality metrics for a new region. This state information can be maintained in a file system or database. For situations in which it is not feasible to use a file system or database to maintain state information, the disclosed DQA system is configured to pass state information between a client program and the disclosed DQA system via an API. In this way, the disclosed DQA system generates the state variables, and once the state variables are created, they are passed between the disclosed DQA system and client programs via the API.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it efficiently manages the trade-offs between efficiency and accuracy of data quality metrics. Achieving the most accurate and up-to-date data quality metrics at all times can have prohibitive overhead. Thus, the disclosed DQA system makes appropriate trade-offs in providing reasonable data quality estimates while not using too many computational resources. The disclosed DQA system is configured to make electronically intelligent choices in both the frequency for recalculating data quality metrics and in selecting the most appropriate data quality metrics to recalculate.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it maintains information on the performance of different data quality metrics as a function of data size and possibly other characteristics of the data. The disclosed DQA system maintains historical data on the performance of the novel data quality metrics generated by the DQA system. As new data sets are analyzed, the disclosed DQA system maintains persistent information on performance and other execution characteristics in a history recorder (HR). The HR maintains information on the execution of data quality metrics and is analyzed to better understand the performance of the data quality metrics. Thus, for a given data set and data set size, the disclosed DQA system can estimate the overheads for different data quality metrics performed on that data set.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it maintains information on how data quality metrics change with changes in the data itself. The disclosed DQA system uses this information to predict how much data quality metrics would be expected to change in response to new data. Such change predictions can be made using simple calculations (e.g., using simulation algorithms) or more complex machine learning models.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it limits invocations of data quality metrics with high overhead, and data quality metrics with lower overhead can be executed more frequently.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that it focuses on both the rate of change of data and the data quality metrics themselves. If the rate of change is higher, data quality metrics need to be recalculated more frequently.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that, as more data are received, it can estimate using simple calculations and predictive models how much data quality metrics are expected to change. The disclosed DQA system is configured to recalculate data metrics which are expected to change the most.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that, when it recalculates the novel data quality metrics described herein, updated information is generated on how much the data quality metrics have changed in response to changes in the data. This information can be used to update predictive models on how data quality metrics change with changes in the input data. In this way, as the disclosed DQA system executes, it becomes smarter over time in predicting the behavior of data quality metrics and more accurate in computing performance metrics (with limited computational resources) over time.
An additional technical benefit of a DQA system in accordance with aspects of the invention is that users have the ability to assign an importance score to data quality metrics. A higher importance score indicates that it is more important to have the most up-to-date scores for a data quality metric.
It is understood in advance that although this disclosure includes a detailed description of cloud computing, implementation of the teachings recited herein is not limited to a cloud computing environment. Rather, embodiments of the present invention are capable of being implemented in conjunction with any other type of computing environment now known or later developed.
Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
Characteristics are as Follows:
On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.
Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and the consumer of the utilized service.
Service Models are as follows:
Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
Deployment Models are as follows:
Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.
Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.
Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).
A cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure comprising a network of interconnected nodes.
Referring now to the figures, an illustrative cloud computing environment is depicted, along with a set of functional abstraction layers provided by the cloud computing environment. It should be understood in advance that the components, layers, and functions described below are intended to be illustrative only, and embodiments of the invention are not limited thereto.
Hardware and software layer 60 includes hardware and software components. Examples of hardware components include: mainframes 61; RISC (Reduced Instruction Set Computer) architecture based servers 62; servers 63; blade servers 64; storage devices 65; and networks and networking components 66. In some embodiments, software components include network application server software 67 and database software 68.
Virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers 71; virtual storage 72; virtual networks 73, including virtual private networks; virtual applications and operating systems 74; and virtual clients 75.
In one example, management layer 80 may provide the functions described below. Resource provisioning 81 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 82 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may comprise application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 83 provides access to the cloud computing environment for consumers and system administrators. Service level management 84 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 85 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
Workloads layer 90 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics processing 94; transaction processing 95; and performing efficient real-time data quality analysis 96.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains” or “containing,” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.
Additionally, the term “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms “at least one” and “one or more” are understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc. The term “a plurality” is understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc. The term “connection” can include both an indirect “connection” and a direct “connection.”
The terms “about,” “substantially,” “approximately,” and variations thereof, are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, “about” can include a range of ±8% or 5%, or 2% of a given value.
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instruction by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments described herein.
Claims
1. A computer-implemented method for efficiently assessing data quality metrics, the computer-implemented method comprising:
- receiving, using a processor, a plurality of updates to data points in a data stream;
- providing, using the processor, a plurality of data quality metrics (DQMs);
- maintaining change information on how much the plurality of DQMs are changing over time;
- maintaining overhead information on computational overhead for the plurality of DQMs;
- updating data quality information based on the maintained change information and the maintained overhead information; and
- assigning a rating to each of the plurality of DQMs based on an overhead for computing the data quality metric and a rate of change of the data quality metric as new data points of the data stream are received.
2. The computer-implemented method of claim 1 further comprising receiving at least one importance value.
3. The computer-implemented method of claim 2 further comprising assigning the at least one importance value to one of the plurality of DQMs.
4. The computer-implemented method of claim 3, where the rating incorporates the at least one importance value assigned to the one of the plurality of DQMs.
5. The computer-implemented method of claim 4 further comprising determining the rating, where the rating is determined using a formula comprising: r1=a*o+b*f+c*i, where:
- * is a multiplication operation;
- a is a negative constant;
- b is a positive constant;
- c is a positive constant;
- o is an overhead for computing any one of the plurality of DQMs;
- f is a rate of change of the plurality of DQMs as new data points in the data stream are received; and
- i is an importance of any one of the plurality of DQMs received by the processor.
6. The computer-implemented method of claim 1 further comprising:
- specifying a maximum time (tmax) for updating the plurality of DQMs; and
- ceasing to update the plurality of DQMs after tmax has been exceeded.
7. The computer-implemented method of claim 3 further comprising performing updates to the plurality of DQMs at a frequency that is based on the rating assigned to each of the plurality of DQMs.
8. The computer-implemented method of claim 3 further comprising:
- assigning a ranking to each of the plurality of DQMs based on the rating of each of the plurality of DQMs; and
- updating the plurality of DQMs in an order specified by the ranking assigned to each of the plurality of DQMs.
9. The computer-implemented method of claim 1 further comprising analyzing instances of a DQM computed over a plurality of time intervals.
10. The computer-implemented method of claim 9 further comprising determining that at least one of the plurality of time intervals has an anomalous value for an instance of the DQM computed during the at least one of the plurality of time intervals.
11. The computer-implemented method of claim 10 further comprising computing a weighted aggregate value of an instance of the DQM across the plurality of time intervals.
12. The computer-implemented method of claim 11, where instances of the DQM in time intervals of the plurality of time intervals having the anomalous value are assigned lower weights in computing the weighted aggregate value.
13. The computer-implemented method of claim 1, where the processor is a node of a cloud computing system.
14. A computer system for efficiently assessing data quality, the computer system comprising a memory communicatively coupled to a processor, where the processor is configured to perform operations comprising:
- receiving a plurality of updates to data points in a data stream;
- providing a plurality of data quality metrics (DQMs);
- maintaining change information on how much the plurality of DQMs are changing over time;
- maintaining overhead information on computational overhead for the plurality of DQMs;
- updating data quality information based on the maintained change information and the maintained overhead information;
- specifying a maximum time (tmax) for updating the plurality of DQMs; and
- ceasing to update the plurality of DQMs after tmax has been exceeded.
15. The computer system of claim 14, where the operations further comprise receiving at least one importance value and assigning the at least one importance value to one of the plurality of DQMs.
16. The computer system of claim 14, where the operations further comprise assigning a rating to each of the plurality of DQMs based on an overhead for computing the data quality metric and a rate of change of the data quality metric as new data points of the data stream are received.
17. The computer system of claim 16, where the rating incorporates the at least one importance value assigned to the one of the plurality of DQMs.
18. The computer system of claim 17, where the operations further comprise determining the rating, where the rating is determined using a formula comprising: r1=a*o+b*f+c*i, where:
- * is a multiplication operation;
- a is a negative constant;
- b is a positive constant;
- c is a positive constant;
- o is an overhead for computing any one of the plurality of DQMs;
- f is a rate of change of the plurality of DQMs as new data points in the data stream are received; and
- i is an importance of any one of the plurality of DQMs received by the processor.
19. (canceled)
20. The computer system of claim 16, where the operations further comprise performing updates to the plurality of DQMs at a frequency that is based on the rating assigned to each of the plurality of DQMs.
21. The computer system of claim 16, where the operations further comprise assigning a ranking to each of the plurality of DQMs based on the rating of each of the plurality of DQMs.
22. The computer system of claim 14, where the operations further comprise:
- analyzing instances of a DQM computed over a plurality of time intervals;
- determining that at least one of the plurality of time intervals has an anomalous value for an instance of the DQM computed during the at least one of the plurality of time intervals; and
- computing a weighted aggregate value of an instance of the DQM across the plurality of time intervals;
- where instances of the DQM in time intervals of the plurality of time intervals having the anomalous value are assigned lower weights in computing the weighted aggregate value.
23. A computer program product for efficiently assessing data quality, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor system to cause the processor system to perform operations comprising:
- receiving a plurality of updates to data points in a data stream;
- providing a plurality of data quality metrics (DQMs);
- maintaining change information on how much the plurality of DQMs are changing over time;
- maintaining overhead information on computational overhead for the plurality of DQMs;
- updating data quality information based on the maintained change information and the maintained overhead information;
- specifying a maximum time (tmax) for updating the plurality of DQMs; and
- ceasing to update the plurality of DQMs after tmax has been exceeded.
24. The computer program product of claim 23, where the operations further comprise:
- receiving at least one importance value and assigning the at least one importance value to one of the plurality of DQMs; and
- assigning a rating to each of the plurality of DQMs based on an overhead for computing the data quality metric and a rate of change of the data quality metric as new data points of the data stream are received;
- where the rating incorporates the at least one importance value assigned to the one of the plurality of DQMs.
25. The computer program product of claim 24, where the operations further comprise determining the rating, where the rating is determined using a formula comprising: r1=a*o+b*f+c*i, where:
- * is a multiplication operation;
- a is a negative constant;
- b is a positive constant;
- c is a positive constant;
- o is an overhead for computing any one of the plurality of DQMs;
- f is a rate of change of the plurality of DQMs as new data points in the data stream are received; and
- i is an importance of any one of the plurality of DQMs received by the processor.
26. The computer system of claim 21, where the operations further comprise updating the plurality of DQMs in an order specified by the ranking assigned to each of the plurality of DQMs.
Type: Application
Filed: Jul 31, 2020
Publication Date: Feb 3, 2022
Patent Grant number: 11263103
Inventors: Arun Kwangil Iyengar (Yorktown Heights, NY), Anuradha Bhamidipaty (Yorktown Heights, NY), Dhavalkumar C. Patel (White Plains, NY), Shrey Shrivastava (White Plains, NY), Nianjun Zhou (Chappaqua, NY)
Application Number: 16/944,715