RESOURCE-AWARE AND ADAPTIVE ROBUSTNESS AGAINST CONCEPT DRIFT IN MACHINE LEARNING MODELS FOR STREAMING SYSTEMS

Complex computer system architectures are described for detecting a concept drift of a machine learning model in a production environment, for adaptive optimization of the concept drift detection, for extracting embedded features associated with the concept drift using a shadow learner, and for adaptive adjustment of the machine learning model in production to mitigate the effect of predictive performance drop due to the concept drift.

Description
CROSS REFERENCE

This application claims priority to U.S. Provisional Patent Application No. 62/963,961, filed Jan. 21, 2020, and U.S. Provisional Patent Application No. 62/966,410, filed Jan. 27, 2020, which are herein incorporated by reference in their entireties.

TECHNICAL FIELD

This disclosure relates to machine learning model production and model adaptation.

BACKGROUND

Machine learning models are trained to extract features and automate tasks previously attributed to humans with increasing efficiency and accuracy. The machine learning models may be trained and developed based on a set of training data encompassing underlying data rules, data correlations, data relationships, and data distributions. Such rules, correlations, relationships, and distributions may change or drift over time (referred to as “concept drift”), leading to a drop in the performance of the machine learning models.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1B illustrate using an example machine learning model to perform prediction of user response to online advertisement.

FIG. 2 illustrates exemplary consequence of concept drift in machine learning models.

FIGS. 3-5 illustrate exemplary underlying principles of concept drift in machine learning models.

FIG. 6 illustrates an exemplary concept drift in machine learning models.

FIGS. 7A-7F illustrate various exemplary forms of concept drift of machine learning models represented by a change of predictive accuracy over time.

FIG. 8 illustrates an exemplary implementation of an adaptive concept drift detection and correction engine in operation with a production environment for machine learning models.

FIG. 9 illustrates a data/logic flow and block diagram of an exemplary adaptive concept drift detection and correction engine in operation with a production environment for machine learning models.

FIG. 10 shows exemplary parameters and design choices for consideration in designing an architecture for an adaptive concept drift detection and correction engine and its various components.

FIG. 11 illustrates a data/logic flow and block diagram of an exemplary concept drift detection architecture.

FIG. 12 shows an exemplary ensemble concept drift detector.

FIG. 13 shows a data/logic flow and block diagram of an exemplary concept drift detector designer.

FIG. 14 illustrates an exemplary implementation of deriving explanation underlying an adaptive concept drift of a machine learning model using a shadow learner.

FIG. 15 illustrates an exemplary architecture for using an ensemble of machine learning models as a production model.

FIG. 16 illustrates an exemplary algorithm and corresponding data/logic flow for an ensemble model school or model library.

FIG. 17 shows an exemplary architecture for a computer device used to implement the functions of various components of the concept drift detection and correction system of FIGS. 1-16.

DETAILED DESCRIPTION

Defining, building, deploying, testing, monitoring, and updating machine learning models in a production environment poses a technical challenge on multiple disciplinary levels. For example, effectively managing machine learning models and maintaining their predictive accuracy in their production environments requires a diverse skillset including but not limited to business intelligence, domain knowledge, machine learning and data science, software development, DevOps (software development (Dev) and information technology operations (Ops)), QA (quality assurance), integrated engineering, and/or data engineering.

In this disclosure, complex computer system architectures are described for monitoring and detecting concept drift in machine learning models and for effectuating adaptive adjustment to these machine learning models in production in accordance with the detected concept drift. For example, a plug-and-play engine is provided for integration with a machine learning model production environment to implement the concept drift detection and model adaptation.

Machine learning models based on, e.g., various classification and/or regression algorithms, and/or deep learning neural network architectures, may be trained based on a set of training data containing a plurality of data items. Each of these data items may be associated with various data characteristics (input data properties to the machine learning model). The plurality of data items may be distributed according to certain data distributions over these various data characteristics. The plurality of data items may be further associated with one or more target variables (output of the machine learning model). The one or more target variables may exhibit certain data relationships with the various characteristics of the training data items. Such data relationships may be learned using various machine learning algorithms, resulting in a trained machine learning model. Such a machine learning model may be represented by a set of model parameters and may be used to predict target variables for an input data item whose target variables are unknown. The model parameters of the trained machine learning model are determined by the underlying data correlations, distributions, and relationships embedded in the training dataset.

In some applications such as those involving live streaming data, the input data processed by the machine learning model may continuously evolve in time. Accordingly, as shown in FIGS. 1-5, the underlying data correlations, distributions, and/or relationships may change or shift away from those of the original training dataset. As a result, the trained machine learning model may become stale and inaccurate in predicting the target variables for new incoming data items.

For example, FIG. 1A and FIG. 1B illustrate operations of an example machine learning model 102 trained using a set of static training data at a particular time frame for predicting whether a user is likely to click a particular type of online advertisement on mobile devices 106. The machine learning model 102 may be trained to identify/extract from the static training dataset a correlation between users of various characteristics (including, for example, age, gender, state of residence, and times of the day when various advertisements are presented to the users) and whether they would click advertisements of various properties. Once the machine learning model 102 is trained, it may be used to predict the probabilities 112 and 114 that users 108 and 110 with particular sets of characteristics 104 and 105 are likely to click a particular advertisement or a particular type of advertisement. For example, the machine learning model 102 may predict that the user 108 with characteristics 104 is likely to click a particular advertisement whereas the user 110 having characteristics 105 is unlikely to click the same advertisement. In the example above, the input to the machine learning model 102 may include, for example, characteristics of a user and properties of the advertisement. The predictive output of the model, alternatively referred to as a target variable, may represent a probability that the input advertisement will or will not be clicked by the user given the input user characteristics.
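
For illustration only, the following minimal sketch shows the input/output behavior described above, assuming a logistic-regression classifier and illustrative feature encodings; the disclosure does not prescribe any particular algorithm, feature set, or library.

```python
# Minimal sketch of the click-prediction model of FIGS. 1A-1B. The
# logistic-regression algorithm and feature encodings are illustrative
# assumptions, not prescribed by the disclosure.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [age, gender (0/1), state id, hour of day, ad category id]
X_train = np.array([
    [25, 0, 5, 9, 2],
    [34, 1, 5, 21, 2],
    [67, 0, 12, 14, 1],
    [52, 1, 12, 20, 1],
])
y_train = np.array([1, 0, 0, 1])  # 1 = clicked, 0 = did not click

model = LogisticRegression().fit(X_train, y_train)

# Target variable: probability that a new user clicks the advertisement
# (cf. predictions 112 and 114 for users 108 and 110).
p_click = model.predict_proba([[30, 1, 5, 10, 2]])[0, 1]
print(f"predicted click probability: {p_click:.2f}")
```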

However, the input-output correlation as extracted from the static training data and embodied in the trained machine learning model 102 may evolve or shift over time. As a result, the machine learning model 102 may gradually become stale. For example, as the users age, the correlations or underlying data relationships between user characteristics, such as the distribution of demographics of the users, and user online behavior (likelihood of clicking an online advertisement) may change over time. In particular, as time evolves, demographic groups of older ages may become increasingly more likely to click a particular type of advertisement. As shown by the example 200 in FIG. 2, particular demographic groups 202 and 204 may be unlikely to click a particular type of online advertisement in an earlier time frame 206 in comparison to other demographic groups, as shown by 210 of FIG. 2. At a later time frame 208, the same demographic groups 202 and 204 may become more likely to click the same type of online advertisement, as shown by 212 of FIG. 2. Such changes, shifts, or evolution of the underlying data distribution, correlation, or data relationship may be referred to as concept drift. Because the machine learning model is typically trained on a static set of historical training data, its predictive accuracy or other performance metrics in processing new input data may decrease due to a concept drift. For example, a machine learning model trained using a set of training data collected during the time frame 206 would predict inaccurately during time frame 208, particularly with respect to the demographic groups 202 and 204.

FIGS. 3-5 particularly illustrate the effect of concept drift on the time evolution of the predictive accuracy of a machine learning model that classifies input data (characterized by one or more features x1 302 and x2 304) into two classes (“class 1” 306 representing “click” of advertisement and “class 2” 308 representing “no-click” of advertisement, as delineated on the two sides of the class boundary lines 310, 312, 410, and 510). FIG. 3 shows that during a time frame when a training dataset is collected and the machine learning model is trained, the classification decision boundary 312 of input data in the feature diagram and as embedded in the machine learning model approximately reflects a true boundary line 310. The machine learning model thus can make a relatively accurate prediction of the classification of an input data item during this time frame. However, as a result of concept drift, as shown in FIG. 4 and FIG. 5, the trained classification boundary line 312 of the machine learning model does not accurately reflect the true boundary lines 410 and 510 of input data items at later time frames. The predictive accuracy of the machine learning model thus may deteriorate with time as a consequence of the concept drift.

As an example, FIG. 6 further illustrates a predictive accuracy drop 602 of a machine learning model as a function of time 604 due to concept drift. Such an accuracy drop may be determined by comparing the prediction of the machine learning model in production with reality (e.g., whether a user has actually clicked a particular online advertisement or not in the example online streaming application above). The accuracy of the machine learning model is merely one of many metrics that may be used to represent the performance quality of the model. The time evolution of any other single or combinational metrics may be used as an indicator of concept drift for the machine learning model.
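
For illustration only, the following sketch reproduces the accuracy decay of FIG. 6 by training a classifier during an early time frame and scoring it on later time frames whose true boundary (cf. lines 310, 410, and 510 of FIGS. 3-5) rotates over time; the data generator, rotation schedule, and classifier choice are assumptions made solely for this example.

```python
# Sketch of the accuracy decay of FIG. 6: a model trained on the concept of
# FIG. 3 is scored on later windows whose true boundary rotates (FIGS. 4-5).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def window(angle, n=500):
    """Sample (x1, x2) points labeled by a boundary line at `angle` radians."""
    X = rng.uniform(-1, 1, size=(n, 2))
    normal = np.array([np.cos(angle), np.sin(angle)])
    y = (X @ normal > 0).astype(int)     # class 1 vs. class 2
    return X, y

X0, y0 = window(angle=0.0)               # training-time concept (FIG. 3)
model = LogisticRegression().fit(X0, y0)

for t in range(6):                        # later time frames (FIGS. 4-5)
    Xt, yt = window(angle=0.15 * t)       # true boundary drifts each step
    print(f"t={t}  accuracy={model.score(Xt, yt):.2f}")
```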

FIGS. 7A-7F further show various forms in which performance metrics of the machine learning model may evolve as a function of time as a result of the presence or absence of concept drift, including but not limited to sudden concept drift (FIG. 7C), transitional concept drift (FIG. 7E), recurring concept drift (e.g., seasonality, FIG. 7D), and incremental concept drift (FIG. 7F), as compared to the situation where there is no observable concept drift (FIG. 7A) and the situation where there are merely outliers or noise (FIG. 7B). The presence of one or more of these exemplary forms of concept drift, which then triggers an adaptive correction of the machine learning model to improve or restore its predictive accuracy, may be determined/detected according to various factors and parameters as described in further detail in the example implementations below.
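
For illustration only, the drift forms of FIGS. 7A-7F can be mimicked as synthetic error-rate time series; the shapes and magnitudes below are assumptions for visualization, not measurements from the disclosure.

```python
# Synthetic error-rate traces mimicking FIGS. 7A-7F; values are illustrative.
import numpy as np

t = np.arange(200)
rng = np.random.default_rng(1)

def noise():
    return rng.normal(0.0, 0.01, t.size)

no_drift = 0.10 + noise()                                               # FIG. 7A
outliers = 0.10 + noise(); outliers[[50, 120]] += 0.30                  # FIG. 7B
sudden = np.where(t < 100, 0.10, 0.35) + noise()                        # FIG. 7C
recurring = 0.10 + 0.15 * (np.sin(t / 15.0) > 0) + noise()              # FIG. 7D
transitional = 0.10 + 0.25 / (1 + np.exp(-(t - 100) / 10.0)) + noise()  # FIG. 7E
incremental = 0.10 + 0.002 * t + noise()                                # FIG. 7F
```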

A machine learning model, once trained, may be deployed in a production environment. Such a machine learning model, as described above, may thus suffer from the consequences of concept drift and may need to be corrected over time in order to maintain desired performance metrics. Rather than re-engineering the model features and retraining the machine learning model using an up-to-date training dataset, which could involve an elaborate and time-consuming process of building the model from scratch by data scientists and business and operational personnel, the various implementations below are directed to an adaptive concept drift engine (ACDE) that may be used as a plug-and-play component in the production environment of the machine learning model for detecting concept drift and for adaptively adjusting the machine learning model over time. The disclosure below is further related to U.S. application Ser. No. 16/891,980, filed by the same applicant of this application on Jun. 3, 2020, and U.S. application Ser. No. 16/749,717, filed by the same applicant on Jan. 22, 2020, which are incorporated by reference in their entireties.

FIG. 8 illustrates an exemplary implementation of the ACDE 800 in operation with a production environment 801 for a machine learning model. FIG. 8 shows a production machine learning model that predicts a time sequence of target variables (y1, y2, . . . , yt) 804 based on a time sequence of input data (x1, x2, . . . , xt) 802. The adaptive concept drift engine ACDE 800 as illustrated in FIG. 8 may include, for example, components including but not limited to a concept drift detection (DD) engine 810, a detector evaluator (DE) engine 812, a reasoning and explainability (RE) engine 814, a model school (MS) 816, and a data management (DM) engine 818. For example, the DD engine 810 may be responsible for determining whether a concept drift has occurred and for triggering other activities in the ACDE 800. The DE engine 812 may be responsible for monitoring the performance quality of the concept drift detection engine 810. The RE engine 814 may be responsible for identifying why, how, and when the concept drift has occurred and converting the reason behind the concept drift into an explanation understandable by human operators. The MS 816 may be responsible for building and managing machine learning models that may provide model adaptability to the machine learning model production environment 801 such that the consequences of concept drift are adaptively and timely mitigated and the performance metrics of the production machine learning model are maintained over time, without having to retrain and/or rebuild the machine learning model from scratch. The DM engine 818 may be responsible for managing the various data needed for the ACDE 800 considering various computing resource limitations and constraints. The ACDE 800 communicates with the production environment 801 of the machine learning model for obtaining predictions from the machine learning model, as shown by arrow 820, and for updating the machine learning model to correct the detected concept drift, as shown by arrow 830. The ACDE 800 may further be provided with an interactive interface to accept commands from and send notices/alerts to an operator 840, as shown by 850.
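
For illustration only, the following structural sketch wires together stubs named after the ACDE components of FIG. 8; every class and method name is a hypothetical placeholder for the responsibilities described above, not an API from the disclosure.

```python
# Structural sketch of the ACDE 800 of FIG. 8; all names are hypothetical.
class DriftDetectionEngine:                       # DD engine 810
    def detect(self, y_pred, y_true):
        return y_true is not None and y_pred != y_true  # placeholder rule

class DetectorEvaluator:                          # DE engine 812
    def evaluate(self, detector, store):
        pass                                      # monitor detector quality

class ReasoningEngine:                            # RE engine 814
    def explain(self, store):
        return "split feature changed between windows"  # placeholder reason

class ModelSchool:                                # MS 816
    def provide_model(self, store):
        return lambda x: 0                        # placeholder updated model

class DataManager:                                # DM engine 818
    def __init__(self):
        self.records = []
    def store(self, x, y, label):
        self.records.append((x, y, label))

class ACDE:
    def __init__(self, predict_fn, alert_fn=print):
        self.dd, self.de = DriftDetectionEngine(), DetectorEvaluator()
        self.re, self.ms = ReasoningEngine(), ModelSchool()
        self.dm = DataManager()
        self.predict, self.alert = predict_fn, alert_fn

    def step(self, x, label=None):
        y = self.predict(x)                       # prediction from production (820)
        self.dm.store(x, y, label)
        if self.dd.detect(y, label):              # DD triggers other activities
            self.alert(self.re.explain(self.dm))  # notice to operator (850)
            self.predict = self.ms.provide_model(self.dm)  # model update (830)
        self.de.evaluate(self.dd, self.dm)

acde = ACDE(predict_fn=lambda x: 1)
acde.step({"age": 30}, label=0)  # mismatched label trips the placeholder detector
```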

Various exemplary metrics and parameter spaces may be used in the various components of the ACDE 800 of FIG. 8. These metrics and parameters may be pre-configured in the design of the ACDE 800 and may be modifiable by the operator of the ACDE 800 via the interactive user interface 850. For example, the DD engine 810 may be designed to detect concept drift based on predictive precision or accuracy. Further, the detection threshold for determining the occurrence of concept drift may be based on a sensitivity requirement (e.g., how the DD engine 810 responds to noise and outliers as shown in FIG. 7B) and the robustness of the DD engine 810 (e.g., tolerance of misidentification of concept drift). The DD engine 810 may include a single concept drift detector, or may include an ensemble of concept drift detectors for statistical detection of various forms or types of concept drift. Further, a maximum delay (delay threshold) for the DD engine 810 in detecting concept drift may be specified or configured. Such a delay threshold may be determined, for example, based on computing resources available to the DD engine 810.

For another example, the DE engine 812 may be implemented by analyzing a single metric or an ensemble of metrics including but not limited to precision, recall, accuracy, and distribution with respect to the concept drift detector 810. In the DE engine 812, the timing for the labeling of the input data may be specified as instant or delayed (true labels for the input data to the machine learning model may be provided instantaneously or with delay). Specifically, in the example of the online streaming application above, whether the user has actually clicked a particular online advertisement may be obtained instantaneously or with delay. In some other implementations, the labeling may be unsupervised such that there are no true labels.

For another example, the RE engine 814 may include a shadow learner to produce an auxiliary machine learning model (such as a Hoeffding tree) that may be used to extract and determine data features that have drifted. In some implementations, the RE engine 814 may include an ensemble of shadow learners. The training strategy of these shadow learners may be predetermined. The output of the RE engine 814 may be a set of reasoning types for the detected concept drift.

For another example, the MS 816 may include machine learning models trained at various times and contain a library of machine learning models. Training a new model to be included in the model library may be triggered on demand or automated periodically (scheduled). The training of these machine learning models may be determined by re-training strategies represented by a retraining recipe. The library of models may include user models trained at various times and may also include shadow models trained by the MS 816. The MS 816 may be deployed to provide either an ensemble of machine learning models or a single model, adaptively selected from the models in the model library, to the production environment 801 to address detected concept drift. The training of the various machine learning models in the model library may be supervised (with true labels) or unsupervised (without true labels). These models may be further validated prior to becoming available to the MS 816. Validation of the machine learning models may be of various types. A validation type represents how an updated (re-trained) model may be verified against unseen (newly streamed) data. The validation process provides assurance of the functionality of the various machine learning models when they are built, and further sheds light on the generalizability of these models against unseen data. In some implementations, cross-validation may be performed. A cross-validation process, for example, splits data into two batches: (a) training data to build the model, and (b) test data to verify the updated model. A machine learning model may be first trained and then tested against the test data (performed k times with data split permutations, k > 1). In some implementations, a holdout process similar to cross-validation may be used, where k equals 1, as described in more detail below. Further, a prequential process for a test-then-train scenario may be used, where newly arrived data is first used for prediction and is then immediately utilized to train (or update) the model.
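
For illustration only, the prequential (test-then-train) process described above can be sketched as follows, with an incrementally trainable classifier standing in for a model from the model library; the model choice and the synthetic labeler are assumptions for this example.

```python
# Sketch of prequential validation: each arriving item is first used for
# prediction and then immediately used to update the model.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()               # any incrementally trainable model
classes = np.array([0, 1])
correct = seen = 0

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.uniform(-1, 1, size=2)
    y_true = int(x[0] + x[1] > 0)     # true label from the labeler
    if seen > 0:
        correct += int(model.predict([x])[0] == y_true)   # test first ...
    model.partial_fit([x], [y_true], classes=classes)     # ... then train
    seen += 1

print(f"prequential accuracy: {correct / (seen - 1):.3f}")
```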

For yet another example, the DM engine 818 may be based on a set of data and/or concept storage/forgetting strategies as determined by computing resource (e.g., hardware) limitations such as memory and storage resource constraints. In other words, such strategies determine which data/concepts should remain in storage and which data should be removed from memory. The DM engine 818 may also operate under a predetermined data sampling frequency and data time window size.
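
For illustration only, one possible storage/forgetting strategy for the DM engine is a fixed-size sliding window with a sampling stride, sketched below; the window size and stride are assumed values standing in for limits derived from actual memory constraints.

```python
# Sketch of a DM-engine forgetting strategy: a fixed-size sliding window
# (oldest records dropped first) with a configurable sampling stride.
from collections import deque

class SlidingWindowStore:
    def __init__(self, window_size=1000, sample_every=10):
        self.window = deque(maxlen=window_size)   # forget oldest items first
        self.sample_every = sample_every
        self._count = 0

    def offer(self, record):
        """Retain every `sample_every`-th record; discard the rest."""
        self._count += 1
        if self._count % self.sample_every == 0:
            self.window.append(record)

    def recent(self, n):
        """Return up to the n most recently retained records."""
        return list(self.window)[-n:]

store = SlidingWindowStore(window_size=1000, sample_every=10)
for t in range(100_000):
    store.offer({"t": t, "x": t % 7, "y": t % 2})
print(len(store.window), store.recent(2))
```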

FIG. 9 illustrates a logic/data flow and block diagram of an exemplary ACDE in operation with a production environment 900 for machine learning models. FIG. 9 shows that the machine learning model production environment 900 may include a current production model (or model ensemble) 902 that is used for prediction. The production environment 900 further contains the ACDE including a DD (drift detector) engine 906, an RE (reasoning/explainability) engine 910, an MS (model school) 904, and a DM (data manager) engine 908 interconnected as shown in FIG. 9. The input streaming data 901 may be received by the DM engine 908, the RE engine 910, and the deployed production machine learning model or model ensemble 902 (referred to as the production machine learning model or production model). The production model 902 performs prediction based on the input data 901 and the RE engine 910 performs analytics to identify explanations of changes in the embedded correlations, rules, or data relations in the input data. The outputs of the production model 902 and the RE engine 910 may be stored in the DM engine 908. The output of the production model 902 is further sent to the DD engine 906, which performs drift detection. The output from the DD engine 906 may be sent to the DM engine 908 for data storage/management. The MS 904 receives information from the DM engine 908 (including historical model data and detector data) and may be triggered to train new machine learning models to be included in the MS based on predetermined training strategies. The MS 904 further provides a single machine learning model or an ensemble of machine learning models to the production model 902 for adaptively adjusting the currently deployed machine learning model or models used in production for live prediction. The model adaptation may be triggered on-demand, or may be pre-scheduled. The training of the machine learning models in the MS 904 and the adjustment of the production model 902 from the MS 904 may be based on a set of learning/updating strategies involving, for example, data sampling windows for retraining and whether the models added to the MS 904 should be fixed or adaptive. The DD engine 906 may further include a detector evaluator (DE) (shown below in FIG. 11). The DE may use the detection output of the DD engine 906 and evaluate the performance of the DD engine 906. The evaluation may be sent to the DM engine 908. Such evaluation and other data from the DM engine 908 may be used by the DD engine 906 to adaptively reconfigure itself to improve concept drift detection based on various concept drift detection metrics.

Various exemplary design parameters and design choices may be taken into consideration in designing an architecture for the ACDE and its various components in FIGS. 8 and 9. An example set of parameter and design choice types, and the options within each, is shown in Table 1 and correspondingly in FIG. 10. These parameters and design choices are further explained in detail in the description below for each individual component of the ACDE. FIG. 10 particularly illustrates various configurable parameters and design choices 1004 for a set of parameter and design choice types 1002, and an example configuration (a particular combination of these parameters and design choices) as indicated by the arrows 1010, 1012, 1014, 1016, and 1018.

TABLE 1

Retraining frequency:
  Never: Same initial model used.
  Preset: Occurs at pre-specified times.
  On-demand: Occurs upon request by a drift detector module or human operator.
  Online: Occurs for every newly streamed dataset.

Model architecture type:
  Constant: Use initial model architecture (e.g., initial neural network architecture).
  Shadow learning: Alternate model(s) trained in the background on newly streamed data. Upon request, shadow model can replace/augment main model.
  Ensemble learning: A set of weak learners trained in the background or on-demand.
  Dynamic: Explore fully new model architecture, new data features, and new training data.

Model error calculation mode:
  Instant-supervised: Newly streamed data's true labels/values are obtained instantly.
  Delayed-supervised: Newly streamed data's true labels/values are obtained after a grace period.
  Semi-supervised: Only a subset of newly streamed data's true labels/values are obtained.
  Unsupervised: Newly streamed data's true labels/values are unknown.

Model update mode:
  None: Never update the model.
  Partial update: Update original model using newly streamed data.
  Global update: Retrain same model using all collected and streamed dataset.
  Complete replacement: Choose a new model architecture and features, and retrain using all accumulated data.

Model storage & forgetting in MS mode:
  Never: Never store the model.
  Preset: Rule-based storing/forgetting of model/weak learners.
  On-demand: Occurs upon request by a drift detector module, resource planner, or human operator.
  Online: Set by drift detection precision & resource requirements; rebalance weak learners' contribution automatically.

In some implementations of the ACDE, required and available computing resources may be taken into consideration in the selection or configuration of the parameters and design choices above. Computing resources to be considered may include but are not limited to the amount of memory, processing power, I/O capabilities, storage size, and encryption requirements. The various ACDE architectural options described above correspond to different resource demands. For example, the sizes of various buffers needed in the various components of the ACDE may depend on the streaming data rate, the sampling window size, the sizes of the machine learning models, and the size of the concept drift detector or detector ensemble. The computing resource requirements for implementing the ACDE further depend on the particular algorithms included in the machine learning models in the model school and the desired prediction accuracy of these models. Computing resource needs further depend on the data labeling requirements for concept drift detection and model retraining in the ACDE, e.g., whether the concept drift detection process and model retraining process are supervised, semi-supervised, or unsupervised, and whether the labeling process needs to be instant, on demand, or can be delayed.

The computing resource availability, limitations, or constraints for implementing the ACDE may be translated into a set of meaningful parameters and optimization objectives that are further used to effectuate selection and configuration of various architectural and implementation choices of the ACDE and its components as described above in Table 1 and FIG. 10. For example, the design of the concept drift detection and adaptation components of the ACDE may be transformed into a search or optimization problem. In an example implementation, for detection of potential concept drift in a machine learning classifier that periodically reads streaming sensor inputs to predict whether a machinery system is in a healthy operating condition, the ACDE may be constrained to a maximum RAM size (e.g., 100 Mbyte). Such a constraint in RAM size may be translated into real design parameters that are fed into an optimization algorithm (e.g., a Multi-Objective Bayesian Optimization Genetic Algorithm, or MOBOGA, as disclosed in more detail in Provisional Application No. 62/913,554 and described in more detail below) to provide the best solution, e.g., the best combination of detection algorithm and number of detectors in a detector ensemble in terms of accuracy and other metrics, within the given constraint.

The concept drift detection engine of FIGS. 8 and 9 may be based on a single concept drift detector or an ensemble of concept drift detectors. The purpose of using a detector ensemble, for example, is to provide the flexibility to select multiple detectors of various characteristics that collectively provide more accurate detection of different types of concept drift for a particular application. The detection of concept drift may be based on a comparison of predictions by the currently deployed production model with actual labels of the input streaming data. Each of the concept drift detectors may be based on various detection algorithms including but not limited to the Drift Detection Method (DDM), Early Drift Detection Method (EDDM), Adaptive Windowing (ADWIN) detection, Page-Hinkley (P-H) statistical detection, and CUmulative SUM (CUSUM) detection. The detection output may be based on single or combinational metrics including but not limited to prediction accuracy, prediction precision, and detection delay. An ensemble detection approach, as described in more detail below, may facilitate increasing detection precision at the expense of, for example, increased detection delay due to added processing complexity. The detection engine performs a detection procedure to detect a concept drift of a machine learning model using any combination of the approaches above.
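
For illustration only, the following is a from-scratch sketch of one detector from the list above, the Page-Hinkley test, applied to a stream of 0/1 prediction errors; the delta and lambda parameters are illustrative choices, not values prescribed by the disclosure.

```python
# From-scratch sketch of the Page-Hinkley (P-H) test, monitoring a stream of
# 0/1 prediction errors for an upward shift in their mean.
class PageHinkley:
    def __init__(self, delta=0.005, lam=20.0):
        self.delta, self.lam = delta, lam
        self.mean = 0.0        # running mean of the monitored signal
        self.cum = 0.0         # cumulative deviation m_t
        self.cum_min = 0.0     # running minimum M_t
        self.n = 0

    def update(self, x):
        """Feed one observation; return True once drift is signaled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.lam

import random
random.seed(0)
detector = PageHinkley()
for t in range(2000):
    error_rate = 0.1 if t < 1000 else 0.4   # concept drift at t = 1000
    err = 1.0 if random.random() < error_rate else 0.0
    if detector.update(err):
        print(f"drift signaled at t={t}")   # expect shortly after t = 1000
        break
```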

The design and optimization of the concept drift detection engine of the ACDE of FIGS. 8 and 9 may be governed by a multitude of high-level design strategies. For example, the concept drift may be defined or quantified according to changes in the characteristics of the input data, the output data from the deployed production machine learning model, or both. For a specific example, the concept drift may be defined as a reduction of some metric(s) of the production model due to a change in the probability distribution of the input data (for reasons that may not necessarily be known yet). The metric(s) may be used as label(s) for the concept drift. The threshold(s) of the metric(s) that indicate concept drift may be preconfigured or may be set up or modified by the user via a user interface of the ACDE. The threshold(s) may be relative (percentage drop) or absolute. The actual labels are provided using a labeler employing either a supervised or unsupervised method. The actual labels may be generated by the user of the production model (e.g., when the user actually clicks a particular advertisement or moves on to a next user interface without clicking the particular advertisement in the online application described above). An optimization engine may be employed to improve the concept drift detector based on an evaluation of the performance of the concept drift detector. The optimization may be performed over multiple iterations by the optimization engine, and the optimization engine may identify and recommend a detector configuration for a next detection period. The optimization engine for the concept drift detector may be based on, for example, the MOBOGA described above. The optimization engine for the concept drift detector may determine what data to use to design the next detector configuration. The detector configuration may be updated if needed in each detection period. The drift detector design and optimization may be based on a hierarchical infusion of cross-sector goals including practical, technical, and operational goals. These goals may be parametrized into model performance metric(s) (such as detection accuracy, precision, and delay). The detector may be designed and optimized to maintain these goals.

FIG. 11 illustrates an exemplary algorithm and a corresponding data/logic flow and block diagram 1100 for a concept drift detection architecture. Specifically, the concept drift detection architecture 1100 may include a concept drift detector 1102, a detector evaluator 1104, a data labeler 1106, a model evaluator 1108, and a detector designer 1110 that interact with the production model (or model ensemble) 1112 and the data manager (DM) 1114. As further shown by FIG. 11, the input data 1116 may be streamed to the production model (or model ensemble) 1112, the data labeler 1106, and the data manager 1114. The production model (or model ensemble) 1112 may perform prediction. The input data stream 1116 may be labeled during a detection period by the data labeler 1106. The outputs of the production model 1112 and the data labeler 1106 may be analyzed by the model evaluator 1108. The output of the model evaluator 1108 may be used at the end of the detection period by the concept drift detector 1102 for the detection of concept drift. The output of the concept drift detector 1102 may be stored in the data manager 1114, may be input into the detector evaluator 1104, and may be used by the detector evaluator 1104 to assess the drift detector's performance in the current detection period. The output of the detector evaluator 1104 may be stored in the data manager 1114. The detector designer 1110 may obtain information from the data manager 1114. An optimization engine in the detector designer 1110, as described in further detail below, may be used to generate a new detector configuration based on the output of the detector evaluator 1104. The detector designer 1110 and its optimization engine may perform the optimization based on historical data for the concept drift detector 1102, the detector evaluator 1104, and the production model 1112 as stored in the data manager 1114. In addition, the detector designer 1110 may take into account performance metrics and hardware constraints (e.g., computing and storage resources, as described above) in generating an optimal detector configuration for the next detection period. The above process repeats from detection period to detection period to maintain an optimal detector configuration and performance under a particular computing resource constraint.

As further shown in the exemplary concept drift detection architecture 1100 of FIG. 11, in some implementations, the data/logic flow and architecture 1100 may be divided into an online flow 1140 and an on-demand flow 1150. For example, the drift detection and evaluation may be performed continuously online as streaming input data 1116 is being processed, whereas the detector designing process may be performed on-demand or on a different timescale for updating the concept drift detectors in the concept drift detection engine 1102.

FIG. 12 shows an exemplary ensemble concept drift detector 1200. As shown in FIG. 12, the ensemble concept drift detector 1200 may include a plurality of individual concept drift detectors 1204 for generating individual detection outputs 1205 which are weighted using weights 1206 and combined (1208) to generate an overall detection output. The prediction or detection results of the detection branches associated with the set of detectors 1204 may then be held (or delayed with configurable delays) prior to being combined, as shown by 1207. The ensemble drift detector may perform binary detection based on a threshold level of the overall weighted detection output according to a set of combination detection rules, as shown by 1208. The set of combination rules in 1208 may be based on, for example, a business and/or technology-aware rule set. The ensemble detector 1200 of FIG. 12 thus may be characterized by a number of detectors 1204, a set of weights 1206, and the overall threshold for the detection of concept drift 1208. The various individual detectors in the detector ensemble may be chosen to include a diverse set of detectors 1204 that are advantageous in various different aspects and provide improved accuracy and robustness of the overall detection. The set of detectors 1204 may be determined by the optimization engine of the detector designer 1110 of FIG. 11 by optimizing overall detection accuracy, precision, detection delay, detection sensitivity, and robustness. Both the outputs from the individual detectors 1204 and the overall detection output may be provided as alerts to the user of the ACDE and other components of the ACDE via a user interface of the ACDE, as shown by 1160 of FIG. 11. Each of the concept drift detectors of the ensemble of detectors 1204 of FIG. 12 forms a detection path, shown as the various detection branches of FIG. 12, and each detection path is associated with a weight parameter 1206 and a detection delay 1207.
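
For illustration only, the weighted combination of FIG. 12 can be sketched as follows, reusing the PageHinkley sketch above as the individual detectors 1204; the weights 1206, delays 1207, and threshold 1208 shown are assumed configuration values.

```python
# Sketch of the ensemble detector 1200: individual outputs 1205 are held for
# configurable delays (1207), weighted (1206), and compared to an overall
# threshold by a combination rule (1208).
from collections import deque

class EnsembleDriftDetector:
    def __init__(self, detectors, weights, delays, threshold):
        self.detectors, self.weights = detectors, weights
        self.threshold = threshold
        # one hold buffer per detection branch; buf[0] is the delayed output
        self.buffers = [deque([0.0] * d, maxlen=d + 1) for d in delays]

    def update(self, x):
        score = 0.0
        for det, w, buf in zip(self.detectors, self.weights, self.buffers):
            buf.append(1.0 if det.update(x) else 0.0)   # individual output 1205
            score += w * buf[0]                         # delayed and weighted
        return score >= self.threshold                  # combination rule 1208

# e.g., three P-H branches of different sensitivities (PageHinkley from the
# earlier sketch)
ensemble = EnsembleDriftDetector(
    detectors=[PageHinkley(lam=10.0), PageHinkley(lam=20.0), PageHinkley(lam=40.0)],
    weights=[0.2, 0.3, 0.5],
    delays=[0, 0, 2],
    threshold=0.5,
)
```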

The concept drift detector evaluator 1104 in FIG. 11 for evaluating the performance of the concept drift detector 1102 may be implemented in various manners. The concept drift detector evaluator 1104 may be configured to quantify the delay, accuracy/precision, and other metrics of the concept drift detector 1102. For example, the concept drift detector evaluator 1104 may be configured to compare the time when the drift detector alerts the user of a concept drift to the time when the drift was defined as having occurred, in order to evaluate the delay or latency characteristics of the concept drift detector 1102. Further, different drift detection techniques provide different levels of sensitivity and robustness against the occurrence of concept drift. Hence, they identify a concept drift in the model and newly streamed data at various time steps (or different delay values). The detector evaluator 1104 may be designed to analyze such sensitivities. The ensemble detection algorithm above in combination with the detector evaluator 1104 particularly benefits from combining the results of various detection techniques, as guided by the detector evaluator 1104, to be as close as possible to the true occurrence of concept drift.
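
For illustration only, the delay and false-alarm quantities evaluated by the detector evaluator 1104 can be computed as sketched below; the alert timestamps are hypothetical inputs.

```python
# Sketch of delay and false-alarm metrics for the detector evaluator 1104.
def detection_delay(alert_steps, true_drift_step):
    """Steps between the true drift onset and the first alert at/after it;
    None if the detector never alerted after the drift."""
    later = [t for t in alert_steps if t >= true_drift_step]
    return (min(later) - true_drift_step) if later else None

def false_alarms(alert_steps, true_drift_step):
    """Number of alerts raised before the drift actually occurred."""
    return sum(1 for t in alert_steps if t < true_drift_step)

print(detection_delay([150, 1063, 1200], true_drift_step=1000))  # -> 63
print(false_alarms([150, 1063, 1200], true_drift_step=1000))     # -> 1
```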

FIG. 13 further illustrates a data/logic flow and block diagram of an exemplary concept drift detector designer 1110 of FIG. 11. The main component of the detector designer 1110 is an optimization engine 1302. The optimization engine 1302 may be implemented as a multi-objective optimization engine, such as the MOBOGA implementation described above. The optimization engine 1302 may receive detector history data 1301 from the data manager (DM, 1114 of FIG. 11) and perform the optimization to output a candidate detector configuration characterized by detector configuration parameters such as the number of individual detectors in the detector ensemble, the types of the individual detectors, and the detector weights for combining detection outputs from the individual detectors into the overall detection output. The detector designer 1110 may further include a simulator 1304 for simulating and evaluating the candidate detector configuration generated by the optimization engine 1302 and provided to the simulator 1304 (as shown by arrow 1306). The evaluation may be based on predefined detector metrics (including but not limited to detection delay, detection accuracy, and detection precision) by comparing with sampled concept drift data from the data manager, as shown by 1320. The evaluation result may be provided back to the optimization engine 1302 for re-optimization, as shown by arrow 1308. The optimization process above may repeat for a predetermined number of iterations or may iterate until the simulator provides a set of evaluation metrics that satisfy predetermined threshold levels, as shown by the arrows 1306 and 1308. The candidate detector configurations generated through the various iterations above may then be compared, and a best detector configuration 1310 may be selected and used for the next detection period.
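
For illustration only, the optimize-simulate-select loop of FIG. 13 is sketched below with plain random search standing in for the MOBOGA optimization engine 1302 (a deliberate simplification, not the disclosed algorithm); a synthetic drift replay stands in for the simulator 1304 and the sampled concept drift data 1320, and the PageHinkley sketch above is reused.

```python
# Sketch of the detector-designer loop of FIG. 13 (random search as a
# stand-in optimizer; synthetic drift replay as a stand-in simulator).
import random

def simulate(lam, episodes=10, drift_at=500, horizon=1000):
    """Score a candidate config by detection delay and false-alarm penalty."""
    total = 0.0
    for seed in range(episodes):
        rng = random.Random(seed)
        det, first_alert = PageHinkley(lam=lam), None
        for t in range(horizon):
            p = 0.1 if t < drift_at else 0.4
            if det.update(1.0 if rng.random() < p else 0.0):
                first_alert = t
                break
        if first_alert is not None and first_alert >= drift_at:
            total += first_alert - drift_at   # detection delay
        else:
            total += 200.0                    # false alarm or missed drift
    return -total / episodes                  # higher is better

def design_detector(iterations=30):
    best_lam, best_score = None, float("-inf")
    rng = random.Random(42)
    for _ in range(iterations):               # iterate (arrows 1306/1308)
        lam = rng.uniform(5.0, 60.0)          # candidate configuration
        score = simulate(lam)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam                           # best configuration 1310

print(f"selected lambda: {design_detector():.1f}")
```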

The data manager, such as the DM 1114 of FIG. 11, may be configured to store streaming data and their labels, the model metrics (for the predictions of the production models), and the detector metrics (for the performance of historical concept drift detectors). As described above, strategies for the amount of data to store, the updating of new data, the removal of data (forgetting strategy), sampling parameters, and data window size for various components of the ACDE may be determined based on limits and constraints of hardware computing resources. Such hardware constraints may include but are not limited to storage space, memory size, hardware redundancy requirements, network bandwidth, and processing throughput and latency. The data manager 1114 may be implemented based on various types of database technologies and may be configured with an interface through which other components of the ACDE send data requests and receive returned data.

Turning to the reasoning and explainability (RE) engine 910 of the ACDE of FIG. 9, such an RE engine 910 may be implemented for determining when, how, and why a concept drift occurs, and from such determination, for extracting information such as the type of the concept drift, how the concept drift can be understood intuitively, and the time frame over which the concept drift occurs. As further shown in FIG. 14, the RE engine 910 may be implemented based on shadow learning. In particular, the RE engine 910 may be responsible for building one or more shadow models based on, for example, Hoeffding Trees or their variants such as Concept-Adapting Very Fast Decision Trees (CVFDT) to provide explanation of the detected concept drift. As shown in FIG. 14, for a classifier application, CVFDTs may be built by the RE engine 910 as a shadow model using the input streaming data with labels (from the labeler described above, for example). The CVFDTs may be built at various time frames (illustrated as Time=N 1402 and Time=N+1 1404 in FIG. 14). The change in the CVFDTs from time frame 1402 to time frame 1404 may be extracted to indicate the reasons for the detected concept drift. In the example of FIG. 14, the shadow learner is constructed to generate a CVFDT classification tree of various features in the machine learning model at different time frames. The two consecutive CVFDTs of FIG. 14 show that the second split feature in the trees has changed from time frame N to time frame N+1, and the change in the importance of model features likely has caused the detected concept drift. Such a feature change may be determined by the RE engine 910, formatted, and provided to the user as an explanation of the concept drift that may be understandable by human operators and data scientists.
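
For illustration only, the comparison of FIG. 14 is sketched below with a batch decision tree standing in for the CVFDT shadow model (a simplification, since CVFDTs adapt incrementally); the synthetic windows are constructed so that the important split feature changes between Time=N and Time=N+1.

```python
# Sketch of the shadow-learner comparison of FIG. 14, using a batch decision
# tree in place of the CVFDT and reporting the shift in feature importances
# as a candidate explanation of the drift.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def labeled_window(important_feature, n=2000, n_features=4):
    X = rng.uniform(-1, 1, size=(n, n_features))
    y = (X[:, important_feature] > 0).astype(int)
    return X, y

X_n, y_n = labeled_window(important_feature=1)      # Time = N (1402)
X_n1, y_n1 = labeled_window(important_feature=2)    # Time = N + 1 (1404)

tree_n = DecisionTreeClassifier(max_depth=3).fit(X_n, y_n)
tree_n1 = DecisionTreeClassifier(max_depth=3).fit(X_n1, y_n1)

shift = tree_n1.feature_importances_ - tree_n.feature_importances_
for i, s in enumerate(shift):
    if abs(s) > 0.5:   # illustrative threshold for "split feature changed"
        print(f"feature x{i}: importance changed by {s:+.2f} between windows")
```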

Turning back to FIG. 9, the model school (MS) 904 provides adaptive machine learning models against concept drift and is responsible for training and retraining machine learning models to be included in a library of models using input streaming data of different times, and for pre-emptive adjustment of the production model according to the detected concept drift. The MS 904 may be further responsible for keeping track of the model history in collaboration with the data manager 908. The MS 904 may be designed based on a set of parameters that may be pre-configured or may be adjusted by the user via an interactive user interface of the ACDE. Exemplary parameters may include when to adapt to new concepts after drift (for example, the MS 904 may be configured to adapt the production model on demand, on schedule, or whenever a concept drift is detected), the sampling strategy for training data in generating new machine learning models in the model library, and data retention/removal criteria. The MS 904 may be configured to be resource-aware, particularly in situations where the ACDE is applied to edge devices in a communication network.

As shown in FIG. 15, the model school may be configured to provide an ensemble implementation 1500 of machine learning models to the production model 902 of FIG. 9 (or 1112 of FIG. 11). Specifically, the production model 902 or 1112 may include an ensemble of models 1502 selected by the MS 904 from the model library. These models may include any combination of new models trained by the MS 904 as described above. The ensemble may further include the shadow models trained by the RE engine 910 above. The selection of the machine learning models 1502 by the MS 904 to form the production model or model ensemble 902 may be based on various metrics and parameters including but not limited to the number of models in the ensemble and the types of models. These parameters may be preconfigured or may be specified or configured according to user input. Each of the models 1502 in the model ensemble in production may provide a prediction 1504 for an input data item 1501 in the input data stream. The predictions may be weighted according to weights 1506 and combined according to a set of combination rules (as shown by 1508) to generate an overall prediction for the production model.
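
For illustration only, the ensemble production model of FIG. 15 can be sketched as a weighted average of member predictions; the member models, weights, and 0.5 cutoff below are illustrative assumptions.

```python
# Sketch of the production ensemble 1500: member predictions 1504 are
# weighted (1506) and combined by a simple rule (1508).
import numpy as np
from sklearn.linear_model import LogisticRegression

class ProductionEnsemble:
    def __init__(self, models, weights):
        self.models = models
        self.weights = np.asarray(weights, dtype=float)

    def predict_proba(self, x):
        """Weighted average of member click probabilities."""
        preds = np.array([m.predict_proba(x)[0, 1] for m in self.models])
        return float(preds @ self.weights / self.weights.sum())

    def predict(self, x, cutoff=0.5):
        return int(self.predict_proba(x) >= cutoff)   # combination rule 1508

# e.g., two members trained on different time windows of the stream
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X.sum(axis=1) > 0).astype(int)
m1 = LogisticRegression().fit(X[:100], y[:100])
m2 = LogisticRegression().fit(X[100:], y[100:])
ensemble = ProductionEnsemble([m1, m2], weights=[0.7, 0.3])
print(ensemble.predict_proba([[0.2, 0.4]]), ensemble.predict([[0.2, 0.4]]))
```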

FIG. 16 further illustrates an exemplary algorithm and corresponding data/logic flow 1600 for an MS implementation. As shown in FIG. 16, the exemplary MS implementation 1600 may include an ensemble generator 1602 which further includes a model trainer 1604 and an ensemble updator 1606. The training of new models by the model trainer 1604 may be triggered by the concept drift detector when a concept drift is detected, by the CVFDT shadow learner described above for the RE engine when a tree-splitting feature change is detected, or on-demand by human operators, as shown by 1608. The triggering information may be passed to the model trainer 1604 via the data manager. The model ensemble updator 1606 may be based on an Accuracy Update Ensemble (AUE) technique (or any other ensemble technique such as dynamic majority voting) which may further account for hardware and computing resource limitations and constraints. In some implementations, the AUE may update the model ensemble based on accuracy using sampled input data and labels 1612. The production ensemble may be updated either on schedule or on-demand by operators. New models trained by the model trainer 1604 may be based on recent streaming data and the labels generated, for example, by the data labeler described above in 1106 of FIG. 11, as shown by 1610 of FIG. 16.

In the implementation above for the model school, the AUE 1606 may alternatively be implemented as an optimization engine (e.g., a MOBOGA) and an ensemble simulator. The optimization may generate an optimized model ensemble as the production model with consideration of hardware and resource constraints. In some implementations, the optimization engine may optimize the model ensemble based on accuracy using sampled input data and labels. The ensemble optimization and simulation may be iterated, similar to the optimization and simulation process described above for FIG. 13 with respect to the detector ensemble optimization. The optimized ensemble may be used to update the production model either on schedule or on-demand.

FIG. 17 illustrates an exemplary computer architecture of a computer device 1700 on which the features of the ACDE and its various components are implemented for detecting concept drift of a machine learning model and for adapting the machine learning model according to the detected concept drift. The computer device 1700 includes communication interfaces 1702, system circuitry 1704, input/output (I/O) interface circuitry 1706, and display circuitry 1708. The graphical user interfaces (GUIs) 1710 displayed by the display circuitry 1708 may be representative of GUIs generated by the ACDE and its various components to, for example, receive user commands/input (e.g., on-demand triggers, adjustment of various parameters and metrics) and to display various alerts and explanations for the detected concept drift, as discussed above. The GUIs 1710 may be displayed locally using the display circuitry 1708, or for remote visualization, e.g., as HTML, JavaScript, audio, and video output for a web browser running on a local or remote machine. Among other interface features, the GUIs 1710 may further render displays of visual representations of, for example, the concept drift such as the illustrations shown in FIG. 14, the detector configurations, and the Hoeffding Trees and CVFDTs above.

The GUIs 1710 and the I/O interface circuitry 1706 may include touch sensitive displays, voice or facial recognition inputs, buttons, switches, speakers and other user interface elements. Additional examples of the I/O interface circuitry 1706 includes microphones, video and still image cameras, headset and microphone input/output jacks, Universal Serial Bus (USB) connectors, memory card slots, and other types of inputs. The I/O interface circuitry 1706 may further include magnetic or optical media interfaces (e.g., a CDROM or DVD drive), serial and parallel bus interfaces, and keyboard and mouse interfaces.

The communication interfaces 1702 may include wireless transmitters and receivers (“transceivers”) 1712 and any antennas 1714 used by the transmit and receive circuitry of the transceivers 1712. The transceivers 1712 and antennas 1714 may support WiFi network communications, for instance, under any version of IEEE 802.11, e.g., 802.11n or 802.11ac, or other wireless protocols such as Bluetooth, Wi-Fi, WLAN, cellular (4G, LTE/A). The communication interfaces 1702 may also include serial interfaces, such as universal serial bus (USB), serial ATA, IEEE 1394, Lightning port, I2C, slimBus, or other serial interfaces. The communication interfaces 1702 may also include wireline transceivers 1716 to support wired communication protocols. The wireline transceivers 1716 may provide physical layer interfaces for any of a wide range of communication protocols, such as any type of Ethernet, Gigabit Ethernet, optical networking protocols, data over cable service interface specification (DOCSIS), digital subscriber line (DSL), Synchronous Optical Network (SONET), or other protocol.

The system circuitry 1704 may include any combination of hardware, software, firmware, APIs, and/or other circuitry. The system circuitry 1704 may be implemented, for example, with one or more systems on a chip (SoC), application specific integrated circuits (ASIC), microprocessors, discrete analog and digital circuits, and other circuitry. The system circuitry 1704 may implement any desired functionality of the ACDE and its various components. As just one example, the system circuitry 1704 may include one or more instruction processor 1718 and memory 1720.

The memory 1720 stores, for example, control instructions 1722 for executing the features of the ACDE and its various components, as well as an operating system 1721. In one implementation, the processor 1718 executes the control instructions 1722 and the operating system 1721 to carry out any desired functionality for the ACDE and its various components.

The computer device 1700 may further include various data sources 1730. Each of the databases that are included in the data sources 1730 may be accessed by the ACDE and its various components.

Various implementations have been specifically described. However, other implementations that include a smaller, or greater, number of features and/or components for each of the apparatuses, methods, or other embodiments described herein are also possible.

Claims

1. A computer system for providing machine learning models in a production environment, the computer system comprising a circuitry configured to:

receive streaming data over a first time period;
execute a first machine learning model in the production environment to generate prediction outputs based on the streaming data;
configure a detection procedure to detect a concept drift of the first machine learning model based on the streaming data and the prediction outputs;
when the concept drift is detected, configure a model management procedure to automatically generate a second machine learning model having improved performance over the first machine learning model;
update the first machine learning model with the second machine learning model in the production environment for generating prediction outputs in a second time period next to the first time period;
periodically quantify the performance of the detection procedure and update the detection procedure based on the performance of the detection procedure.

2. The computer system of claim 1, wherein the first machine learning model and the second machine learning model each comprises an ensemble of machine learning models selected by the model management procedure.

3. The computer system of claim 2, wherein the prediction outputs of the first machine learning model and the second machine learning model comprise weighted prediction outputs of predictions of the ensemble of machine learning models.

4. The computer system of claim 2, wherein the ensemble of machine learning models are selected from a model library maintained by the circuitry.

5. The computer system of claim 4, wherein the circuitry is further configured to perform an update procedure for generating at least one new machine learning model or retraining at least one existing machine learning model in the model library as automatically triggered by the detection of the concept drift.

6. The computer system of claim 5, wherein the generation of the at least one new machine learning model or the retraining of the at least one existing machine learning model is based on the received streaming data labeled according to response received from a separate application configured to provide the streaming data for user consumption.

7. The computer system of claim 1, wherein the detection procedure comprises two or more detection paths that are weighted for detecting the concept drift of the first machine learning model.

8. The computer system of claim 7, wherein the circuitry is configured to update the detection procedure by modifying a number of, a composition within, or weights between the two or more detection paths.

9. The computer system of claim 1, wherein the circuitry is configured to update the detection procedure based on a multi-objective Bayesian optimization.

10. The computer system of claim 1, wherein the circuitry is further configured to automatically generate a graphical representation of the detected concept drift of the first machine learning model in the production environment.

11. The computer system of claim 10, wherein the graphical representation of the detected concept drift of the first machine learning model in the production environment comprises a time evolution of a decision tree of a plurality of model features derived from the first machine learning model using a shadow learner.

12. The computer system of claim 1, wherein the detection procedure, the model management procedure, and the generation of the second machine learning model are adaptively configured according to a set of parameters transformed from a set of computing resource constraints.

13. A method for providing machine learning models in a production environment comprising:

receiving streaming data over a first time period;
executing a first machine learning model in the production environment to generate prediction outputs based on the streaming data;
configuring a detection procedure to detect a concept drift of the first machine learning model based on the streaming data and the prediction outputs;
when the concept drift is detected, configuring a model management procedure to automatically generate a second machine learning model having improved performance over the first machine learning model;
updating the first machine learning model with the second machine learning model in the production environment for generating prediction outputs in a second time period next to the first time period;
periodically quantifying performance of the detection procedure and updating the detection procedure based on the performance of the detection procedure.

14. The method of claim 13, wherein the first machine learning model and the second machine learning model each comprises an ensemble of machine learning models selected by the model management procedure.

15. The method of claim 14, wherein the prediction outputs of the first machine learning model and the second machine learning model comprise weighted prediction outputs of predictions of the ensemble of machine learning models.

16. The method of claim 15, wherein the ensemble of machine learning models are selected from a model library and wherein the method further comprises performing an update procedure for generating at least one new machine learning model or retraining at least one existing machine learning model in the model library as automatically triggered by the detection of the concept drift.

17. The method of claim 16, wherein the generation of the at least one new machine learning model or the retraining of the at least one existing machine learning model is based on the received streaming data labeled according to response received from a separate application configured to provide the streaming data for user consumption.

18. The method of claim 13, wherein the detection procedure comprises two or more detection paths that are weighted for detecting the concept drift of the first machine learning model and wherein the method further comprises updating the detection procedure by modifying a number of, a composition within, or weights between the two or more detection paths.

19. The method of claim 13, further comprising automatically generating a graphical representation of the detected concept drift of the first machine learning model in the production environment, wherein the graphical representation comprises a time evolution of a decision tree of a plurality of model features derived from the first machine learning model using a shadow learner.

20. The method of claim 13, wherein the detection procedure, the model management procedure, and the generation of the second machine learning model are adaptively configured according to a set of parameters transformed from a set of computing resource constraints.

Patent History
Publication number: 20210224696
Type: Application
Filed: Jan 20, 2021
Publication Date: Jul 22, 2021
Inventors: Mohamad Mehdi Nasr-Azadani (Menlo Park, CA), Andrew Hoonsik Nam (San Francisco, CA), Teresa Sheausan Tung (Tustin, CA)
Application Number: 17/153,237
Classifications
International Classification: G06N 20/20 (20060101); G06N 5/00 (20060101); G06N 5/04 (20060101);