SYSTEMS AND METHODS FOR ENHANCED ACTIVE MACHINE LEARNING THROUGH PROCESSING OF PARTITIONED UNCERTAINTY
Systems and methods are disclosed herein for improving machine learning of a data set. In one example, the method may include training a predictive model on an initial data set comprising labeled data, wherein the training is performed in an active learning system. The method may further include generating a set of parameters based on the training and introducing an unlabeled data set into the predictive model. According to some embodiments, the method may further include applying the set of parameters to the unlabeled data set, generating a set of predictions associated with the applied set of parameters and calculating a first uncertainty score and a second uncertainty score associated with the generated set of predictions. Moreover, the method may also include modifying the data set based on the first uncertainty score, and modifying the predictive model based on the second uncertainty score.
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms may be used in a wide variety of applications, such as email filtering, computer vision, fraud detection, and the like.
In order to handle the vast quantities of data that must be processed in today's automated environments, data scientists and analysts turn to machine learning to expand processing capacity. Machine learning can construct a predictive model using artificial neural networks (ANNs) that process the training data and produce predictive outputs (predictions). In doing so, ANNs mimic biological processes by using labeled initial data to “learn” what labels to associate with subsequent inputs.
In some instances, the predicted output may include one or more uncertainties associated with the model, the information collection process, the type of information collected, the information size or dimensionality, the information processing, and the like. However, predictive models typically do not account for these uncertainties, and operators/scientists cannot always determine how to calculate these uncertainties, or how they affect the predicted output. Models that do predict uncertainty often focus on only a single source of uncertainty, which does not provide users with a holistic picture of the modeling problem.
To solve the existing problems in the art, embodiments of the present disclosure deploy a predictive model that accounts for model-based and data-based uncertainties and that can subsequently alert an operator and/or self-adjust in order to minimize uncertainties in the output prediction and/or intelligently deal with high uncertainties in a well-informed manner. This can increase predictive accuracy, provide business-critical information about the expected error in a prediction, and provide operators an awareness of potential model bugs.
Examples of applications for the below methods, media, and systems are numerous, but a few are given here merely to indicate possible uses. Other applications will be readily apparent to one of skill in the relevant arts and are likewise contemplated by this disclosure. Moreover, description of “an embodiment” or “one embodiment” should not be construed as limiting the scope of the disclosure, as elements, sub-elements, and features of a given embodiment may also be used in other embodiments of the disclosure. While methods described herein may have steps described in a specified order, it will be understood that some of those steps may be re-arranged or performed in a different order. Additionally, embodiments may be given describing applications to particular industries or commercial fields, but the scope of the disclosure is not so limited.
Descriptions are given with reference to the figures included herein. When possible and for clarity, reference numbers are kept consistent from figure to figure. Some of the figures are simplified diagrams, which are not to be interpreted as drawn to scale or spatially limiting for the described embodiments. Where appropriate, the particular perspective or orientation of a figure will be given to increase understanding of the depicted features.
Communication network 125 may itself be comprised of many interconnected computer systems and communication links. Communication links 130 may be hardwired links, optical links, satellite or other wireless communications links, wave propagation links, or any other mechanisms for communication of information. Various communication protocols may be used to facilitate communication between the various systems of distributed computer network 100.
Distributed computer network 100 may include client systems 105 and 110, server system 120, database 115, and communication network 125, with communication links 130 connecting these components.
Client systems 105 and 110 typically request information from a server system, which provides the information. Server systems typically have more computing and storage capacity than client systems. However, a particular computer system may act as either a client or a server depending on whether the computer system is requesting or providing information. Aspects of the system may be embodied using a client-server environment or a cloud computing environment.
In one example, server system 120 receives information requests from client systems 105 and 110, performs processing required to satisfy the requests, and forwards the results corresponding to the requests back to the requesting client system. This may include performing an initial labeling of an initial data set for training purposes. Accordingly, client systems 105/110 may request that server system 120 process an initial data set to generate a predictive model. This may entail that client systems 105/110 provide the initial data set or, in the alternative, that server system 120 retrieve the initial data set from database 115. One of ordinary skill in the art would appreciate that the initial data set may also be initially stored, or downloaded and stored, by server system 120. Generating the predictive model may initially include labeling of unlabeled data using client systems 105/110, either through an embedded algorithm at client systems 105/110, or through an interaction between data scientists using client systems 105/110 and server system 120. For example, a data scientist using one of client systems 105/110 may connect to server system 120 through the system's user interface engine 122. General principles of labeling and training of neural networks are known in the art, for example, as described in Huzar et al., U.S. Pat. Pub. No. 2018/0240031, which is incorporated herein by reference in its entirety.
The processing required to satisfy the request may be performed by server system 120 or may alternatively be delegated to other servers connected to communication network 125. In another example, server system 120 may request information from client devices 105/110 to provide feed data for labeling and to enrich its predictive model as will be further described herein.
Client systems 105/110 enable users/data scientists to access and query information or applications stored by server system 120. A client system may be a computing device. Some example client systems include desktop computers, portable electronic devices (e.g., mobile communications devices, smartphones, tablet computers, laptops) such as the Samsung Galaxy Tab®, Google Nexus devices, Amazon Kindle®, Kindle Fire®, Apple iPhone®, the Apple iPad®, Microsoft Surface®, the Palm Pre™, or any device running the Apple iOS™, Android™ OS, Google Chrome OS, Symbian OS®, Windows Mobile® OS, Windows Phone, BlackBerry OS, Embedded Linux, webOS, Palm OS® or Palm Web OS™.
In a specific embodiment, a “web browser” application executing on a client system enables users to select, access, retrieve, or query information and/or applications stored by server system 120. Examples of web browsers include the Android browser provided by Google, the Safari® browser provided by Apple, Amazon Silk® provided by Amazon, the Opera Web browser provided by Opera Software, the BlackBerry® browser provided by Research In Motion, the Internet Explorer® and Internet Explorer Mobile browsers provided by Microsoft Corporation, the Firefox® and Firefox for Mobile browsers provided by Mozilla®, and others (e.g., Google Chrome).
In another embodiment, server system 120 may also monitor the activities of client systems 105/110 and identify security breaches or alerts that are triggered by irregular activities or activities tracked and sent to the server by network monitoring devices (not shown), or identify instances of fraud, identity theft, and other financial crime activities. For example, identifying predictions with high uncertainty related to lack of training data may indicate the presence of synthetically generated data common to first person fraud cases. Identifying such a case requires not only estimating uncertainty but also identifying that the source of uncertainty is the lack of similar training data. Generating new synthetic data is a known strategy employed by fraudulent users to circumvent existing models that make predictions without taking into account prediction uncertainty. As another example, legitimate customers may make transactions that are similar to the purchase behavior of a fraudulent user. A model that is not sensitive to uncertainties might predict that these transactions are fraudulent, resulting in adverse outcomes for the legitimate customer. However, a model sensitive to uncertainty may predict that the transactions are fraudulent but also display a high uncertainty in that prediction. A human can use this additional information to make more informed judgment calls about how to treat the case of potential fraud.
An active learning system can build on machine learning models by augmenting a model with an algorithm that allows it to request additional labels from unlabeled data. Bayesian neural networks offer a principled approach to active learning. According to some embodiments, the flexibility provided by a neural network may be complemented with the uncertainty measurement provided by the Bayesian network. By providing a flexible model for prediction, and combining it with a quantification of the model's uncertainty in a prediction, the model's prediction accuracy is enhanced. From this, the uncertainty may be decomposed into partitions based on two or more of epistemic uncertainty, aleatoric uncertainty, and out of sample uncertainty. Epistemic uncertainty is the scientific uncertainty in the model parameters used to predict the process of interest. It is due to limited data and knowledge. In other words, epistemic uncertainty may be thought of as being driven by the training data not being sufficient to constrain the model's prediction. Aleatoric uncertainty refers to the inherent uncertainty due to probabilistic variability. In other words, aleatoric uncertainty may be thought of as being driven by intrinsic uncertainty in the data—i.e., similar inputs being associated with different outputs. Accordingly, when epistemic uncertainty is high, server system 120 may determine that such inputs are good candidates for manual labeling and transmit the inputs to devices 105/110 for a scientist to perform the labeling. After labeling, the inputs are sent back to server system 120 to be included in future training datasets, which will help constrain the model when making predictions on similar data. When aleatoric uncertainty is high, increasing accuracy will be possible by designing/measuring new features, as will be further described herein. Out of sample uncertainty is the uncertainty that comes from trying to make a prediction on a sample dissimilar to any sample in the training data. According to some embodiments, epistemic uncertainty may be used as a proxy for out of sample uncertainty. The systems described herein may or may not make this assumption. If epistemic uncertainty is not deemed a suitable proxy for out of sample uncertainty, then a separate model may be used to directly model the out of sample uncertainty in predictions. According to some embodiments, local computing bandwidth considerations may be taken into account when determining which model to implement. For example, if a device providing the predictive model has limited computing capacity, then a single model (e.g., aleatoric/epistemic) may be used. Otherwise, two models may be used. The use of two predictive models can improve predictive accuracy.
It may be understood that an artificial neural network is a system of hardware and software deploying algorithms to perform computation and deep learning, which can provide, among other things, predictions. Artificial neural network 200 may be any type of neural network, including, for example, a feedforward neural network, a radial basis function neural network, a multilayer perceptron neural network, a convolutional neural network, and a recurrent neural network. Each type of neural network has benefits and challenges in implementation depending on the intended purpose, usage, and design of the network. One such example of a feedforward neural network is a Bayesian neural network. In a Bayesian neural network, multiple predictions are sampled from the neural network for each new input, so the output is probabilistic rather than a single value. In other words, multiple samples of the outputs are produced for each new input. Each of those samples is a percentage or likelihood of the categorization that is taking place. As an example, a Bayesian neural network could draw 100 different samples, where the first sample returns a 75 percent likelihood of a particular result, the second sample gives an 80 percent likelihood, the next sample produces a 52 percent likelihood, and so forth. This would continue until the 100 samples are completed, giving 100 probabilities with which to determine a certain outcome.
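As a rough illustration of the sampling just described, the following sketch draws repeated predictions for a single input from an assumed Gaussian posterior over the weights of a toy one-layer classifier. The layer shape, posterior values, and all variable names are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy posterior: each weight of a single linear layer is modeled as an
# independent Gaussian whose mean and standard deviation were fit during training.
posterior_mean = np.array([0.8, -0.4, 1.1])
posterior_std = np.array([0.15, 0.10, 0.20])

def sample_prediction(x):
    """Draw one set of weights from the posterior and return one predicted likelihood."""
    w = rng.normal(posterior_mean, posterior_std)
    return 1.0 / (1.0 + np.exp(-(x @ w)))  # sigmoid: likelihood of the positive category

x_new = np.array([0.5, -1.2, 0.3])                         # one unlabeled input
samples = np.array([sample_prediction(x_new) for _ in range(100)])
# samples now holds 100 probabilities for this single input, e.g. 0.75, 0.80, 0.52, ...
```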
As shown in the figure, artificial neural network 200 includes a layer of input nodes 212, 214, and 216, a layer of intermediary nodes 222, 224, 226, and 228, and a layer of output nodes 232 and 234.
In one example, one or more of the layers of the neural network 200 that generate an output from a received input in accordance with current values of a set of parameters are Bayesian neural network layers. A Bayesian neural network layer is a layer for which each parameter is associated with data representing a probability distribution over the possible values of the parameter. That is, rather than maintaining current values for the parameters of the Bayesian neural network layers, neural network 200 maintains data representing probability distributions for each parameter of each Bayesian neural network layer, e.g., in a probability distribution data repository (e.g., database 115 or a memory within server system 120). In particular, as will be described in more detail below, the probability distribution data repository stores, for each parameter of each Bayesian neural network layer, the parameters of an approximation of the probability distribution for the parameters.
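One way to organize such a repository, sketched here purely as an assumed data layout (the class name and fields are not from the disclosure), is to store the parameters of the approximate posterior, for example a mean and a standard deviation, for every weight of each Bayesian layer rather than a single point value per weight.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BayesianLayerPosterior:
    """Approximate (e.g., Gaussian) posterior for one layer's parameters."""
    weight_mean: np.ndarray   # same shape as the layer's weight matrix
    weight_std: np.ndarray    # per-weight spread instead of a single fixed value

    def sample_weights(self, rng):
        """Draw one concrete weight matrix from the stored distribution."""
        return rng.normal(self.weight_mean, self.weight_std)

# A simple in-memory stand-in for the probability distribution data repository.
repository = {
    "layer_1": BayesianLayerPosterior(np.zeros((3, 4)), 0.1 * np.ones((3, 4))),
    "layer_2": BayesianLayerPosterior(np.zeros((4, 2)), 0.1 * np.ones((4, 2))),
}
```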
According to some aspects, the input features interact with each other through adjacent relationships, like layers of less complex predictive models overlaid in linear combinations. This may include, for example, all of the input features, or some of the input features. Generally, a Bayesian neural network has a large number of parameters. Those parameters are treated as random variables rather than as fixed numbers, because of the repeated sampling of the inputs. When training the model, a specific single value is not learned for each parameter; rather, a distribution over the possible values that the parameter can take is determined. A parameter might have a Gaussian/bell-shaped distribution, with the majority of the samples falling near some average value, e.g., within plus or minus five standard deviations of the average. This distribution can be fit through a variety of inferential techniques including but not limited to Markov chain Monte Carlo (MCMC), variational inference (VI), and Laplace approximations familiar to those skilled in the art. These techniques solve complex numerical problems through estimation to provide a distribution over likely model parameters. The numerical problems that are solved are created by defining a prior and likelihood distribution that collectively represent the information of the model builder and the training data. The epistemic uncertainty of a prediction can be derived by repeatedly sampling the distribution of the model parameters and making predictions. Each sample from the model parameters represents a potential model that could be used to describe the mean of the data process. The model could also predict the uncertainty inherent in the data process, previously referred to as aleatoric uncertainty. In this way one model would predict both epistemic and aleatoric uncertainty. Predicting the aleatoric uncertainty will be further described herein.
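The point that a single sampled model can describe both the mean of the data process and the data's inherent noise can be sketched as follows. The Gaussian noise model, the split of the weight vector into a mean head and a noise head, and every name here are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mean_and_noise(x, posterior_mean, posterior_std):
    """One posterior draw predicts both the mean of the target and an
    input-dependent noise variance (the aleatoric component)."""
    w = rng.normal(posterior_mean, posterior_std)   # epistemic: weights differ per draw
    w_mean, w_noise = w[: x.size], w[x.size:]
    predicted_mean = x @ w_mean
    noise_variance = np.exp(x @ w_noise)            # exponentiate to keep variance positive
    return predicted_mean, noise_variance

x = np.array([0.5, -1.2, 0.3])
post_mean, post_std = rng.normal(size=6), 0.05 * np.ones(6)
draws = [sample_mean_and_noise(x, post_mean, post_std) for _ in range(100)]

epistemic_variance = np.var([m for m, _ in draws])    # spread of the predicted means across draws
aleatoric_variance = np.mean([v for _, v in draws])   # average predicted data noise
```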
In one example, when averaging the output of those samples, the result of that calculation is itself a probability. Breaking down the parts of the uncertainty in the calculation requires computing the entropy of the results, which effectively quantifies the uncertainty. For example, in the case of binary prediction, if the prediction is “50/50”, entropy is highest, because the outcome is completely uncertain. On the other hand, if that prediction is 0/100 or 100/0, entropy is 0 (i.e., the prediction is completely certain). The entropy of the average prediction, which is a probability over multiple outcomes, estimates aleatoric uncertainty. On the other hand, taking the entropy of each of the predictions and averaging those values estimates epistemic uncertainty. These particular calculations are merely examples; other methods of calculation are possible and within the scope of the application.
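A minimal sketch of the entropy-based scores just described, following the decomposition used in this disclosure (entropy of the average prediction as one score, average of the per-sample entropies as the other); the binary setup, sample values, and function names are assumptions.

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a binary prediction: 0 when p is 0 or 1, maximal at 0.5."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def uncertainty_scores(sampled_probs):
    """sampled_probs: per-sample predicted probabilities for one input."""
    aleatoric_score = binary_entropy(sampled_probs.mean())   # entropy of the average prediction
    epistemic_score = binary_entropy(sampled_probs).mean()   # average entropy of each prediction
    return aleatoric_score, epistemic_score

samples = np.array([0.75, 0.80, 0.52, 0.64, 0.71])  # e.g., five posterior draws for one input
aleatoric, epistemic = uncertainty_scores(samples)
```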
The outputs that are possible are represented by output nodes 232 and 234.
As mentioned above, with this type of modeling the input nodes result in a probability distribution for the inputs. The results can be viewed as a histogram against all the possible probability scores that could be expected. Using that histogram, the range of probability scores that a given input is likely to produce can be examined.
As a further example, in a system with a large number of transactions, artificial neural network 200 can be a machine-learned system to check the validity of those transactions. Input nodes 212, 214, and 216 are various transactions that must be monitored for validity. Output nodes 232 and 234 can be the decision as to whether the inputs are valid transactions. Intermediary nodes 222, 224, 226, and 228 represent the computation in determining whether those transactions are, in fact, valid transactions.
Artificial neural network 200 is an example for understanding the context of the above and below discussions of the application. It should be understood that additional nodes may be possible in an implementation of the present disclosure. The number of input, intermediary, and/or output nodes would be constrained only by the amount of data entering the system and the outputs required. As such, one of skill in the relevant arts would recognize that a much larger number of input nodes, intermediary nodes, and output nodes would be likely in various implementations. The three columns are also shown only as an example, as multiple columns of intermediary layers, feeding into one another and representing much more complex computation, are possible. In other words, inputs may go through a number of different analysis steps, rather than simply the one intermediary layer shown in the example.
As noted herein, active learning systems build on machine learning models by augmenting the model with an algorithm that allows the active learning systems to request additional labels from unlabeled data. The use of Bayesian neural networks allows for the decomposition of uncertainty in two different ways to get an aleatoric and an epistemic uncertainty of the predictive model, the combination of which can lead to decisions that can help improve the performance of the model and layer better judgment on the use of model outputs. For example, a predictive model may be trained on an initial labeled data set. As new unlabeled data comes in, epistemic uncertainty scores are computed and collected. Human experts can be assigned to label the data points with the highest epistemic uncertainty scores. Simultaneously, data points with high aleatoric uncertainty in the training data can be reviewed by data scientists, and new features or data collection processes can be engineered that increase accuracy on these data points. These data points and features can then be computed and added to the training set, and the model can again be refit. If data points with high out of sample uncertainty are identified, it may alert modelers that the training dataset was not collectively exhaustive, allowing for data collection where the model was previously blind. Identification of the source of uncertainty allows for more efficient treatment of the uncertainty by the human-in-the-loop component. For example, aleatoric uncertainty will not decrease through the collection of more data, but instead requires changes to the data collection process itself. Identifying that an uncertainty is of an aleatoric nature allows humans to not only diagnose sources of uncertainty, but to also treat them properly.
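One active learning round along these lines might be organized as sketched below. The thresholds, the model interface (uncertainty_scores, fit), and the human labeling callback are all assumptions used only to make the flow concrete.

```python
EPISTEMIC_THRESHOLD = 0.8   # assumed cutoffs; in practice these would be tuned
ALEATORIC_THRESHOLD = 0.9

def active_learning_round(model, labeled, unlabeled, request_human_label):
    """One round: score incoming data, route it by uncertainty type, then refit."""
    scored = [(x, model.uncertainty_scores(x)) for x in unlabeled]

    # High epistemic uncertainty: request a human label and add the point to the
    # training set so that similar predictions are better constrained next time.
    for x, (aleatoric, epistemic) in scored:
        if epistemic > EPISTEMIC_THRESHOLD:
            labeled.append((x, request_human_label(x)))

    # High aleatoric uncertainty: flag for new features or changes to the data
    # collection process rather than for more labels.
    review_queue = [x for x, (aleatoric, _) in scored if aleatoric > ALEATORIC_THRESHOLD]

    model.fit(labeled)   # refit on the augmented training set
    return model, review_queue
```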
Aspects of the present disclosure provide solutions that improve the active learning process of the neural network by relying on partitioned uncertainties of the predictions and determining levels of confidence in those predictions. In some embodiments described herein, the labeling may be enhanced when it is determined that uncertainty in the data is sufficiently high to warrant additional labeling being performed by a data scientist, for example. Moreover, the active learning process may be further enhanced in situations where the inherent uncertainties in the model itself are sufficiently high and would require an adjustment of the model itself or of the data collection process for training samples. Embodiments and examples of these methods are further described herein.
As noted herein, after training is performed, a set of predictions associated with the applied set of parameters may be generated. This may be done by using a sampling method in which a number of samples are drawn and the probabilities with which a certain outcome is determined are calculated. As an example, a Bayesian neural network could draw 100 different samples for a prediction, where the first sample returns a 75 percent probability of a particular result, the second sample gives an 80 percent probability, the next sample produces a 52 percent probability, and so forth. This would continue until the 100 samples are completed, giving 100 probabilities with which to determine a certain outcome/prediction. The probability of each individual sample may represent an estimate of aleatoric uncertainty. The distribution across all 100 samples may represent the epistemic uncertainty. The 100 samples would thus represent a probability distribution over probability distributions. According to some aspects, the epistemic uncertainty alone may yield a posterior distribution in the Bayesian statistical model. The posterior and/or posterior predictive uncertainties can be used in a variety of ways, such as traditional estimation/prediction, decision theoretic uses, and/or reinforcement learning. Accordingly, generating the set of sampled predictions may include generating a probabilistic distribution of possible values for each parameter (epistemic uncertainty), with the true value describing the process that generated the training data often assumed to be the most probable value within the probabilistic distribution.
After the set of predictions is generated, server system 120 may calculate the uncertainties associated with the predictions to determine whether operation should continue or whether corrective measures need to be taken (and the nature of those measures). In this regard, server system 120 may calculate the epistemic and aleatoric uncertainties of the model.
An example calculation of the aleatoric uncertainty score is described herein: the sampled predictions are averaged, and the entropy of that average prediction is output as the aleatoric uncertainty score.
An example calculation of the epistemic uncertainty score is described herein: an entropy value is calculated for each sampled prediction, and the average of those entropy values is output as the epistemic uncertainty score.
The following example demonstrates the potential application of the proposed solutions of the present disclosure. According to some embodiments, an enterprise may deploy a predictive model that provides business critical decisions. While every predictive model may have strengths and weaknesses, it is important for the enterprise to know what these weaknesses are when they occur, and how they can be alleviated. This is especially critical for predictive models that are involved in real-time applications, like, for example, driverless automotive decision making, fraud detection as discussed herein, and the like. According to aspects of the present disclosure, two machine learning approaches may be combined, e.g., neural networks and Bayesian statistical modeling, in a practical design, where human agents, such as quality assurance agents, or agents dealing with live customers, can use outputs of the model to quickly identify any issues in the predictions, and how to best address them. According to some aspects, combining the epistemic uncertainty and aleatoric uncertainty allows the predictive model to be smarter, self-correcting and more intuitive. For example, knowing the sufficiency of the training data and the inherent uncertainties in the probabilistic variables enables the model, and subsequently, operators, to address any potential deficiencies, as will be further discussed in the following example.
In one example, the initial data set may be related to financial information associated with a decisioning process for financial institutions. Here, the decisioning process may include fraud detection, decisions relating to fraud, and/or money laundering decisions as discussed herein. A fraud alert may be issued by, e.g., server system 120 to one or more remote devices. The fraud alert may be issued in response to an output of the predictive model being above a predetermined threshold. For example, using an application on clients 105/110, a potential customer/user may be asked to provide a photograph of government identification to verify the identity of the potential customer/user. A model may be used to predict the validity of the identification in the photograph using the photograph itself or derived features as input. According to some aspects, the model may predict two or more of A) the validity of the identification, B) the aleatoric uncertainty from the data (i.e., stemming from the grain of the photo or the lighting conditions in which the photo was taken), C) the epistemic uncertainty from the model (i.e., relating to the number of training samples with similar characteristics), and D) the out of sample uncertainty (i.e., whether any training examples exist with that form of identification).
According to some aspects, the type of fraud alert issued in response to the model predictions may differ. For example, a prediction of fraud with low uncertainty may result in notification of the authorities. A prediction with high aleatoric uncertainty may prompt the potential customer to send a higher quality photo. This may be done, for example, through the application on clients 105/110, where the potential customer would be prompted to provide an image of a higher quality. As noted herein, the aleatoric uncertainty refers to the inherent uncertainty due to probabilistic variability—i.e., driven by intrinsic uncertainty in the data—and therefore newer/better data is requested from the potential customer. A prediction with high epistemic uncertainty may prompt a data scientist to add the sample to a future training set to better fit the model. As noted herein, the epistemic uncertainty is the scientific uncertainty in the model parameters used to predict the process of interest and is due to limited data and/or knowledge; in other words, it arises where the training data is deemed insufficient. Accordingly, it is beneficial to prompt a data scientist to add the sample to a future training set to better fit the model. A prediction with high out of sample uncertainty may prompt further scrutiny from an internal fraud agent trained to deal with more difficult cases of potential fraud, or an alert may be sent prompting the potential customer to submit a different form of identification. This is because out of sample uncertainty relates to attempting to make a prediction on a sample that is dissimilar to any sample in the training data. Accordingly, further scrutiny may be required.
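The differing alert types in this example can be summarized in a small routing function. The threshold, action names, and argument names below are assumptions, included only to make the branching explicit.

```python
def route_id_verification(is_fraud_predicted, aleatoric, epistemic, out_of_sample,
                          threshold=0.8):
    """Pick a follow-up action for an ID check based on the dominant uncertainty."""
    if out_of_sample > threshold:
        # Sample unlike anything in training: escalate, or request another form of ID.
        return "escalate_to_fraud_agent_or_request_different_identification"
    if aleatoric > threshold:
        # Noisy data (photo grain, lighting): ask for a better photo.
        return "prompt_customer_for_higher_quality_photo"
    if epistemic > threshold:
        # Too few similar training samples: queue the sample for future training.
        return "add_sample_to_future_training_set"
    if is_fraud_predicted:
        return "issue_fraud_alert"
    return "approve_identification"
```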
Various embodiments may be implemented, for example, using one or more well-known computer systems, such as computer system 600 described below.
Computer system 600 may include one or more processors (also called central processing units, or CPUs), such as a processor 604. Processor 604 may be connected to a communication infrastructure or bus 606.
Computer system 600 may also include user input/output device(s) 603, such as monitors, keyboards, pointing devices, etc., which may communicate with communication infrastructure 606 through user input/output interface(s) 602.
One or more of processors 604 may be a graphics processing unit (GPU). In an embodiment, a GPU may be a processor that is a specialized electronic circuit designed to process mathematically intensive applications. The GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
Computer system 600 may also include a main or primary memory 608, such as random-access memory (RAM). Main memory 608 may include one or more levels of cache. Main memory 608 may have stored therein control logic (i.e., computer software) and/or data.
Computer system 600 may also include one or more secondary storage devices or memory 610. Secondary memory 610 may include, for example, a hard disk drive 612 and/or a removable storage device or drive 614. Removable storage drive 614 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, tape backup device, and/or any other storage device/drive.
Removable storage drive 614 may interact with a removable storage unit 618. Removable storage unit 618 may include a computer usable or readable storage device having stored thereon computer software (control logic) and/or data. Removable storage unit 618 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device. Removable storage drive 614 may read from and/or write to removable storage unit 618.
Secondary memory 610 may include other means, devices, components, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 600. Such means, devices, components, instrumentalities or other approaches may include, for example, a removable storage unit 622 and an interface 620. Examples of the removable storage unit 622 and the interface 620 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
Computer system 600 may further include a communication or network interface 624. Communication interface 624 may enable computer system 600 to communicate and interact with any combination of external devices, external networks, external entities, etc. (individually and collectively referenced by reference number 628). For example, communication interface 624 may allow computer system 600 to communicate with external or remote devices 628 over communications path 626, which may be wired and/or wireless (or a combination thereof), and which may include any combination of LANs, WANs, the Internet, etc. Control logic and/or data may be transmitted to and from computer system 600 via communication path 626.
Computer system 600 may also be any of a personal digital assistant (PDA), desktop workstation, laptop or notebook computer, netbook, tablet, smart phone, smart watch or other wearable, appliance, part of the Internet-of-Things, and/or embedded system, to name a few non-limiting examples, or any combination thereof.
Computer system 600 may be a client or server, accessing or hosting any applications and/or data through any delivery paradigm, including but not limited to remote or distributed cloud computing solutions; local or on-premises software (“on-premise” cloud-based solutions); “as a service” models (e.g., content as a service (CaaS), digital content as a service (DCaaS), software as a service (SaaS), managed software as a service (MSaaS), platform as a service (PaaS), desktop as a service (DaaS), framework as a service (FaaS), backend as a service (BaaS), mobile backend as a service (MBaaS), infrastructure as a service (IaaS), etc.); and/or a hybrid model including any combination of the foregoing examples or other services or delivery paradigms.
Any applicable data structures, file formats, and schemas in computer system 600 may be derived from standards including but not limited to JavaScript Object Notation (JSON), Extensible Markup Language (XML), Yet Another Markup Language (YAML), Extensible Hypertext Markup Language (XHTML), Wireless Markup Language (WML), MessagePack, XML User Interface Language (XUL), or any other functionally similar representations alone or in combination. Alternatively, proprietary data structures, formats or schemas may be used, either exclusively or in combination with known or open standards.
In some embodiments, a tangible, non-transitory apparatus or article of manufacture comprising a tangible, non-transitory computer useable or readable medium having control logic (software) stored thereon may also be referred to herein as a computer program product or program storage device. This includes, but is not limited to, computer system 600, main memory 608, secondary memory 610, and removable storage units 618 and 622, as well as tangible articles of manufacture embodying any combination of the foregoing. Such control logic, when executed by one or more data processing devices (such as computer system 600), may cause such data processing devices to operate as described herein.
Based on the teachings contained in this disclosure, it will be apparent to persons skilled in the relevant art(s) how to make and use embodiments of this disclosure using data processing devices, computer systems, and/or computer architectures other than those described herein.
Various embodiments of the present disclosure determine partitioned uncertainty scores using different methods to assess the performance of the predictive model. Based on the uncertainty results, the system may modify the model itself or may rely on manual updating of labeling for continued active learning within the model. This improves the performance and accuracy of the model by allowing for continued monitoring and adjustment of the model using additional labeling and/or model adjustment. Moreover, this enables a processing device, such as server 120, to review and label only a small portion of transactions/unlabeled data for training purposes, because the solutions advocated herein allow for a more accurate adaptive and active learning model. A use case has been provided above where transaction validity is monitored by the machine-learned process. However, other examples are contemplated within the scope of the application as seen through other embodiments of the disclosure. As one example, the above systems, media, and methods can be used by a financial institution in an effort to detect money laundering. While a large number of transactions must be monitored in order to detect those that relate back to nefarious purposes, manual classification of each transaction is not possible and can be aided by machine-learned predictive models. Another example of data-intensive decisioning comes from credit determination applications. The ability to monitor large volumes of transactions across multiple credit-rating agencies internally at a financial institution would allow for computer-aided determinations for credit lines and products. Moreover, additional embodiments are possible in non-financial settings, such as medical diagnostics, life sciences research, image reproduction and processing, and voice/speech recognition applications.
Descriptions of an embodiment contemplate various combinations, components, and sub-components. However, it will be understood that other combinations of the components and sub-components may be possible while still accomplishing the various aims of the present application. As such, the described embodiments are merely examples, and additional examples may fall within the scope of the disclosure.
Claims
1. A computer-implemented method of improving machine learning of a data set, the method comprising:
- training a predictive model on an initial data set comprising labeled data, wherein the training is performed in an active learning system;
- generating a set of parameters based on the training;
- introducing an unlabeled data set into the predictive model;
- applying the set of parameters to the unlabeled data set;
- generating a set of predictions associated with the applied set of parameters;
- calculating a first uncertainty score and a second uncertainty score associated with the generated set of predictions;
- modifying the data set based on the first uncertainty score; and
- modifying the predictive model based on the second uncertainty score.
2. The computer-implemented method of claim 1, wherein generating the set of predictions further comprises:
- generating a probabilistic distribution of possible values for each parameter of the applied set of parameters;
- selecting a value for each parameter of the applied set of parameters, the value being a highest value within the probabilistic distribution; and
- assigning the selected value to be a predicted value for each parameter of the applied set of parameters.
3. The computer-implemented method of claim 2, wherein the first uncertainty score is an aleatoric uncertainty score.
4. The computer-implemented method of claim 3, wherein calculating the aleatoric uncertainty score comprises:
- calculating an average prediction value of an overall output of the predictive model, the overall output including a prediction for each parameter of the set of parameters;
- generating an entropy value based on the average value; and
- outputting the entropy value as the aleatoric uncertainty score.
5. The computer-implemented method of claim 4, further comprising:
- modifying the generated parameters in response to the aleatoric uncertainty score being above a predetermined threshold.
6. The computer-implemented method of claim 5, further comprising retraining the predictive model subsequent to modifying the generated parameters.
7. The computer-implemented method of claim 2, wherein the second uncertainty score is an epistemic uncertainty score.
8. The computer-implemented method of claim 7, wherein calculating the epistemic uncertainty score comprises:
- calculating an entropy value associated with each predicted value of the applied set of parameters;
- calculating an average of the calculated entropy values; and
- outputting the calculated average as the epistemic uncertainty score.
9. The computer-implemented method of claim 8, further comprising:
- modifying the unlabeled data set by assigning a piece of data for human-assisted labeling responsive to the epistemic uncertainty score exceeding a specified threshold.
10. The computer-implemented method of claim 9, further comprising:
- retraining, based on the human-assisted labeling, the predictive model, wherein the retraining is performed with a remainder of the unlabeled data set.
11. The computer-implemented method of claim 3, wherein the second uncertainty score is an epistemic uncertainty score.
12. The computer-implemented method of claim 1, wherein training the predictive model is performed using Bayesian neural networks.
13. The computer-implemented method of claim 1, wherein:
- the initial data set is financial information associated with a decisioning process for a financial institution,
- the decisioning process is a fraud detection decision or a money laundering detection decision, and
- a fraud alert is issued in response to an output of the predictive model being above a predetermined threshold.
14. A non-transitory computer-readable medium having a set of computer-executable instructions embodied thereon that, when executed by a processor, cause the processor to perform a method of improving machine learning of a data set, the method comprising:
- training a predictive model on an initial data set comprising labeled data, wherein the training is performed in an active learning system;
- generating a set of parameters based on the training;
- introducing an unlabeled data set into the predictive model;
- applying the set of parameters to the unlabeled data set;
- generating a set of predictions associated with the applied set of parameters;
- calculating a first uncertainty score and a second uncertainty score associated with the generated set of predictions;
- modifying the data set based on the first uncertainty score; and
- modifying the predictive model based on the second uncertainty score.
15. The computer-readable medium of claim 14, wherein generating the set of predictions further comprises:
- generating a probabilistic distribution of possible values for each parameter of the applied set of parameters;
- selecting a value for each parameter of the applied set of parameters, the value being a highest value within the probabilistic distribution; and
- assigning the selected value to be a predicted value for each parameter of the applied set of parameters.
16. The computer-readable medium of claim 15, wherein the first uncertainty score is an aleatoric uncertainty score.
17. The computer-readable medium of claim 16, wherein calculating the aleatoric uncertainty score comprises:
- calculating an average prediction value of an overall output of the predictive model, the overall output including a prediction for each parameter of the set of parameters;
- generating an entropy value based on the average value; and
- outputting the entropy value as the aleatoric uncertainty score.
18. The computer-readable medium of claim 15, wherein the second uncertainty score is an epistemic uncertainty score.
19. The computer-readable medium of claim 18, wherein calculating the epistemic uncertainty score comprises:
- calculating an entropy value associated with each predicted value of the applied set of parameters;
- calculating an average of the calculated entropy values; and
- outputting the calculated average as the epistemic uncertainty score.
20. A system comprising:
- a memory configured to store operations; and
- one or more processing devices configured to process the operations, the operations comprising:
- training a predictive model on an initial data set comprising labeled data, wherein the training is performed in an active learning system;
- generating a set of parameters based on the training;
- introducing an unlabeled data set into the predictive model;
- applying the set of parameters to the unlabeled data set;
- generating a set of predictions associated with the applied set of parameters;
- calculating a first uncertainty score and a second uncertainty score associated with the generated set of predictions;
- modifying the data set based on the first uncertainty score; and
- modifying the predictive model based on the second uncertainty score.
Type: Application
Filed: Sep 3, 2020
Publication Date: Mar 3, 2022
Patent Grant number: 11790369
Applicant: Capital One Services, LLC (McLean, VA)
Inventors: Jason Wittenbach (Falls Church, VA), James O.H. Montgomery (McLean, VA), Christopher Bayan Bruss (Washington, DC)
Application Number: 17/011,839