Predictive Modeling from Distributed Datasets

- Microsoft

Techniques for using data sets for a predictive model are described. According to various implementations, techniques described herein enable different data sets to be used to generate a predictive model, while minimizing the risk that individual data points of the data sets will be exposed by the predictive model. This aids in protecting individual privacy (e.g., protecting personally identifying information for individuals), while enabling robust predictive models to be generated using data sets from a variety of different sources.

Description
RELATED APPLICATION

This application claims priority to U.S. provisional application No. 62/472,962, filed on 17 Mar. 2017 and titled “Predictive Modeling,” the disclosure of which is incorporated by reference in its entirety herein.

BACKGROUND

Today's era of “big data” includes different data systems with access to tremendous amounts of data of a variety of different types, such as consumer data, educational data, medical data, social networking data, and so forth. This data can be processed in various ways and utilized for different useful purposes. Educational data, for instance, can be analyzed to identify different trends and outcomes in educational processes to optimize those processes. Medical data can be analyzed to identify predictive indicators of different medical conditions. Protecting privacy of individuals associated with data, however, is of paramount importance.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Techniques for using data sets for a predictive model are described. According to various implementations, techniques described herein enable different data sets to be used to generate a predictive model, while minimizing the risk that individual data points of the data sets will be exposed by the predictive model or by the process of generating it. This aids in protecting individual privacy (e.g., protecting personally identifying information for individuals), while enabling robust predictive models to be generated using data sets from a variety of different sources.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different instances in the description and the figures may indicate similar or identical items. Identical numerals followed by different letters in a reference number may refer to different instances of a particular item.

FIG. 1 is an illustration of an environment in an example implementation that is operable to employ techniques discussed herein.

FIG. 2 depicts an example implementation scenario for a high level overview of predictive model training in accordance with one or more implementations.

FIG. 3 depicts an example implementation scenario for predictive model training using distributed hosts in accordance with one or more implementations.

FIG. 4 is a flow diagram that describes steps in a method for enabling a predictive model to be generated in accordance with one or more implementations.

FIG. 5 is a flow diagram that describes steps in a method for generating a predictive model in accordance with one or more implementations.

FIG. 6 is a flow diagram that describes steps in a method for enabling a predictive model to be generated using multiple hosts in accordance with one or more implementations.

FIG. 7 is a flow diagram that describes steps in a method for enabling a predictive model to be generated using multiple hosts in accordance with one or more implementations.

FIG. 8 is a flow diagram that describes steps in a method for utilizing a predictive model in accordance with one or more implementations.

FIG. 9 illustrates an example system and computing device as described with reference to FIG. 1, which are configured to implement implementations of techniques described herein.

DETAILED DESCRIPTION

Techniques for using data sets for a predictive model are described. Generally, a predictive model represents a collection of evaluable conditions to which a data set can be applied to determine a possible, predicted outcome. In at least one implementation, a predictive model is a neural network.

According to various implementations, techniques described herein enable different data sets to be used to generate a predictive model, while minimizing the risk that individual data points of the data sets will be exposed by the predictive model. This aids in protecting individual privacy (e.g., protecting personally identifying information for individuals), while enabling robust predictive models to be generated using data sets from a variety of different sources.

In example implementations, different data sources with different data sets use their respective data sets as training sets to train a data model. As part of the training, the data sources obtain gradient values and submit the gradient values to an external system that processes the gradient values to determine optimal ways for training the data model to generate a predictive model, e.g., a trained neural network. The external system, for example, determines average gradient values based on a collection of gradient values from different data sources. Further, the external system adds noise to the average gradient values to avoid directly or inferentially exposing information about individual data points of the local data sets. The noisy gradient values are used to further train the data model and generate a trained predictive model.

According to various implementations, data sets used to generate a predictive model can be very large. Thus, techniques described herein enable local data sources that maintain the data sets to perform various local computations on their large data sets to generate gradient values. The gradient values can then be communicated to an external system that uses the gradient values to calculate optimum gradient values and add noise to the optimum gradient values for generating a predictive model that protects individual data points from exposure outside their respective data sets.

Thus, techniques described herein protect individual and group privacy by reducing the likelihood that individual records of a data set will be exposed when generating a predictive model using the data set. Further, computational and network resources are conserved by enabling local data sources to perform computations of gradients based on their own respective data sets, and enabling an external system to use the gradients to generate a predictive model based on the different data sets. The external system, for example, need not process entire large data sets, but can perform various calculations described herein using smaller data sets that summarize the larger data sets.

In the following discussion, an example environment is first described that is operable to employ techniques described herein. Next, some example implementation scenarios are described in accordance with one or more implementations. Following this, some example procedures are described in accordance with one or more implementations. Finally, an example system and device are described that are operable to employ techniques discussed herein in accordance with one or more implementations. Consider now an example environment in which example implementations may be employed.

FIG. 1 is an illustration of an environment 100 in an example implementation that is operable to employ techniques for using data sets for a predictive model described herein. Generally, the environment 100 includes various devices, services, and networks that enable data communication via a variety of different modalities. For instance, the environment 100 includes source systems 102 and a host system 104 connected to a network 106. Generally, the source systems 102 represent different data sources that can provide data for generating predictive models. The source systems 102 include various instances of information systems that collect and aggregate different types of data, such as medical information (e.g., patient records, medical statistics, and so forth) from medical institutions, education information from educational institutions, consumer information from enterprise entities, government information from governmental entities, social networking information regarding users of different social networking platforms, and so on. The source systems 102 may be implemented in various ways, such as servers, server systems, distributed computing systems (e.g., cloud servers), corpnets, and so on. Examples of different implementations of the source systems 102 are described below with reference to the example system 900.

The source systems 102 include data sets 108 and local computation modules (“local modules”) 110. The data sets 108 represent sets of different types of data, examples of which are described above. Generally, each of the source systems 102 aggregates and maintains its own respective data set 108. The local modules 110 represent functionality for performing different sets of computations on the data sets 108 as well as other types of data. As further detailed herein, some forms of computation can be performed locally by the local modules 110, while others can be performed at the host system 104.

The host system 104 is representative of functionality to perform various computations outside of the context of the source systems 102. For instance, the host system 104 can receive data from the source systems 102, and can perform different calculations using the data. Accordingly, the host system 104 includes a multiparty computation module (“multiparty module”) 112, which in turn includes a privacy module 114. In accordance with implementations for using data sets for a predictive model described herein, the multiparty module 112 represents functionality for performing various calculations on data received from the source systems 102 to generate predictive models 116. Generally, the predictive models 116 represent statistical models that are generated based on attributes of the data sets 108 and that can be used to predict various outcomes dependent on input data values. In at least one implementation, the predictive models 116 represent different instances of a neural network.

As further detailed below, cooperation between the source systems 102 and the host system 104 enables various attributes of the different data sets 108 to be used to generate the predictive models 116, while protecting the raw data from an individual data set 108 from being exposed (e.g., directly or inferred) across the different source systems 102. This enables multiple data sets 108 to be used to generate an individual predictive model 116 thus increasing a robustness and accuracy of the individual predictive model 116, while protecting a data set 108 from one source system 102 from being exposed to a different source system 102.

The network 106 is representative of a network that provides the source systems 102 and the host system 104 with connectivity to various networks and/or services, such as the Internet. The network 106 may be implemented via a variety of different connectivity technologies, such as broadband cable, digital subscriber line (DSL), wireless cellular, wireless data connectivity (e.g., WiFi™), T-carrier (e.g., T1), Ethernet, and so forth. In at least some implementations, the network 106 represents different interconnected wired and wireless networks.

While the source systems 102 and the host system 104 are depicted as being remote from one another, it is to be appreciated that in one or more implementations, one or more of the source systems 102 and the host system 104 may be implemented as part of a single, multifunctional system to perform various aspects of using data sets for a predictive model described herein. For instance, in some implementations, the host system 104 can be implemented as a secure hardware environment that is local to a particular source system 102, but that is protected from tampering by functionalities outside of the secure hardware environment.

Having described an example environment in which the techniques described herein may operate, consider now a discussion of some example implementation scenarios for using data sets for a predictive model in accordance with one or more implementations. The implementation scenarios may be implemented in the environment 100 discussed above, the system 900 described below, and/or any other suitable environment.

FIG. 2 depicts an example implementation scenario 200 which represents a high level overview of predictive model training in accordance with one or more implementations. The scenario 200 includes various entities and components introduced above with reference to the environment 100.

In the scenario 200, the host system 104 distributes an initial model 202 separately to a source system 102a and a source system 102b. Generally, the initial model 202 represents a starting data model that is subsequently trained according to techniques described herein to generate a predictive model. The source systems 102a, 102b represent separate sources of a data set 108a and a data set 108b, respectively. In at least one implementation, the data sets 108a, 108b represent different respective sets of data of a same type. For instance, the data sets 108a, 108b can include medical data, education data, enterprise data, and so forth.

Continuing with the scenario 200, local modules 110a, 110b of the source systems 102a, 102b each perform training operations 204a, 204b on their respective instances of the initial model 202 and using their respective data sets 108a, 108b to generate respective gradient values 206a, 206b. Generally, the training operations 204a, 204b can be performed in a variety of different ways for training a neural network. In this particular example, the training operations 204a, 204b represent a backpropagation technique that is applied to the initial models 202 using mini-batches 208a, 208b of the respective data sets 108a, 108b. Consider, for example, that the data sets 108a, 108b represent collections of data records, such as patient records from medical data. Accordingly, the mini-batches 208a, 208b represent subsets of the collections of data records. As further detailed below, generating a trained data model can be implemented as an iterative process with each iteration using a different mini-batch 208a, 208b of the data sets 108a, 108b.

The gradient values 206a, 206b generally represent respective gradients of a loss function utilized as part of the training operations 204a, 204b. Proceeding with the scenario 200, the source systems 102a, 102b communicate their respective gradient values 206a, 206b to the multiparty module 112, which processes the gradients 206a, 206b to generate an average gradient 210. An averaging function, for instance, is applied to the gradients 206a, 206b to generate the average gradient 210. The privacy module 114 then processes the average gradient 210 to generate a noisy gradient 212. For example, the privacy module 114 adds noise to the average gradient 210 to generate the noisy gradient 212. Generally, adding noise to the average gradient 210 reduces a likelihood that actual data values from the data sets 108a, 108b can be discovered or inferred from the noisy gradient 212.

The multiparty module 112 then communicates the noisy gradient 212 separately to the source systems 102a, 102b. The local modules 110a, 110b on the source systems 102a, 102b utilize the noisy gradient 212 to perform a training iteration on the initial model 202 to generate an updated model 214. According to various implementations, this process is repeated (e.g., for each of the mini-batches 208a, 208b) until all of the data sets 108a, 108b have been evaluated to generate a predictive model 116. Generally, the predictive model 116 represents an optimized version of the initial model 202 that can be evaluated with a set of input data to generate a predicted outcome value or set of values. The predictive model 116 may be generated at the host system 104, and/or individually at the source systems 102a, 102b.
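As a minimal illustrative sketch of one iteration of the scenario 200, the following Python code averages gradients from several sources, adds noise, and applies the result to the model weights. The function names, the gradient-descent update rule, the learning rate, and the toy loss used by the sources are assumptions introduced for illustration and are not taken from the implementations described herein.

```python
def average_gradient(gradients):
    """Host side: element-wise average of equally sized gradient vectors."""
    count = len(gradients)
    return [sum(parts) / count for parts in zip(*gradients)]


def add_noise(gradient, sample_noise):
    """Host side: add an independently drawn noise value to each component."""
    return [value + sample_noise() for value in gradient]


def apply_update(weights, gradient, learning_rate):
    """Source side: one gradient-descent step (this update rule is an assumption)."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


def training_round(weights, local_gradient_fns, sample_noise, learning_rate):
    """One iteration: local gradients, noisy average at the host, model update."""
    gradients = [fn(weights) for fn in local_gradient_fns]
    noisy = add_noise(average_gradient(gradients), sample_noise)
    return apply_update(weights, noisy, learning_rate)


def make_source(target):
    """Hypothetical toy source with loss 0.5 * (w - target)^2, gradient w - target."""
    return lambda weights: [w - target for w in weights]


sources = [make_source(2.0), make_source(4.0)]
# Noise is disabled here (always 0.0) so that the arithmetic is easy to follow.
updated = training_round([0.0], sources, lambda: 0.0, 0.5)
print(updated)  # [1.5]
```

In this sketch the noise source is passed in as a callable, so any distribution can be substituted without changing the averaging or update steps.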

FIG. 3 depicts an example implementation scenario 300 which represents predictive model training using distributed hosts in accordance with one or more implementations. The scenario 300, for instance, represents a variation on the scenario 200 described above.

The scenario 300 includes a host system 104a and a host system 104b, which represent different instances of the host system 104 introduced above. Generally, the host systems 104a, 104b represent individual autonomous systems that are able to communicate with one another to perform various aspects of techniques described herein, but that are also able to protect certain data from being accessible across the host systems 104a, 104b.

Similarly to the scenario 200, the source systems 102a, 102b start with an initial model 302 and calculate respective gradients 304a, 304b based on their respective data sets 108a, 108b. As mentioned above, the gradients 304a, 304b can be calculated using a backpropagation technique that is applied to respective instances of the initial model 302 using the mini-batches 208a, 208b of the respective data sets 108a, 108b.

In the scenario 300, however, the source systems 102a, 102b use a secret sharing technique to further enhance the security and privacy aspects of techniques for using data sets for a predictive model described herein. Accordingly, the source system 102a calculates a perturbation value 306a and generates a perturbed gradient 308a which represents the gradient 304a+perturbation value 306a. In at least one implementation, the perturbation value 306a represents a random vector with the same dimensions as the gradient 304a. The source system 102a then communicates the perturbed gradient 308a to the host system 104a, and the perturbation value 306a to the host system 104b.

Similarly, the source system 102b calculates a perturbation value 306b and generates a perturbed gradient 308b which represents the gradient 304b+perturbation value 306b. The source system 102b then communicates the perturbed gradient 308b to the host system 104a, and the perturbation value 306b to the host system 104b.

Continuing with the scenario 300, the host systems 104a, 104b sum the values that they've received from the respective source systems 102a, 102b. The host system 104a, for instance, sums the perturbed gradients 308a, 308b to generate a gradient sum 310. The host system 104a then adds noise to the gradient sum 310 to generate a noisy gradient sum 312.

Further, the host system 104b sums the perturbation values 306a, 306b to generate a perturbation sum 314. The host system 104b then adds noise to the perturbation sum 314 to generate a noisy perturbation sum 316.

The host systems 104a, 104b then engage in a cooperative protocol 318 using the noisy gradient sum 312 and the noisy perturbation sum 316 to generate a noisy gradient 320. The cooperative protocol 318, for instance, represents a secure computation procedure performed between the host systems 104a, 104b. In one example implementation, the cooperative protocol 318 is implemented as a garbled circuit protocol using the noisy gradient sum 312 and the noisy perturbation sum 316 as inputs to generate the noisy gradient 320. Generally, the noisy gradient 320 represents an average of the perturbed gradients 308a, 308b with noise added to the data.

Accordingly, the noisy gradient 320 is communicated to the source systems 102a, 102b, which use the noisy gradient 320 to update the initial model 302 to generate an updated model 322. Generally, this process is performed iteratively until a termination criterion is reached, such as when all of the mini-batches 208a, 208b have been evaluated, to obtain the predictive model 116. Thus, the scenario 300 illustrates that distributed calculations can be utilized to further enhance security of techniques for using data sets for a predictive model described herein.
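The secret sharing idea in the scenario 300 can be sketched as follows. This is a simplified illustration only: the perturbations are drawn from a uniform distribution (an assumption), the noise added by each host is omitted, and the final subtraction is computed in the clear rather than inside a secure computation procedure such as a garbled circuit protocol. All function and variable names are hypothetical.

```python
import random


def perturb(gradient, rng, spread):
    """Source side: return (perturbed gradient, perturbation) for one gradient."""
    perturbation = [rng.uniform(-spread, spread) for _ in gradient]
    perturbed = [g + r for g, r in zip(gradient, perturbation)]
    return perturbed, perturbation


def vector_sum(vectors):
    """Host side: element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]


rng = random.Random(0)
gradients = [[0.5, -1.0], [1.5, 2.0]]  # toy gradients held by two sources
shares = [perturb(g, rng, 100.0) for g in gradients]

# The first host sees only perturbed gradients; the second sees only perturbations.
gradient_sum = vector_sum([perturbed for perturbed, _ in shares])
perturbation_sum = vector_sum([perturbation for _, perturbation in shares])

# Subtracting the two sums cancels the perturbations, leaving the sum of the
# original gradients (up to floating-point rounding).
recovered = [a - b for a, b in zip(gradient_sum, perturbation_sum)]
```

Neither host alone holds an unperturbed gradient in this sketch; only the combination of both sums reveals the aggregate.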

Having discussed some example implementation scenarios, consider now a discussion of some example procedures in accordance with one or more implementations.

The following discussion describes some example procedures for using data sets for a predictive model in accordance with one or more implementations. The example procedures may be employed in the environment 100 of FIG. 1, the system 900 of FIG. 9, and/or any other suitable environment. The procedures, for instance, represent example procedures for performing the implementation scenarios described above. In at least some implementations, the steps described for the various procedures are implemented automatically and independent of user interaction.

FIG. 4 is a flow diagram that describes steps in a method in accordance with one or more implementations. The method describes an example procedure for enabling a predictive model to be generated in accordance with one or more implementations.

Step 400 calculates a gradient value based on a data set applied to an initial data model. In at least one implementation, a source system 102 calculates the gradient value using backpropagation with a data set 108 and an initial model as input. As described above, the data set 108 may be divided into mini-batches, and thus a particular gradient value can be calculated for a discrete mini-batch.

Step 402 communicates the gradient value to an external service. A source system 102, for instance, communicates the gradient value to the host system 104.

Step 404 receives an average gradient value from the external service. For example, a source system 102 receives the average gradient value from the host system 104. Generally, the average gradient value represents an average of multiple gradient values received from multiple different source systems 102 and based on multiple different data sets 108. Further, the average gradient value is a noisy gradient, i.e., a raw average gradient value to which a noise term has been added.

Step 406 applies the average gradient value to the initial data model. A local module 110 at a source system 102, for instance, applies the average gradient value to an initial model to generate an updated model. For example, the average gradient value is used to update one or more weight and bias values for the initial model 202.

Step 408 ascertains whether a termination criterion occurs. Generally, a termination criterion represents an event that indicates whether an iterative process of training the data model is to terminate. In at least one implementation, the termination criterion represents an indication that a set number of mini-batches 208 have been evaluated according to the process described above. In another example implementation, the termination criterion represents an indication that a specified number of iterations through the process have been performed. In another example implementation, the termination criterion represents an indication that the trained model did not significantly change for the last few iterations. In another example implementation, the termination criterion represents an indication that the accuracy of the model, as tested on some validation set, did not improve or even deteriorated, over the last few iterations.
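The example termination criteria above can be combined in a single check, sketched below. The parameter names, the use of a fixed window ("patience") for "the last few iterations," and the specific comparisons are assumptions chosen for illustration.

```python
def should_terminate(iteration, max_iterations, recent_update_sizes,
                     change_tolerance, validation_accuracies, patience):
    """Return True if any of the example termination criteria is met."""
    # Criterion: a specified number of iterations (or mini-batches) was processed.
    if iteration >= max_iterations:
        return True
    # Criterion: the model did not significantly change over the last few iterations.
    if len(recent_update_sizes) >= patience and all(
            size < change_tolerance for size in recent_update_sizes[-patience:]):
        return True
    # Criterion: validation accuracy did not improve over the last few iterations.
    if len(validation_accuracies) > patience:
        best_before = max(validation_accuracies[:-patience])
        if max(validation_accuracies[-patience:]) <= best_before:
            return True
    return False
```

Here `recent_update_sizes` is assumed to hold a per-iteration measure of how much the weights changed (for example, the norm of the applied update), and `validation_accuracies` holds one accuracy value per iteration.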

If the termination criterion does not occur (“No”), the process returns to step 400 where additional gradient values are calculated and used to update the data model.

If the termination criterion occurs (“Yes”), step 410 obtains a predictive model that represents a trained version of the initial data model. The predictive model, for instance, represents a neural network whose weights and biases have been trained according to techniques for using data sets for a predictive model described herein. In at least one implementation, the predictive model can be generated locally at a source system 102 using noisy gradient values obtained from the host system 104. Alternatively or additionally, the predictive model can be received from the host system 104. Generally, the predictive model can be used for various purposes, such as predicting an outcome based on a set of input values.

FIG. 5 is a flow diagram that describes steps in a method in accordance with one or more implementations. The method describes an example procedure for generating a predictive model in accordance with one or more implementations.

Step 500 receives multiple gradient values from multiple different source systems. The host system 104, for instance, receives gradient values from multiple different source systems 102.

Step 502 generates an average gradient value from the multiple gradient values. Each of the multiple gradient values, for instance, is a different value, e.g., a different gradient of a loss function calculated at a respective source system 102. Thus, the host system 104 averages the different gradient values to obtain an average gradient value.

Step 504 adds a noise term to the average gradient value to generate a noisy gradient average. In at least one implementation, the noise term is added as random noise added to the average gradient value, such as a Laplace-distributed random number added to the average gradient value. In at least one implementation, the noisy gradient average can be calculated via interaction between multiple hosts, such as discussed with reference to the scenario 300. For instance, the noisy gradient average can be calculated via a garbled circuits protocol performed between the host systems 104a, 104b.
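As one possible sketch of step 504, Laplace-distributed noise can be drawn using only the Python standard library by taking the difference of two independent exponential samples, which follows a zero-mean Laplace distribution with the same scale. The function names and the choice of drawing one independent sample per gradient component are assumptions for illustration.

```python
import random


def sample_laplace(rng, scale):
    """Draw one zero-mean Laplace sample with the given scale parameter.

    The difference of two independent exponential samples with mean `scale`
    is Laplace-distributed with scale `scale`.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_gradient_average(average_gradient, scale, rng):
    """Add an independent Laplace noise term to each component of the average."""
    return [value + sample_laplace(rng, scale) for value in average_gradient]


rng = random.Random(42)
noisy = noisy_gradient_average([0.25, -1.0, 3.5], scale=0.1, rng=rng)
```

A larger scale yields more noise and therefore stronger protection of individual data points at the cost of a less accurate gradient.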

Step 506 communicates the noisy gradient average to the multiple different source systems. The host system 104, for instance, communicates the noisy gradient average to the multiple different source systems 102.

Step 508 ascertains whether a termination criterion occurs. Different examples of a termination criterion are discussed above. If the termination criterion does not occur (“No”), the process returns to step 500. For instance, further gradient values are received and are averaged to generate further noisy gradient averages, which are communicated back to the source systems 102. This process can be performed iteratively to enable the source systems 102 to iteratively train their respective data models.

If the termination criterion occurs (“Yes”), step 510 obtains a predictive model trained using the noisy gradient average. In at least one implementation, the predictive model can be generated locally at the host system 104, and/or locally at the individual source systems 102.

FIG. 6 is a flow diagram that describes steps in a method in accordance with one or more implementations. The method describes an example procedure for enabling a predictive model to be generated using multiple hosts in accordance with one or more implementations.

Step 600 calculates a gradient value based on a data set applied to a data model. In at least one implementation, a source system 102 calculates the gradient value using backpropagation with a data set 108 and the initial model 202 as input.

In at least one implementation, the gradient value is calculated as:


$$g_i = \sum_{z \in Z_i} \mathrm{Clip}\big(C, F'(w_t, z)\big) \qquad \text{(Equation 1)}$$

where $Z_i$ is the data set used in the i-th mini-batch, $C$ is a bound on a size of the gradient, $F$ is the function of the data model to be optimized, $w_t$ is the current weight vector, and $z$ is an example from the current mini-batch. Clip can be calculated as:

$$\mathrm{Clip}(C, x) = \min\left(1, \frac{C}{\lVert x \rVert}\right) x \qquad \text{(Equation 2)}$$

where x is the vector being calculated for the gradient value.
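The clipping operation can be sketched in Python as follows, assuming the Euclidean norm and assuming that the per-example clipped gradients are summed over the mini-batch. The function names and the handling of a zero vector are illustrative assumptions.

```python
import math


def clip(bound, vector):
    """Scale `vector` so that its Euclidean norm does not exceed `bound`."""
    norm = math.sqrt(sum(component * component for component in vector))
    if norm == 0.0:
        return list(vector)  # zero vector: nothing to scale, avoids division by zero
    factor = min(1.0, bound / norm)
    return [factor * component for component in vector]


def minibatch_gradient(bound, per_example_gradients):
    """Sum the clipped per-example gradients of one mini-batch."""
    clipped = [clip(bound, gradient) for gradient in per_example_gradients]
    return [sum(parts) for parts in zip(*clipped)]


# The first gradient has norm 5 and is scaled down to norm 1; the second has
# norm 0.5, which is within the bound, so it is left unchanged.
total = minibatch_gradient(1.0, [[3.0, 4.0], [0.3, 0.4]])
```

Because each clipped gradient has norm at most the bound, no single example can contribute more than that bound to the mini-batch gradient.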

Step 602 generates a perturbed gradient value based on the gradient value and a perturbation value. A source system 102, for instance, generates a perturbation value, and adds the perturbation value to the original gradient value to generate the perturbed gradient value.

In one example, the perturbation value $r_i$ is generated as:

$$r_i \leftarrow \mathrm{Laplace}(b) \qquad \text{(Equation 3)}$$

which represents a random vector with the same dimension as $g_i$ sampled from the Laplace distribution.

Accordingly, the perturbed gradient value can be generated as $g_i + r_i$.

Step 604 communicates the perturbed gradient value to a first host system. The source system 102, for instance, communicates the perturbed gradient value to a first host system 104.

Step 606 communicates the perturbation value to a second host system. For example, the source system 102 communicates the perturbation value to a second host system 104. In at least one implementation, the first host system 104 and the second host system 104 represent host systems that are physically and/or communicatively remote from one another and that are protected from mutual access. Alternatively, the first host system 104 and the second host system 104 represent protected portions of a single larger system, such as different trusted platform modules (TPM) that reside on a single server and/or other computing device.

Step 608 receives an average gradient value from one or more of the first host system or the second host system. Generally, the average gradient value represents a perturbed average gradient value and is based on calculations performed at the different host systems using the perturbed gradient value and the perturbation value, as well as other perturbed gradient values and perturbation values from other source systems.

Step 610 applies the average gradient value to the data model. For instance, a weight value and/or a bias value from the average gradient value are applied to update (e.g., train) the data model.

Step 612 ascertains whether a termination criterion occurs. Different examples of termination criteria are discussed above. If the termination criterion does not occur (“No”), the process returns to step 600. For instance, the source system 102 determines a further gradient value based on the updated data model, and the process proceeds as indicated above using the further gradient value.

If the termination criterion occurs (“Yes”), step 614 obtains a predictive model that represents a trained version of the data model. The predictive model, for instance, is generated locally at the source system 102 and based on different gradient values received from the host systems 104. Alternatively or additionally, the predictive model is communicated to the source system 102 from one or more of the host systems 104.

FIG. 7 is a flow diagram that describes steps in a method in accordance with one or more implementations. The method describes an example procedure for enabling a predictive model to be generated using multiple hosts in accordance with one or more implementations. In this particular example, portions of the method are divided into actions at a first host system and actions at a second host system.

Step 700 receives perturbed gradients representing gradient values summed with perturbation values from multiple different source systems. The host system 104a, for instance, receives the perturbed gradients from different source systems 102.

Step 702 sums the perturbed gradients to generate a gradient sum. For example, the host system 104a sums a set of perturbed gradients to generate a gradient sum. The gradient sum $\tilde{g}_1$, for instance, is generated as:

$$\tilde{g}_1 = \sum_i (g_i + r_i) \ \operatorname{smod}\ mC \qquad \text{(Equation 4)}$$

In at least one implementation, smod is a symmetric modulo operation, such as calculated as:

$$x \ \operatorname{smod}\ C = \big((x + C) \bmod 2C\big) - C \qquad \text{(Equation 5)}$$
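A short Python sketch of the symmetric modulo operation follows. It relies on the fact that Python's `%` operator returns a non-negative result for a positive modulus, so the output always falls in the interval from −C (inclusive) to C (exclusive). The function name `smod` mirrors the text; the sample inputs are arbitrary.

```python
def smod(x, c):
    """Symmetric modulo: map x into the interval [-c, c)."""
    return ((x + c) % (2 * c)) - c


# Values already inside [-4, 4) are unchanged; values outside wrap around.
values = [smod(x, 4) for x in (3, 4, 5, -5)]
print(values)  # [3, -4, -3, 3]
```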

Step 704 calculates a first seed for a random number generator. The host system 104a, for example, calculates a seed value s1.

Step 706 receives perturbation values from the multiple different source systems. For instance, the host system 104b receives perturbation values that were used to generate the perturbed gradients from multiple different source systems 102.

Step 708 sums the perturbation values to generate a perturbation sum. The host system 104b, for example, sums the perturbation values as:


$\tilde{g}_2 = \sum_i r_i \;\mathrm{smod}\; C$  Equation 6:

Step 710 calculates a second seed for the random number generator. The host system 104b, for example, calculates a seed value s2.

Step 712 implements a secure computation protocol using the gradient sum, the perturbation sum, the first seed, and the second seed to generate a noisy average of the gradient values. The host systems 104a, 104b, for example, interact to perform a secure computation protocol using these different sets of values. In at least one implementation, the host systems 104a, 104b participate in a garbled circuits protocol to compute the noisy average as:


$((\tilde{g}_1 - \tilde{g}_2) \;\mathrm{smod}\; C) + \mathrm{Rand}_{s_1 \oplus s_2}(b)$,  Equation 7:

where b is an arbitrarily defined random noise parameter that is a function of the required privacy.
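The arithmetic of steps 702 through 712 can be sketched as follows. This is an illustration under stated assumptions, not the secure protocol itself: the combination of Equation 7 is computed in the clear rather than inside a garbled circuit, gradients are treated as fixed-point integers, the modulus `C` and the helper names (`perturb`, `rand_noise`, `noisy_result`) are hypothetical, and the random term seeded with the XOR of the two seeds is assumed to be Laplace noise with scale b.

```python
import random
import secrets

C = 2**31  # hypothetical bound; the true gradient sum must lie in [-C, C)

def smod(x, c):
    """Symmetric modulo of Equation 5: map x into [-c, c)."""
    return ((x + c) % (2 * c)) - c

def perturb(g, c):
    """Source side: split g into a perturbed gradient and its perturbation."""
    r = secrets.randbelow(2 * c) - c       # uniform in [-c, c)
    return smod(g + r, c), r

def rand_noise(seed, b):
    """Assumed form of the seeded random term: Laplace noise with scale b."""
    rng = random.Random(seed)
    # The difference of two exponentials with mean b is Laplace(0, b).
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def noisy_result(g1, g2, s1, s2, b, c):
    """Equation 7, evaluated in the clear (no garbled circuit)."""
    return smod(g1 - g2, c) + rand_noise(s1 ^ s2, b)

gradients = [1200, -340, 75]               # fixed-point encoded gradient values
shares = [perturb(g, C) for g in gradients]
g1 = smod(sum(p for p, _ in shares), C)    # Equation 4, at host system 104a
g2 = smod(sum(r for _, r in shares), C)    # Equation 6, at host system 104b
s1, s2 = secrets.randbits(64), secrets.randbits(64)
result = noisy_result(g1, g2, s1, s2, b=2.0, c=C)
```

In this sketch, neither sum alone reveals the gradient values, because each perturbed gradient is masked by a uniformly random perturbation, while the difference of the two sums recovers the sum of the gradients as long as that sum lies within [-C, C).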

Step 714 communicates the noisy average to a source system to enable a predictive model to be trained using the noisy average. One or more of the host systems 104a, 104b, for instance, communicate the noisy gradient 320 to the source systems 102a, 102b. Generally, the noisy gradient 320 can be used as part of a training step to generate a trained predictive model 116, e.g., a trained neural network.

Step 716 ascertains whether a termination criterion occurs. Different examples of termination criteria are discussed above. If the termination criterion does not occur (“No”), the process returns to step 700. For instance, the host systems 104a, 104b receive further gradient values from the source systems 102a, 102b, and the process iterates until a termination criterion occurs.

If the termination criterion occurs (“Yes”), step 718 obtains a predictive model that represents a trained version of an initial data model. The predictive model, for instance, is generated locally at the source systems 102a, 102b and based on different noisy gradient values received from the host systems 104. Alternatively or additionally, the predictive model is communicated to the source systems 102a, 102b from one or more of the host systems 104a, 104b.

Generally, a predictive model generated according to techniques for using data sets for a predictive model described herein can be used for various purposes, such as predicting outcomes based on various input data sets and scenarios.

FIG. 8 is a flow diagram that describes steps in a method in accordance with one or more implementations. The method describes an example procedure for utilizing a predictive model in accordance with one or more implementations. The method, for instance, represents a continuation of one or more of the procedures described above.

Step 800 applies a set of input data to a predictive model. A source system 102, for example, receives a set of data and uses the set of data to evaluate a predictive model generated according to techniques for using data sets for a predictive model described herein. In at least some implementations, the set of data includes data values that are evaluated using the predictive model.

Step 802 ascertains an output of the predictive model. For instance, the predictive model provides an output prediction value based on values of the input data.

Step 804 performs, by a computing device, an action based on the output of the predictive model. Generally, the action can take various forms, such as performing different computation tasks based on the output of the predictive model. For example, consider that the predictive model is configured to provide a prediction of health condition. If the output of the predictive model indicates a possible adverse health condition, the action can include performing an automatic scheduling of a health procedure and/or an automatic communication to an individual regarding the possible adverse health condition.

As another example, consider that the predictive model is configured to provide a prediction of a possible computer network malfunction. For instance, the predictive model can include various conditions and events that are indicative of a potential network failure. Accordingly, the action can include performing an automated maintenance and/or diagnostic procedure on the network to attempt to prevent and/or repair a network malfunction.
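Steps 800 through 804 can be sketched as follows. The linear stand-in model, the threshold, and the names `linear_model` and `evaluate_and_act` are hypothetical and chosen only to keep the example self-contained; a trained neural network could take the place of the stand-in model.

```python
from typing import Callable, List, Sequence

def linear_model(weights: Sequence[float], bias: float) -> Callable[[Sequence[float]], float]:
    """Stand-in for a trained predictive model (hypothetical)."""
    def predict(inputs: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(weights, inputs)) + bias
    return predict

def evaluate_and_act(model, inputs, threshold, action) -> float:
    output = model(inputs)      # steps 800 and 802: apply inputs, ascertain output
    if output >= threshold:     # step 804: perform an action based on the output
        action(output)
    return output

alerts: List[str] = []
model = linear_model([0.5, 1.0], bias=-1.0)
score = evaluate_and_act(
    model, [1.0, 2.0], threshold=1.0,
    action=lambda y: alerts.append(f"schedule diagnostic ({y:.2f})"),
)
```

Here the action is recorded in a list; in a deployment it might instead trigger the automated scheduling or maintenance procedures described above.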

These examples are presented for purpose of illustration only, and it is to be appreciated that predictive models generated and/or trained according to techniques for using data sets for a predictive model described herein can be used for a variety of different purposes not expressly discussed in this disclosure.

Thus, techniques for using data sets for a predictive model described herein provide ways for generating predictive models based on data sets from a variety of different sources, while protecting the data used to generate the predictive models from being exposed to unauthorized parties. Further, computational resources are conserved by enabling local data sources to perform averaging of data points from large data sets, while allowing a centralized service (e.g., a host system 104 or set of host systems 104) to generate predictive models using the locally averaged data points.

Having discussed some example procedures, consider now a discussion of an example system and device in accordance with one or more implementations.

FIG. 9 illustrates an example system generally at 900 that includes an example computing device 902 that is representative of one or more computing systems and/or devices that may implement various techniques described herein. For example, the source systems 102 and/or the host systems 104 discussed above with reference to FIG. 1 can be embodied as the computing device 902. The computing device 902 may be, for example, a server of a service provider, a device associated with the client (e.g., a client device), an on-chip system, and/or any other suitable computing device or computing system.

The example computing device 902 as illustrated includes a processing system 904, one or more computer-readable media 906, and one or more Input/Output (I/O) Interfaces 908 that are communicatively coupled, one to another. Although not shown, the computing device 902 may further include a system bus or other data and command transfer system that couples the various components, one to another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. A variety of other examples are also contemplated, such as control and data lines.

The processing system 904 is representative of functionality to perform one or more operations using hardware. Accordingly, the processing system 904 is illustrated as including hardware elements 910 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. The hardware elements 910 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, processors may be comprised of semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.

The computer-readable media 906 is illustrated as including memory/storage 912. The memory/storage 912 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 912 may include volatile media (such as random access memory (RAM)) and/or nonvolatile media (such as read only memory (ROM), Flash memory, optical disks, magnetic disks, and so forth). The memory/storage 912 may include fixed media (e.g., RAM, ROM, a fixed hard drive, and so on) as well as removable media (e.g., Flash memory, a removable hard drive, an optical disc, and so forth). The computer-readable media 906 may be configured in a variety of other ways as further described below.

Input/output interface(s) 908 are representative of functionality to allow a user to enter commands and information to computing device 902, and also allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice recognition and/or spoken input), a scanner, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to detect movement that does not involve touch as gestures), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, tactile-response device, and so forth. Thus, the computing device 902 may be configured in a variety of ways as further described below to support user interaction.

Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, such modules include routines, programs, objects, elements, components, data structures, and so forth that perform particular tasks or implement particular abstract data types. The terms “module,” “functionality,” “entity,” and “component” as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of commercial computing platforms having a variety of processors.

An implementation of the described modules and techniques may be stored on or transmitted across some form of computer-readable media. The computer-readable media may include a variety of media that may be accessed by the computing device 902. By way of example, and not limitation, computer-readable media may include “computer-readable storage media” and “computer-readable signal media.”

“Computer-readable storage media” may refer to media and/or devices that enable persistent storage of information in contrast to mere signal transmission, carrier waves, or signals per se. Computer-readable storage media do not include signals per se. The computer-readable storage media includes hardware such as volatile and non-volatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer readable instructions, data structures, program modules, logic elements/circuits, or other data. Examples of computer-readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage device, tangible media, or article of manufacture suitable to store the desired information and which may be accessed by a computer.

“Computer-readable signal media” may refer to a signal-bearing medium that is configured to transmit instructions to the hardware of the computing device 902, such as via a network. Signal media typically may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as carrier waves, data signals, or other transport mechanism. Signal media also include any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.

As previously described, hardware elements 910 and computer-readable media 906 are representative of instructions, modules, programmable device logic and/or fixed device logic implemented in a hardware form that may be employed in some implementations to implement at least some aspects of the techniques described herein. Hardware elements may include components of an integrated circuit or on-chip system, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), and other implementations in silicon or other hardware devices. In this context, a hardware element may operate as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element as well as a hardware device utilized to store instructions for execution, e.g., the computer-readable storage media described previously.

Combinations of the foregoing may also be employed to implement various techniques and modules described herein. Accordingly, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage media and/or by one or more hardware elements 910. The computing device 902 may be configured to implement particular instructions and/or functions corresponding to the software and/or hardware modules. Accordingly, implementation of modules that are executable by the computing device 902 as software may be achieved at least partially in hardware, e.g., through use of computer-readable storage media and/or hardware elements 910 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (for example, one or more computing devices 902 and/or processing systems 904) to implement techniques, modules, and examples described herein.

As further illustrated in FIG. 9, the example system 900 enables ubiquitous environments for a seamless user experience when running applications on a personal computer (PC), a television device, and/or a mobile device. Services and applications run substantially similarly in all three environments for a common user experience when transitioning from one device to the next while utilizing an application, playing a video game, watching a video, and so on.

In the example system 900, multiple devices are interconnected through a central computing device. The central computing device may be local to the multiple devices or may be located remotely from the multiple devices. In one embodiment, the central computing device may be a cloud of one or more server computers that are connected to the multiple devices through a network, the Internet, or other data communication link.

In one embodiment, this interconnection architecture enables functionality to be delivered across multiple devices to provide a common and seamless experience to a user of the multiple devices. Each of the multiple devices may have different physical requirements and capabilities, and the central computing device uses a platform to enable the delivery of an experience to the device that is both tailored to the device and yet common to all devices. In one embodiment, a class of target devices is created and experiences are tailored to the generic class of devices. A class of devices may be defined by physical features, types of usage, or other common characteristics of the devices.

In various implementations, the computing device 902 may assume a variety of different configurations, such as for computer 914, mobile 916, and television 918 uses. Each of these configurations includes devices that may have generally different constructs and capabilities, and thus the computing device 902 may be configured according to one or more of the different device classes. For instance, the computing device 902 may be implemented as the computer 914 class of a device that includes a personal computer, desktop computer, a multi-screen computer, laptop computer, netbook, and so on.

The computing device 902 may also be implemented as the mobile 916 class of device that includes mobile devices, such as a mobile phone, portable music player, portable gaming device, a tablet computer, a wearable device, a multi-screen computer, and so on. The computing device 902 may also be implemented as the television 918 class of device that includes devices having or connected to generally larger screens in casual viewing environments. These devices include televisions, set-top boxes, gaming consoles, and so on.

The techniques described herein may be supported by these various configurations of the computing device 902 and are not limited to the specific examples of the techniques described herein. For example, functionalities discussed with reference to the source systems 102 and/or the host systems 104 may be implemented all or in part through use of a distributed system, such as over a “cloud” 920 via a platform 922 as described below.

The cloud 920 includes and/or is representative of a platform 922 for resources 924. The platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 920. The resources 924 may include applications and/or data that can be utilized while computer processing is executed on servers that are remote from the computing device 902. Resources 924 can also include services provided over the Internet and/or through a subscriber network, such as a cellular or Wi-Fi network.

The platform 922 may abstract resources and functions to connect the computing device 902 with other computing devices. The platform 922 may also serve to abstract scaling of resources to provide a corresponding level of scale to encountered demand for the resources 924 that are implemented via the platform 922. Accordingly, in an interconnected device embodiment, implementation of functionality described herein may be distributed throughout the system 900. For example, the functionality may be implemented in part on the computing device 902 as well as via the platform 922 that abstracts the functionality of the cloud 920.

Discussed herein are a number of methods that may be implemented to perform techniques discussed herein. Aspects of the methods may be implemented in hardware, firmware, or software, or a combination thereof. The methods are shown as a set of steps that specify operations performed by one or more devices and are not necessarily limited to the orders shown for performing the operations by the respective blocks. Further, an operation shown with respect to a particular method may be combined and/or interchanged with an operation of a different method in accordance with one or more implementations. Aspects of the methods can be implemented via interaction between various entities discussed above with reference to the environment 100.

In the discussions herein, various different implementations are described. It is to be appreciated and understood that each implementation described herein can be used on its own or in connection with one or more other implementations described herein. Further aspects of the techniques discussed herein relate to one or more of the following implementations.

A system for obtaining a predictive model, the system including: at least one processor; and one or more computer-readable storage media including instructions stored thereon that, responsive to execution by the at least one processor, cause the system to perform operations including: calculating a gradient value based on a data set applied to a data model, the gradient value including a weight value calculated for the data model; communicating the gradient value to an external service; receiving an average gradient value from the external service; applying the average gradient value to the data model; and obtaining, based on ascertaining that a termination criterion occurs, a predictive model that represents a trained version of the data model.

In addition to any of the above described systems, any one or combination of: wherein said calculating includes using a backpropagation procedure to train the data model using the data set; wherein said calculating includes: dividing the data set into a set of mini-batches; and calculating the gradient value using a particular mini-batch of the set of mini-batches; wherein said calculating includes: dividing the data set into a set of mini-batches; and calculating the gradient value using a particular mini-batch of the set of mini-batches, wherein the termination criterion includes determining that each mini-batch of the set of mini-batches is evaluated to generate a respective gradient value; wherein said applying includes applying the average gradient value to update a weight value of the data model; wherein the predictive model includes a neural network trained using the average gradient value; wherein the operations further include: applying a set of input data to the predictive model; ascertaining an output of the predictive model; and performing an action based on the output of the predictive model.
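The source-side operations recited above (dividing a data set into mini-batches, calculating a gradient value per mini-batch, exchanging it for an average gradient value, and updating a weight value) can be sketched as follows. This is a simplified illustration under stated assumptions: the data model is a one-parameter model y = w * x with a squared-error loss whose gradient is written in closed form rather than obtained by backpropagation, and `exchange` is a hypothetical callable standing in for the round trip to the external service.

```python
def mini_batches(data, size):
    """Divide the data set into consecutive mini-batches."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def gradient(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, learning_rate, exchange):
    w = 0.0
    # Termination criterion: every mini-batch has produced a gradient value.
    for batch in mini_batches(data, batch_size):
        g = gradient(w, batch)         # calculate the local gradient value
        avg = exchange(g)              # send it out, receive the average back
        w -= learning_rate * avg       # apply the average to update the weight
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
# With a single source system, the average is simply the local gradient.
w = train(data, batch_size=2, learning_rate=0.05, exchange=lambda g: g)
```

With several source systems, `exchange` would return a value aggregated across all of them, so each source system updates its copy of the data model with the same average.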

A computer-implemented method for obtaining a predictive model, the method including: receiving multiple gradient values from multiple different source systems; generating an average gradient value from the multiple gradient values; adding a noise term to the average gradient value to generate a noisy gradient average; communicating the noisy gradient average to the multiple different source systems; and obtaining a predictive model trained using the noisy gradient average.

In addition to any of the above described methods, any one or combination of: wherein said adding the noise term includes adding a Laplace-distributed random number to the average gradient value to generate the noisy gradient average; wherein said adding the noise term includes performing a garbled circuits protocol using the average gradient value; wherein the predictive model includes a neural network trained using the noisy gradient average.
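The averaging and noise-addition operations recited above can be sketched for the single-host case as follows. The helper names, the per-coordinate treatment of gradient vectors, and the choice of noise scale are assumptions made for illustration; the description does not fix how the scale is derived from the required privacy level.

```python
import random

def laplace(rng, scale):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_gradient_average(gradients, scale, rng):
    """Average gradient vectors coordinate-wise, then add Laplace noise."""
    n = len(gradients)
    averages = [sum(column) / n for column in zip(*gradients)]
    return [a + laplace(rng, scale) for a in averages]

# Gradient vectors received from two hypothetical source systems.
received = [[1.0, 2.0], [3.0, 4.0]]
noisy = noisy_gradient_average(received, scale=0.5, rng=random.Random(7))
```

A fixed seed is used here only to make the sketch reproducible; a deployed host system would draw the noise from an unpredictable source.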

A computer-implemented method for obtaining a predictive model, the method including: calculating a gradient value based on a data set applied to a data model; generating a perturbed gradient value based on the gradient value and a perturbation value; communicating the perturbed gradient value to a first host system; communicating the perturbation value to a second host system; receiving an average gradient value from one or more of the first host system or the second host system, the average gradient value calculated based on the perturbed gradient value and the perturbation value; applying the average gradient value to the data model; and obtaining a predictive model that represents a trained version of the data model, the data model trained at least in part using the average gradient value.

In addition to any of the above described methods, any one or combination of: wherein said calculating includes applying backpropagation to the data model and using the data set to calculate the gradient value; wherein said calculating includes: dividing the data set into a set of mini-batches; and calculating the gradient value using a particular mini-batch of the set of mini-batches; wherein said generating the perturbed gradient value includes generating the perturbation value as a random vector, and adding the random vector to the gradient value to generate the perturbed gradient value; wherein said applying includes applying a weight value from the average gradient value to the data model; wherein said obtaining is performed in response to ascertaining that a termination criterion occurs; wherein the average gradient value is calculated using a garbled circuits protocol; wherein the predictive model includes a neural network trained using the average gradient value; further including: applying a set of input data to the predictive model; ascertaining an output of the predictive model; performing an action based on the output of the predictive model.

Techniques for using data sets for a predictive model are described. Although implementations are described in language specific to structural features and/or methodological acts, it is to be understood that the implementations defined in the appended claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed implementations.

Claims

1. A system comprising:

at least one processor; and
one or more computer-readable storage media including instructions stored thereon that, responsive to execution by the at least one processor, cause the system to perform operations including: calculating a gradient value based on a data set applied to a data model, the gradient value including a weight value calculated for the data model; communicating the gradient value to an external service; receiving an average gradient value from the external service; applying the average gradient value to the data model; and obtaining, based on ascertaining that a termination criterion occurs, a predictive model that represents a trained version of the data model.

2. A system as recited in claim 1, wherein said calculating comprises using a backpropagation procedure to train the data model using the data set.

3. A system as recited in claim 1, wherein said calculating comprises:

dividing the data set into a set of mini-batches; and
calculating the gradient value using a particular mini-batch of the set of mini-batches.

4. A system as recited in claim 1, wherein said calculating comprises:

dividing the data set into a set of mini-batches; and
calculating the gradient value using a particular mini-batch of the set of mini-batches, wherein the termination criterion comprises determining that each mini-batch of the set of mini-batches is evaluated to generate a respective gradient value.

5. A system as recited in claim 1, wherein said applying comprises applying the average gradient value to update a weight value of the data model.

6. A system as recited in claim 1, wherein the predictive model comprises a neural network trained using the average gradient value.

7. A system as recited in claim 1, wherein the operations further include:

applying a set of input data to the predictive model;
ascertaining an output of the predictive model; and
performing an action based on the output of the predictive model.

8. A computer-implemented method, comprising:

receiving multiple gradient values from multiple different source systems;
generating an average gradient value from the multiple gradient values;
adding a noise term to the average gradient value to generate a noisy gradient average;
communicating the noisy gradient average to the multiple different source systems; and
obtaining a predictive model trained using the noisy gradient average.

9. A method as described in claim 8, wherein said adding the noise term comprises adding a Laplace-distributed random number to the average gradient value to generate the noisy gradient average.

10. A method as described in claim 8, wherein said adding the noise term comprises performing a garbled circuits protocol using the average gradient value.

11. A method as described in claim 8, wherein the predictive model comprises a neural network trained using the noisy gradient average.

12. A computer-implemented method, comprising:

calculating a gradient value based on a data set applied to a data model;
generating a perturbed gradient value based on the gradient value and a perturbation value;
communicating the perturbed gradient value to a first host system;
communicating the perturbation value to a second host system;
receiving an average gradient value from one or more of the first host system or the second host system, the average gradient value calculated based on the perturbed gradient value and the perturbation value;
applying the average gradient value to the data model; and
obtaining a predictive model that represents a trained version of the data model, the data model trained at least in part using the average gradient value.

13. A method as described in claim 12, wherein said calculating comprises applying backpropagation to the data model and using the data set to calculate the gradient value.

14. A method as described in claim 12, wherein said calculating comprises:

dividing the data set into a set of mini-batches; and
calculating the gradient value using a particular mini-batch of the set of mini-batches.

15. A method as described in claim 12, wherein said generating the perturbed gradient value comprises generating the perturbation value as a random vector, and adding the random vector to the gradient value to generate the perturbed gradient value.

16. A method as described in claim 12, wherein said applying comprises applying a weight value from the average gradient value to the data model.

17. A method as described in claim 12, wherein said obtaining is performed in response to ascertaining that a termination criterion occurs.

18. A method as described in claim 12, wherein the average gradient value is calculated using a garbled circuits protocol.

19. A method as described in claim 12, wherein the predictive model comprises a neural network trained using the average gradient value.

20. A method as described in claim 12, further comprising:

applying a set of input data to the predictive model;
ascertaining an output of the predictive model;
performing an action based on the output of the predictive model.
Patent History
Publication number: 20180268283
Type: Application
Filed: Jun 30, 2017
Publication Date: Sep 20, 2018
Applicant: Microsoft Technology Licensing, LLC (Redmond, WA)
Inventors: Ran Gilad-Bachrach (Hogla), Kim Henry Martin Laine (Seattle, WA), Melissa E. Chase (Seattle, WA), Kristin Estella Lauter (Redmond, WA)
Application Number: 15/639,557
Classifications
International Classification: G06N 3/04 (20060101); G06N 3/08 (20060101); G06F 17/30 (20060101);