# ONLINE HYPERPARAMETER TUNING IN DISTRIBUTED MACHINE LEARNING

The disclosed embodiments provide a system for performing online hyperparameter tuning in distributed machine learning. During operation, the system uses input data for a first set of versions of a statistical model for a set of entities to calculate a batch of performance metrics for the first set of versions. Next, the system applies an optimization technique to the batch to produce updates to a set of hyperparameters for the statistical model. The system then uses the updates to modulate the execution of a second set of versions of the statistical model for the set of entities. When a new entity is added to the set of entities, the system updates the set of hyperparameters with a new dimension for the new entity.

**Description**

**RELATED APPLICATION**

The subject matter of this application is related to the subject matter in a co-pending non-provisional application by inventors Xu Miao, Yitong Zhou, Joel D. Young, Lijun Tang and Anmol Bhasin, entitled “Version Control for Asynchronous Distributed Machine Learning,” having Ser. No. 14/864,474 and filing date 24 Sep. 2015 (Attorney Docket No. LI-P1583.LNK.US).

**BACKGROUND**

**Field**

The disclosed embodiments relate to distributed machine learning. More specifically, the disclosed embodiments relate to techniques for performing online hyperparameter tuning in distributed machine learning.

**Related Art**

Analytics may be used to discover trends, patterns, relationships, and/or other attributes related to large sets of complex, interconnected, and/or multidimensional data. In turn, the discovered information may be used to gain insights and/or guide decisions and/or actions related to the data. For example, business analytics may be used to assess past performance, guide business planning, and/or identify actions that may improve future performance.

However, significant increases in the size of data sets have resulted in difficulties associated with collecting, storing, managing, transferring, sharing, analyzing, and/or visualizing the data in a timely manner. For example, conventional software tools and/or storage mechanisms may be unable to handle the petabytes or exabytes of loosely structured data that is generated on a daily and/or continuous basis from multiple, heterogeneous sources. Instead, management and processing of “big data” may require massively parallel software running on a large number of physical servers and/or nodes, as well as synchronization among the servers and/or nodes.

Consequently, big data analytics may be facilitated by mechanisms for efficiently and/or effectively collecting, storing, managing, compressing, transferring, sharing, analyzing, and/or visualizing large data sets.

**BRIEF DESCRIPTION OF THE FIGURES**

In the figures, like reference numerals refer to the same figure elements.

**DETAILED DESCRIPTION**

The following description is presented to enable any person skilled in the art to make and use the embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing code and/or data now known or later developed.

The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.

Furthermore, methods and processes described herein can be included in hardware modules or apparatus. These modules or apparatus may include, but are not limited to, an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.

The disclosed embodiments provide a method, apparatus, and system for processing data. More specifically, the disclosed embodiments provide a method, apparatus, and system for performing asynchronous distributed machine learning. The system includes a server **102** and a number of trainers (e.g., trainer **1** **104**, trainer y **106**), which interact with one another to produce multiple versions of a statistical model **108**. Each of these components is described in further detail below.

Statistical model **108** may be used to perform statistical inference, estimation, classification, clustering, personalization, recommendation, optimization, hypothesis testing, and/or other types of data analysis. For example, statistical model **108** may be a regression model, artificial neural network, support vector machine, decision tree, naïve Bayes classifier, Bayesian network, random forest, gradient boosted tree, hierarchical model, and/or ensemble model. The results of such analysis may be used to discover relationships, patterns, and/or trends in the data; gain insights from the input data; and/or guide decisions or actions related to the data. For example, statistical model **108** may be used to analyze input data related to users, organizations, applications, websites, content, and/or other categories. Statistical model **108** may then be used to output scores, provide recommendations, make predictions, manage relationships, and/or personalize user experiences based on the data.

In addition, statistical model **108** may be trained and/or adapted to new data received on the trainers. For example, the trainers may execute on electronic devices (e.g., personal computers, laptop computers, mobile phones, tablet computers, portable media players, digital cameras, etc.) that produce updates **114**-**116** to statistical model **108** based on user feedback from users of the electronic devices. Alternatively, the trainers may generate updates **114**-**116** to statistical model **108** in a distributed fashion on different subsets of training and/or input data from server **102** and/or another centralized data source.

Each update may represent a local version of statistical model **108** that is adapted to input data on the corresponding trainer. In addition, the trainers may produce updates **114**-**116** from global versions (e.g., global version **1** **110**, global version x **112**) of statistical model **108**. For example, a trainer may periodically receive a latest global version of statistical model **108** from server **102**. The trainer may then generate a local version as an update to the global version by providing user feedback and/or other input data as training data to the global version.

In turn, the trainers may transmit updates **114**-**116** to server **102**, and server **102** may merge updates **114**-**116** into subsequent global versions of statistical model **108**. After a new global version of statistical model **108** is created, server **102** may transmit the new global version to the trainers to propagate updates **114**-**116** included in the new global version to the trainers. The trainers may then produce additional updates from the new global version and provide the additional updates to server **102** for subsequent generation of additional global versions of statistical model **108**.

Thus, statistical model **108** may be iteratively trained through the bidirectional transmission of data from server **102** to the trainers and from the trainers to server **102**. Moreover, alternating the creation of global versions of statistical model **108** from input data aggregated from multiple trainers with generation of local versions from the global versions on the trainers may prevent overfitting of statistical model **108** to input data on individual trainers.

In one or more embodiments, local versions of statistical model **108** are produced on the trainers to personalize statistical model **108** to users of the trainers. More specifically, the trainers may obtain a global version of statistical model **108**, which tracks the behavior or preferences of all users, from server **102**. Each trainer may then update the global version in real-time based on user input or feedback from a user associated with the trainer, thereby creating a user-specific model for the user. For example, the trainer may track and/or receive the user's searches, clicks, likes, dislikes, views, text input, conversions, and/or other implicit or explicit feedback during a user session with a job search tool. As each piece of feedback is received from the user, the trainer may provide the feedback as training data for statistical model **108** to generate one or more updates (e.g., updates **114**-**116**) that customize the output of statistical model **108** to the user's current job search activity. Consequently, the trainer may generate recommendations of job listings based on aggregated training data used to produce the global version, as well as the user's input during the current session with the job search tool.

Alternatively, some or all local versions of statistical model **108** may be personalized to other types of entities. Continuing with the previous example, local versions of statistical model **108** may also include job-specific models that identify the relevance or attraction of the corresponding job listings to certain user features. Thus, a given job listing may be recommended to a particular user based on a propensity score that combines output from the corresponding job-specific model, the user-specific model for the user, and the global version of statistical model **108**.

In one or more embodiments, server **102** and the trainers perform asynchronous distributed machine learning, in which barriers or locks for synchronizing the updating of statistical model **108** are fully removed. For example, server **102** may update statistical model **108** by producing global versions of statistical model **108** and transmitting the global versions to the trainers independently from receiving updates **114**-**116** to the global versions from the trainers. Since updates to the global versions are not affected by variations in the processing speed, computational power, and/or network delay of individual trainers, statistical model **108** may be updated faster than distributed machine learning techniques that include barriers or locks for synchronizing statistical model updates.

More specifically, a version-management apparatus **132** in server **102** may track global versions (e.g., global version **1** **110**, global version x **112**) of statistical model **108** using a set of version identifiers (e.g., version identifier **1** **122**, version identifier x **124**). Each version identifier may represent a given global version of statistical model **108**, which is created by a merging apparatus **130** that merges a subset of updates **114**-**116** from the trainers into one or more previous global versions of statistical model **108**.

To track the subset of updates that have been merged into each global version, the corresponding version identifier may be generated from a set of update identifiers (e.g., update identifiers **1** **118**, update identifiers x **120**) for the subset of updates. For example, each update identifier may specify the trainer from which the corresponding update was received, as well as the global version of statistical model **108** used to produce the update. Version-management apparatus **132** may concatenate, hash, and/or otherwise combine update identifiers for a given subset of updates **114**-**116** into the version identifier for the global version that will be produced from the updates. In turn, merging apparatus **130** may use the version identifiers to ensure that all updates **114**-**116** from the trainers have been merged into the global versions of statistical model **108** while avoiding merging of each update more than once into the global versions.
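As a sketch of the identifier scheme described above, the following snippet combines a set of update identifiers into a version identifier by hashing their concatenation. The `trainer:version`-style identifier format, the sort-before-concatenate step, and the choice of SHA-256 are illustrative assumptions, not details of the embodiments:

```python
import hashlib

def version_identifier(update_ids):
    """Combine a subset of update identifiers into one version identifier.

    Sorting first makes the result independent of arrival order; hashing is
    one of the combination strategies (concatenate, hash, etc.) named above.
    """
    combined = "|".join(sorted(update_ids))
    return hashlib.sha256(combined.encode()).hexdigest()
```

Because the identifier is deterministic, the merging apparatus can compare it against previously recorded identifiers to avoid merging the same subset of updates twice.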

By tracking updates **114**-**116** to statistical model **108** in version identifiers for the global versions, version-management apparatus **132** may allow the global versions to be generated without synchronization barriers associated with updates **114**-**116**. Asynchronous updating of local and global versions of statistical models is described in a co-pending non-provisional application by inventors Xu Miao, Yitong Zhou, Joel D. Young, Lijun Tang and Anmol Bhasin, entitled “Version Control for Asynchronous Distributed Machine Learning,” having Ser. No. 14/864,474 and filing date 24 Sep. 2015 (Attorney Docket No. LI-P1583.LNK.US), which is incorporated herein by reference.

In one or more embodiments, merging apparatus **130** includes functionality to perform online tuning of hyperparameters (e.g., hyperparameters **1** **126**, hyperparameters y **128**) for some or all local and/or global versions of statistical model **108**. Unlike internal parameters (e.g., coefficients, weights, etc.) used by statistical model **108** to generate scores, classifications, recommendations, estimates, predictions, and/or other inferences or output, the hyperparameters may define “higher-level” properties of statistical model **108**.

For example, the hyperparameters may include a regularization parameter that controls the amount of personalization of each local version of statistical model **108**. When the regularization parameter is 0, the local version is fully personalized to the corresponding user and does not include any adaptation to the behavior of other users. Thus, a value of 0 for the regularization parameter may result in the creation of a local version of statistical model **108** that is completely separate from any global versions of statistical model **108**. As the regularization parameter increases, the personalization of the local version and convergence of statistical model **108** decrease.
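A minimal sketch of how such a regularization parameter might trade personalization against the global model, assuming the linear weighting of local and global output described later in this disclosure (the `blend_output` helper name is hypothetical):

```python
def blend_output(local_score, global_score, reg_param):
    """Weighted combination of local (personalized) and global model output.

    reg_param = 0 -> fully personalized (local output only);
    reg_param = 1 -> fully global (no personalization).
    """
    if not 0.0 <= reg_param <= 1.0:
        raise ValueError("regularization parameter must lie in [0, 1]")
    return (1.0 - reg_param) * local_score + reg_param * global_score
```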

In another example, the hyperparameters may include a convergence parameter that adjusts the rate of convergence of global versions of statistical model **108**, with a higher value for the convergence parameter resulting in a faster rate of convergence to a less optimal result. Thus, the convergence parameter may be selected to balance the convergence rate and the performance of statistical model **108**.

In a third example, the hyperparameters may include a clustering parameter that controls the amount of clustering (e.g., number of clusters) in a clustering technique and/or classification technique that utilizes clusters. In a fourth example, the hyperparameters may specify a feature complexity for features inputted into statistical model **108**, such as the number of topics or items in n-grams used during natural language processing. In a fifth example, the hyperparameters may include a model training parameter that controls training of statistical model **108**, such as a step size or momentum in a gradient descent technique. In a sixth example, the hyperparameters may include a model selection parameter that specifies the type of statistical model **108** used with the system.

As discussed in further detail below, such hyperparameter tuning may be performed after a pre-specified amount of user feedback and/or other training data is collected for use in producing updates **114**-**116** to the corresponding local versions of statistical model **108**. Prior to producing an update to a local version, a collected "batch" of input data may be used to calculate a performance metric that reflects the performance of the current version of statistical model **108**. Performance metrics related to multiple versions of statistical model **108** may then be provided to merging apparatus **130** and used with an optimization technique to update hyperparameters for various versions of statistical model **108** in a way that improves the performance metrics over time. Consequently, the system may perform continuous, online improvement of statistical model **108**.

Those skilled in the art will appreciate that the system may be implemented in a variety of ways.

For example, one or more instances of server **102** may be used to aggregate updates **114**-**116** from the clients (i.e., trainers) into global versions of statistical model **108**. If multiple instances of server **102** exist, each instance may be matched to a different subset of clients based on attributes of the instance and the clients. For instance, the network bandwidth, processor resources, and/or memory on a given instance of server **102** may be matched to a client based on the rate at which the client generates updates to statistical model **108**, the importance of the client, the revenue associated with the client, and/or other metrics or characteristics associated with the client. Different global versions of statistical model **108** produced by the instances may then be merged into a master global version through additional communication among the server **102** instances.

Moreover, individual instances of server **102** may be used to perform different types of training and/or updating related to statistical model **108**. For example, different instances of server **102** may be used to perform updating of hyperparameters and/or global versions of statistical model **108** for different applications (e.g., job recommendations, skill recommendations, ad recommendations, propensity scores, reputation scores, connection strengths, etc.).

The disclosed embodiments provide for online tuning of hyperparameters **210** in distributed machine learning. As mentioned above, hyperparameters **210** may define properties of a statistical model, such as statistical model **108**. For example, hyperparameters **210** may control regularization, convergence, clustering, feature complexity, model training, model selection, decay, thresholds, and/or optimization of other aspects of the statistical model.

In addition, the statistical model may have multiple local versions **202** and one or more global versions **204**. Individual local versions **202** may be personalized to specific users, recommendations, job listings, advertisements, content items, and/or other types of entities **218**. Output **212** from each local version may be displayed and/or otherwise presented to one or more users, and user feedback **206** and/or other input data related to output **212** may be collected and/or tracked. For example, a local version of the statistical model may be loaded during a user session with an online professional network. The local version may include one or more updates **222** to the statistical model that were generated based on historic and/or prior user feedback **206** from the user. Alternatively, the local version may be copied from the most recent global version of the statistical model, if input data related to the corresponding entity is lacking and/or a previous local version for the entity has been replaced by the global version.

Continuing with the previous example, the local version may output predictions, scores, and/or inferences related to job listings, advertisements, articles, potential connections, and/or other content shown within the online professional network during the user session. In turn, output **212** from the local version may be used to select a subset of the content as recommendations (e.g., job recommendations, connection recommendations within a “People You May Know” feature, content items within a “news feed,” advertisements, etc.) for display to the user during the user session. Output **212** may also, or instead, be provided to the user via channels outside the user session (e.g., email, text message, etc.) and/or used to modulate other types of interaction with the user.

User feedback **206** related to output **212** may additionally be collected during the user session as clicks, views, searches, likes, dislikes, comments, shares, applications to job listings, and/or other interaction with the online professional network. Each piece of user feedback **206** may be included in training data that is applied to parameters **224** of the local version to generate an update (e.g., updates **222**) to the local version. Consequently, the output of the local version may be adapted to the user's real-time behavior or preferences during the user session.

In one or more embodiments, updates **222** are made to local versions **202** of the statistical model by training local versions **202** on individual batches **214** containing pre-specified amounts of user feedback **206** and/or other input data associated with the corresponding entities. For example, each piece of user feedback **206** collected in response to output **212** may be classified as a positive response (e.g., a click, like, positive comment, share, upvote, follow, etc.) or a negative response (e.g., a view, dislike, downvote, negative comment, hide, unfollow, etc.). A batch of user feedback **206** may be defined as a pre-specified number of negative responses (e.g., 100) and/or any number of negative responses plus a positive response from a given user and/or for a given entity. After enough user feedback **206** is collected to form a batch for a given local version of the statistical model, the batch (e.g., batches **214**) may be provided as additional training data that is used to generate an update to parameters **224** (e.g., regression coefficients, neural network weights, etc.) of the local version.
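The batch-formation rule above can be sketched as follows; the set of positive response types and the `collect_batches` helper are illustrative assumptions:

```python
# Assumed classification of feedback events (illustrative, not exhaustive)
POSITIVE_RESPONSES = {"click", "like", "share", "upvote", "follow"}
NEGATIVE_LIMIT = 100  # pre-specified number of negatives that closes a batch

def collect_batches(feedback_events):
    """Group a stream of feedback events into training batches.

    A batch closes as soon as a positive response arrives (any number of
    negatives plus one positive), or once NEGATIVE_LIMIT negative
    responses accumulate.
    """
    batch, negatives = [], 0
    for event in feedback_events:
        batch.append(event)
        if event in POSITIVE_RESPONSES:
            yield batch
            batch, negatives = [], 0
        else:
            negatives += 1
            if negatives >= NEGATIVE_LIMIT:
                yield batch
                batch, negatives = [], 0
```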

Updates **222** to parameters **224** of multiple local versions **202** of the statistical model may periodically be merged into a new global version of the statistical model. For example, each update to a local version of the statistical model may be transmitted by a trainer to a server, such as server **102**, after the update is generated from user feedback **206** collected during individual user sessions with a set of users. Thus, the statistical model may be continuously updated through the creation of local versions **202** of the statistical model from global versions of the statistical model by the trainers and the subsequent merging of the personalized versions into new global versions of the statistical model by the server.

Batches **214** of user feedback **206** may also be used to calculate performance metrics **208** for the corresponding local versions **202**. More specifically, each batch of user feedback **206** may be used as input data to a corresponding local version of the statistical model prior to using the batch to update the local version. Output **212** generated by the local version in response to the input data and labels (e.g., outcomes) associated with the input data may then be used to calculate a receiver operating characteristic (ROC) area under the curve (AUC) as a performance metric for the local version.
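The ROC AUC used as the performance metric can be computed directly from scores and outcome labels; one standard formulation is the rank-sum (Mann-Whitney) estimate sketched below (the helper name is hypothetical):

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of positive/negative pairs that the model
    orders correctly, counting ties as half a win (rank-sum formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 1.0 means every positive outcome was scored above every negative one; 0.5 corresponds to random ordering.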

In one or more embodiments, the contribution of output **212** to a performance metric for a local version is discounted based on the age of the corresponding batches **214** of user feedback **206** used to produce output **212**. Continuing with the previous example, the performance metric may be calculated as a weighted average of ROC AUC values for the local version, with each ROC AUC value calculated by the local version from a different batch of user feedback **206**. Within the weighted average, each AUC value may be multiplied by a weight that decreases as the age of the batch of user feedback **206** from which the AUC value is calculated increases. For example, each historic batch may have a weight that decays or discounts by a factor of 0.95 every time a new batch is received and/or a weight that decays with time in seconds or minutes. To reduce noise and/or unnecessary computation associated with the performance metric, a batch of user feedback **206** may be removed from the calculation of the performance metric once the weight drops below an eviction threshold. Each ROC AUC value may also, or instead, be calculated using individual data points that are discounted by time.
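The decay-and-evict bookkeeping described above might be sketched as follows; the 0.95 decay factor comes from the example, while the eviction threshold of 0.05 and the class name are assumptions for illustration:

```python
DECAY = 0.95               # per-batch discount factor from the example above
EVICTION_THRESHOLD = 0.05  # assumed cutoff for dropping stale batches

class DiscountedMetric:
    """Weighted average of per-batch ROC AUC values.

    Every stored weight decays by DECAY when a new batch arrives; batches
    whose weight falls below EVICTION_THRESHOLD are evicted.
    """

    def __init__(self):
        self._entries = []  # list of (auc, weight) pairs

    def add_batch(self, auc):
        # Age existing batches and evict those below the threshold.
        self._entries = [(a, w * DECAY) for a, w in self._entries
                         if w * DECAY >= EVICTION_THRESHOLD]
        self._entries.append((auc, 1.0))

    def value(self):
        total = sum(w for _, w in self._entries)
        return sum(a * w for a, w in self._entries) / total
```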

After an updated performance metric is calculated for a local version of the statistical model (e.g., using a batch of recently received user feedback **206** and/or batches **214** of historic user feedback **206** related to the local version), the corresponding client and/or trainer task may transmit the performance metric to the server. In turn, the server may incrementally update hyperparameters **210** using batches **216** of performance metrics **208**, with each batch containing a pre-specified number of performance metrics generated from multiple batches **214** of user feedback **206** for various local versions **202** of the statistical model. For example, an update to hyperparameters **210** may be made after 100 performance metrics **208** are collected from various local versions **202** of the statistical model.

More specifically, an optimization technique such as particle swarm optimization may be applied to a given batch of performance metrics **208** to explore the search space for hyperparameters **210**. The particle swarm optimization may utilize a set of “particles” to explore the search space; within the search space, each particle may have a position representing a different set of values for hyperparameters **210** (i.e., a candidate solution for the optimization technique), as well as a velocity that is used to iteratively update the particle's position. The position and/or velocity may be initially randomized and/or set to a default value. Attributes of the particles (e.g., number of particles, particle momentum, initial positions and velocities, etc.) may be specified by one or more “hyper-hyperparameters,” which may be fixed or tuned separately.

Entities **218** for which hyperparameters **210** are tuned may be represented as dimensions **220** in the search space of hyperparameters **210**. For example, the particle swarm optimization technique may model the search space with a different dimension for each entity for which a local or personalized version of the statistical model is generated. When a new entity (e.g., user, content item, job listing, advertisement, recommendation, etc.) is created or added for use with the statistical model, a new dimension representing the entity may be added to the search space for hyperparameters **210**, and all particles in the search space may be updated with default and/or randomized positions and velocities within the new dimension.

After a batch of performance metrics **208** is received by the server, the server may use the batch to update the particles' positions and/or velocities so that each particle is deflected toward the current global best position for the entire set of particles, as well as the individual particle's historic best position in the search space. Continuing with the previous example, each particle may include a position represented by an n-dimensional vector for "n" different entities. Each element in the vector may store a regularization parameter value ranging from 0 to 1 to customize the level of personalization of the statistical model to the corresponding entity. Thus, a user, job listing, advertisement, and/or other entity with features that are relatively distinct from other entities **218** may have a regularization parameter that is closer to 0 to increase the level of personalization associated with the corresponding local version. On the other hand, an entity with characteristics that overlap significantly with those of other entities **218** may have a regularization parameter that is closer to 1 to mitigate overfitting of the statistical model to the entity.
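A single particle update under this scheme might look like the following sketch. The inertia, cognitive, and social coefficients are assumed "hyper-hyperparameter" values, and clamping each coordinate to [0, 1] reflects the regularization-parameter range described above:

```python
import random

# "Hyper-hyperparameters" of the swarm (assumed values for illustration)
INERTIA, COGNITIVE, SOCIAL = 0.7, 1.5, 1.5

def pso_step(position, velocity, personal_best, global_best, rng=random):
    """One particle-swarm update: deflect the particle toward its own
    historic best position and the swarm's global best, then clamp each
    coordinate (a per-entity regularization value) to [0, 1]."""
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, global_best):
        v = (INERTIA * v
             + COGNITIVE * rng.random() * (pb - x)
             + SOCIAL * rng.random() * (gb - x))
        x = min(1.0, max(0.0, x + v))
        new_position.append(x)
        new_velocity.append(v)
    return new_position, new_velocity
```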

When a new entity is created or introduced for personalization by the statistical model, a separate regularization parameter value for the entity may be tracked by adding a new element or "dimension" representing the entity to all vectors representing the positions of particles in the particle swarm optimization. The new element may initially be set to a default position and velocity of 0 for all particles. Alternatively, the position and velocity of the new element may be set to randomized values and/or to the position and velocity of the previous best-performing version of the statistical model. To enable optimization of the regularization parameter value for the entity, the position and velocity along the new dimension may then be randomized and/or updated according to the positions, velocities, and/or performance metrics **208** associated with the particles in the search space.
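Adding the new dimension amounts to appending a default element to every particle's position and velocity vectors, as in this sketch (the dictionary-based particle representation is illustrative):

```python
def add_entity_dimension(particles, default_position=0.0, default_velocity=0.0):
    """Extend every particle with one new coordinate for a newly added entity.

    Defaults of 0 follow the embodiment above; randomized values or the
    best-performing position are the alternatives it mentions.
    """
    for particle in particles:
        particle["position"].append(default_position)
        particle["velocity"].append(default_velocity)
    return particles
```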

When a batch of user feedback **206** is received for a given entity, the corresponding trainer may calculate multiple performance metrics **208** for the local version associated with the entity using the batch, with each performance metric calculated using a hyperparameter value from a different particle in the particle swarm optimization. Continuing with the previous example, the trainer may obtain all current particle positions in the hyperparameter search space from the server and obtain a set of regularization parameter values for the entity from the dimension representing the entity in the particles. The trainer may use the local version, the most recent global version of the statistical model, the batch of user feedback **206**, and the regularization parameter values to calculate a set of performance metrics **208** as a set of ROC AUCs for the entity across all particle positions. More specifically, a regularization parameter value from each particle position may be used to produce an overall output from the statistical model for the entity as a weighted combination of the output of the local version and the output of the global version. The overall output may then be compared with outcomes associated with the corresponding input data (e.g., user feedback **206**) to calculate a true positive rate and a false positive rate for each batch of input data for the entity. In turn, the true positive and false positive rates for multiple batches **214** of input data may be used to produce a set of ROC AUCs that are averaged or otherwise aggregated to generate a single performance metric for the entity and particle position. During aggregation of the ROC AUCs, each ROC AUC may also be weighted so that ROC AUCs from older batches **214** of user feedback **206** are discounted according to the age of each batch, as described above.

After performance metrics **208** for all particles are calculated from a batch of user feedback **206** for a given entity, the trainer may transmit the calculated performance metrics **208** to the server. The server may then aggregate performance metrics for all entities from all trainers into an average performance metric across all entities for each particle position. As with the calculation of per-entity performance metrics **208**, the average performance metric for each particle may optionally include time-based discounting of performance metrics **208**.

Finally, after the server has received enough performance metrics **208** to form a batch (e.g., after new performance metrics **208** are received for 100 entities), the server may fold the batch into the average performance metrics and use the updated averages to update the positions and velocities of the particles. As mentioned above, the particle swarm optimization technique may update the positions and velocities of the particles in the search space for hyperparameters **210** so that each particle is guided toward the position with the best performance in the entire swarm, as well as toward the particle's historic best-performing position. The best-performing particle position may also be used to populate an updated set of parameters **224** and/or hyperparameters **210** for local versions **202** and/or global versions **204** of the statistical model. For example, a vector of values representing the best-performing position may be transmitted as regularization parameter values for all entities **218** to execution engines for the statistical model, and the execution engines may use the transmitted values to tune the level of personalization of the statistical model to each entity. The server and/or execution engines may also optionally replace parameters **224** of a subset of local versions **202** and/or global versions **204** whose performance metrics **208** fall below a threshold with parameters **224** of higher-performing versions of the statistical model.

After the average performance metrics have been updated with a batch of performance metrics **208**, the server may optionally replace the positions of a subset of “underperforming” particles (i.e., particles with positions that produce low performance metrics **208**) with the positions and/or velocities of one or more of the highest performing particles (e.g., particle positions that produce the highest performance metrics **208**). The new positions of the particles may represent new hyperparameters **210** that are applied to local and/or global versions of the statistical model. In turn, the new hyperparameters **210** may be used to evaluate the performance of the new positions during subsequent updating of hyperparameters **210** and/or various versions of the statistical model. Consequently, hyperparameters **210** may iteratively be improved using batches **216** of performance metrics **208** for multiple versions of the statistical model, which in turn are calculated from batches **214** of user feedback **206** and/or other input data as the input data is received.

Those skilled in the art will appreciate that hyperparameters **210** may be updated in other ways. For example, batches **216** of performance metrics **208** may be used with a grid search, Bayesian optimization, gradient-based optimization, a meta-optimization technique, and/or another optimization technique. In another example, the dimensionality of the search space for hyperparameters **210** may be reduced by clustering entities with similar features and tuning hyperparameter values for each cluster of entities instead of for individual entities. A separate global version may also be developed for each cluster to improve personalization for entities in the cluster while enabling training of the global version from the larger data set for the cluster.
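The clustering alternative may be sketched with a minimal k-means loop; the function name, the choice of k-means itself, and the iteration count are assumptions, since the specification does not prescribe a clustering algorithm:

```python
import numpy as np

def cluster_entities(features, n_clusters=2, n_iter=10, rng=None):
    """Tiny k-means sketch: group entities with similar feature vectors so
    one hyperparameter dimension is tuned per cluster instead of per entity."""
    if rng is None:
        rng = np.random.default_rng(0)
    centers = features[rng.choice(len(features), n_clusters,
                                  replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each entity to its nearest cluster center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = features[assign == k].mean(axis=0)
    return assign  # search space now has n_clusters dimensions, not n_entities
```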

Initially, input data for a set of versions of a statistical model for a set of entities is used to calculate a batch of performance metrics for the versions (operation **302**). For example, a performance metric for each version may be calculated as an aggregate ROC AUC that is based on the true positive and false positive rates associated with a set of outputs generated by the version from a set of input data, as well as labels (e.g., outcomes) associated with the input data. The performance metric may be calculated after a pre-specified amount of input data is received for the corresponding entity (e.g., as user feedback and/or another stream of data associated with the entity). The ROC AUC may also be calculated in a way that discounts contributions of the outputs to the ROC AUC based on the age of the input data used to produce the outputs (e.g., by multiplying each historic “batch” of input data by a discount factor every time a new batch is received). A pre-specified number of performance metrics calculated for various versions of the statistical model may then be aggregated into a batch.

Next, the input data is used to produce a second set of versions of the statistical model for the entities (operation **304**). For example, after a pre-specified amount of input data is received for a given entity, the pre-specified amount is used as training data to produce an update to a local version of the statistical model for the entity. The updated local version and updates to other local versions of the statistical model for other entities may then be merged into a global version of the statistical model, asynchronously from generating the updates to the local versions.
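A minimal sketch of operation 304 follows. `LocalTrainer` and `merge_into_global` are hypothetical names, and the squared-error gradient step stands in for whatever per-entity training procedure a given embodiment uses:

```python
import numpy as np

class LocalTrainer:
    """Buffers input data for one entity; after a pre-specified amount
    arrives, emits an update (delta) to the entity's local version."""
    def __init__(self, params, batch_size=4, lr=0.5):
        self.params, self.batch_size, self.lr = params, batch_size, lr
        self.buffer = []

    def observe(self, x, y):
        self.buffer.append((x, y))
        if len(self.buffer) < self.batch_size:
            return None                      # not enough data yet
        X = np.array([x for x, _ in self.buffer])
        Y = np.array([y for _, y in self.buffer])
        self.buffer.clear()
        # Hypothetical squared-error gradient step as the "update".
        grad = X.T @ (X @ self.params - Y) / len(Y)
        delta = -self.lr * grad
        self.params = self.params + delta
        return delta

def merge_into_global(global_params, deltas):
    """Fold queued local deltas into the global version, asynchronously
    from the trainers that produced them."""
    for d in deltas:
        global_params = global_params + d
    return global_params
```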

An optimization technique is then applied to the batch of performance metrics to produce updates to a set of hyperparameters for the statistical model (operation **306**), as described in further detail below.

The updates are also used to modulate execution of the second set of versions (operation **308**). For example, a regularization parameter may be used to customize the level of personalization associated with each local version of the statistical model, while a convergence parameter may adjust the rate of convergence of global versions of the statistical model.

When a new entity is added to the set of entities, the hyperparameters are updated with a new dimension for the new entity (operation **310**). For example, a vector of hyperparameter values may be updated with a new element representing the entity, which is initialized with a default value. During subsequent updating of the hyperparameters using the optimization technique, the element may be randomized to allow the performance of the default value to be compared with the randomized value. In turn, the performance comparison may facilitate subsequent optimization of the element's value with other hyperparameter values in the search space (e.g., using subsequent batches of performance metrics). Alternatively, the element may be initialized with a random value and subsequently updated using a velocity that is also randomized with respect to the new dimension.
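Operation 310 may be sketched as growing every particle by one dimension; the function name and the default/random value ranges are illustrative assumptions:

```python
import numpy as np

def add_entity_dimension(positions, velocities, default=1.0, rng=None):
    """Grow every particle by one dimension for a newly added entity.
    Positions get the default hyperparameter value and velocities get a small
    random component so the new dimension is explored on later updates."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(positions)
    new_pos = np.full((n, 1), default)
    new_vel = rng.uniform(-0.1, 0.1, size=(n, 1))  # randomized exploration
    return (np.hstack([positions, new_pos]),
            np.hstack([velocities, new_vel]))
```

Initializing the new element with a random position instead of `default` is the alternative mentioned above; only the `np.full` line would change.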

First, a set of particles is used to explore a search space for the hyperparameters (operation **402**). Each particle may have a position and a velocity in the search space, with the coordinates of the position representing a candidate solution for optimizing the hyperparameters and the velocity used to iteratively update the position. In addition, each dimension in the position may represent a different entity for which a hyperparameter value is to be optimized, as discussed above.

Next, the calculated performance metrics are used to update a set of average performance metrics for positions of the particles (operation **404**). The calculated performance metrics may be produced from batches of input data to multiple versions of the statistical model, such as local and/or global versions of the statistical model. In turn, the calculated performance metrics may be incorporated into a running “average” performance metric across all entities (e.g., as represented by different versions of the statistical model) for each particle. The running average may be calculated from current and historic batches of input data for each entity, with older batches optionally discounted by the age of the batch. In turn, the updated average performance metrics are used to identify the particle position with the highest average performance metric (operation **406**).

The average performance metrics are also used to update positions and velocities of the particles in the search space. More specifically, a first subset of particle positions with average performance metrics that fall below a threshold is removed (operation **408**) from the search space. The first subset may include particle positions with average performance metrics that do not meet a numeric threshold and/or a pre-specified number of particles or percentage of all particles with the lowest average performance metrics. The removed particle positions are also replaced with the identified particle position (e.g., the particle position with the highest average performance metric) (operation **410**). One or more particle positions may also, or instead, be replaced with random values and/or using the positions of other high-performing particles in the search space. Each particle is then deflected toward a global best position for the set of particles and a historic best position for the particle (operation **412**).

Finally, the hyperparameters are updated with values represented by the particle position (operation **414**). For example, the hyperparameters may be updated with coordinate values from the particle's position in the search space. The updated hyperparameters may then be used to modulate subsequent execution of the corresponding versions of the statistical model, as discussed above.

Computer system **500** includes a processor **502**, memory **504**, storage **506**, and/or other components found in electronic computing devices. Processor **502** may support parallel processing and/or multi-threaded operation with other processors in computer system **500**. Computer system **500** may also include input/output (I/O) devices such as a keyboard **508**, a mouse **510**, and a display **512**.

Computer system **500** may include functionality to execute various components of the present embodiments. In particular, computer system **500** may include an operating system (not shown) that coordinates the use of hardware and software resources on computer system **500**, as well as one or more applications that perform specialized tasks for the user. To perform tasks for the user, applications may obtain the use of hardware resources on computer system **500** from the operating system, as well as interact with the user through a hardware and/or software framework provided by the operating system.

In one or more embodiments, computer system **500** provides a system for performing online tuning of hyperparameters in distributed machine learning. The system may include a training apparatus and a merging apparatus, one or both of which may alternatively be termed or implemented as a module, mechanism, or other type of system component. The merging apparatus may use input data for a first set of versions of a statistical model for a set of entities to calculate a batch of performance metrics for the first set of versions. Next, the merging apparatus may apply an optimization technique to the batch to produce updates to a set of hyperparameters for the statistical model. When a new entity is added to the set of entities, the merging apparatus may update the set of hyperparameters with a new dimension for the new entity. The training apparatus may use the input data to produce a second set of versions of the statistical model and use the updates to modulate the execution of the second set of versions.

In addition, one or more components of computer system **500** may be remotely located and connected to the other components over a network. Portions of the present embodiments (e.g., server, trainers, version-management apparatus, merging apparatus, local versions, global versions, personalized versions, etc.) may also be located on different nodes of a distributed system that implements the embodiments. For example, the present embodiments may be implemented using a cloud computing system that performs asynchronous distributed machine learning and/or online hyperparameter tuning for multiple remote versions of a statistical model.

The foregoing descriptions of various embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.

## Claims

1. A method, comprising:

- using input data for a first set of versions of a statistical model for a set of entities to calculate, by one or more computer systems, a batch of performance metrics for the first set of versions;

- applying, by the one or more computer systems, an optimization technique to the batch to produce updates to a set of hyperparameters for the statistical model;

- using the updates to modulate the execution of a second set of versions of the statistical model for the set of entities; and

- when a new entity is added to the set of entities, updating the set of hyperparameters with a new dimension for the new entity.

2. The method of claim 1, further comprising:

- using the input data to produce the second set of versions of the statistical model.

3. The method of claim 2, wherein using the input data to produce the second set of versions of the statistical model comprises:

- when a pre-specified amount of the input data is received for an entity in the set of entities, using the pre-specified amount to produce an update to a version of the statistical model for the entity.

4. The method of claim 3, wherein using the input data to produce the second set of versions of the statistical model further comprises:

- merging, into a global version of the statistical model, the update and other updates to other versions of the statistical model for other entities in the set of entities asynchronously from generating the update and the other updates.

5. The method of claim 1, wherein using the input data to calculate the set of performance metrics associated with the first set of versions of the statistical model comprises:

- using a set of outputs generated from a subset of the input data for a version of the statistical model and a set of labels associated with the subset of input data to calculate a performance metric for the version; and

- discounting contributions of the outputs to the performance metric based on a set of ages associated with the subset of the input data.

6. The method of claim 1, wherein applying the optimization technique to the set of performance metrics to produce the updates to the set of hyperparameters for the statistical model comprises:

- using a set of particles to explore a search space for the hyperparameters;

- using the calculated performance metrics to update a set of average performance metrics for a set of positions of the particles;

- identifying, from the set of positions, a particle position with a highest average performance metric;

- updating the hyperparameters with values represented by the particle position; and

- using the average performance metrics to update positions and velocities of the particles in the search space.

7. The method of claim 6, wherein using the average performance metrics to update the positions and the velocities of the particles in the search space comprises:

- removing a first subset of the positions with average performance metrics that fall below a threshold;

- replacing the first subset with the position of the particle; and

- for each particle in the set of particles, deflecting the particle toward a global best position for the set of particles and a historic best position for the particle.

8. The method of claim 1, wherein updating the set of hyperparameters with a new dimension for the new entity comprises at least one of:

- updating the hyperparameters with a default value for the new dimension; and

- updating the new dimension with a random value.

9. The method of claim 1, wherein the set of hyperparameters comprises at least one of:

- a regularization parameter;

- a clustering parameter;

- a convergence parameter;

- a feature complexity;

- a model training parameter;

- a model selection parameter;

- a decay parameter;

- a threshold; and

- a hyper-hyperparameter.

10. The method of claim 1, wherein the set of entities comprises at least one of:

- a user;

- an advertisement; and

- a recommendation.

11. An apparatus, comprising:

- one or more processors; and

- memory storing instructions that, when executed by the one or more processors, cause the apparatus to: use input data for a first set of versions of a statistical model for a set of entities to calculate a batch of performance metrics for the first set of versions; apply an optimization technique to the batch to produce updates to a set of hyperparameters for the statistical model; use the updates to modulate the execution of a second set of versions of the statistical model for the set of entities; and when a new entity is added to the set of entities, update the set of hyperparameters with a new dimension for the new entity.

12. The apparatus of claim 11, wherein the memory further stores instructions that, when executed by the one or more processors, cause the apparatus to:

- use the input data to produce the second set of versions of the statistical model.

13. The apparatus of claim 12, wherein using the input data to produce the second set of versions of the statistical model comprises:

- when a pre-specified amount of the input data is received for an entity in the set of entities, using the pre-specified amount to produce an update to a version of the statistical model for the entity.

14. The apparatus of claim 13, wherein using the input data to produce the second set of versions of the statistical model further comprises:

- merging, into a global version of the statistical model, the update and other updates to other versions of the statistical model for other entities in the set of entities asynchronously from generating the update and the other updates.

15. The apparatus of claim 11, wherein using the input data to calculate the set of performance metrics associated with the first set of versions of the statistical model comprises:

- using a set of outputs generated from a subset of the input data for a version of the statistical model and a set of labels associated with the subset of input data to calculate a performance metric for the version; and

- discounting contributions of the outputs to the performance metric based on a set of ages associated with the subset of the input data.

16. The apparatus of claim 11, wherein applying the optimization technique to the set of performance metrics to produce the updates to the set of hyperparameters for the statistical model comprises:

- using a set of particles to explore a search space for the hyperparameters;

- using the calculated performance metrics to update a set of average performance metrics for a set of positions of the particles;

- identifying, from the set of positions, a particle position with a highest average performance metric;

- updating the hyperparameters with values represented by the particle position; and

- using the average performance metrics to update positions and velocities of the particles in the search space.

17. The apparatus of claim 16, wherein using the average performance metrics to update the positions and the velocities of the particles in the search space comprises:

- removing a first subset of the positions with average performance metrics that fall below a threshold;

- replacing the first subset with the position of the particle; and

- for each particle in the set of particles, deflecting the particle toward a global best position for the set of particles and a historic best position for the particle.

18. The apparatus of claim 11, wherein updating the set of hyperparameters with a new dimension for the new entity comprises at least one of:

- updating the hyperparameters with a default value for the new dimension; and

- updating the new dimension with a random value.

19. A system, comprising:

- a merging module comprising a non-transitory computer-readable medium storing instructions that, when executed, cause the system to: use input data for a first set of versions of a statistical model for a set of entities to calculate a batch of performance metrics for the first set of versions; apply an optimization technique to the batch to produce updates to a set of hyperparameters for the statistical model; and when a new entity is added to the set of entities, update the set of hyperparameters with a new dimension for the new entity; and

- a training module comprising a non-transitory computer-readable medium storing instructions that, when executed, cause the system to: use the input data to produce a second set of versions of the statistical model; and use the updates to modulate the execution of the second set of versions.

20. The system of claim 19, wherein applying the optimization technique to the set of performance metrics to produce the updates to the set of hyperparameters for the statistical model comprises:

- using a set of particles to explore a search space for the hyperparameters;

- using the calculated performance metrics to update a set of average performance metrics for a set of positions of the particles;

- identifying, from the set of positions, a particle position with a highest average performance metric;

- updating the hyperparameters with values represented by the particle position; and

- using the average performance metrics to update positions and velocities of the particles in the search space.

**Patent History**

**Publication number**: 20180285759

**Type**: Application

**Filed**: Apr 3, 2017

**Publication Date**: Oct 4, 2018

**Applicant**: LinkedIn Corporation (Sunnyvale, CA)

**Inventors**: Ian B. Wood (Bloomington, IN), Xu Miao (Los Altos, CA), Chang-Ming Tsai (Fremont, CA), Joel D. Young (Milpitas, CA)

**Application Number**: 15/477,782

**Classifications**

**International Classification**: G06N 7/02 (20060101); G06N 3/04 (20060101); G06F 15/18 (20060101); G06N 3/08 (20060101); G06N 99/00 (20060101); G06F 17/30 (20060101);