MACHINE LEARNING ENGINE FOR DETERMINING DATA SIMILARITY

Info

Publication number: 20230124411
Type: Application
Filed: Jan 25, 2022
Publication Date: Apr 20, 2023
Inventors: Philip Frederik Sommer (New York, NY), Stefano Pasquali (New York, NY), Jerinsh Jeyapaulraj (Harrison, NJ), Yu-Li Chu (New York, NY)
Application Number: 17/583,917

Abstract

A system and method for training and using a machine-learning similarity framework are provided. During training, the similarity framework generates an ensemble of tress. The trees have different properties at each node. The similarity framework uses the ensemble of trees to determine similarity between objects. The objects are propagated through nodes of each tree in the ensemble of trees until the objects reach leaf nodes. The objects are propagated by comparing the properties at each node of the tree to the features of the objects until the objects reach the leaf nodes. The similarity framework determines a similarity score for a pair of objects in each tree and adjusts the similarity score by tree importance. The object similarity score is determined by combining the similarity scores from multiple trees in the ensemble of trees. The similarity framework generates a similarity matrix that stores object similarity scores for multiple pairs of objects.

Description

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. 119 to U.S. Provisional Application No. 63/256,129, filed Oct. 15, 2021, which is hereby incorporated by reference herein in its entirety.

TECHNICAL FIELD

The embodiments are directed to machine learning, and more particularity to a machine learning system for identifying object similarity.

BACKGROUND

Conventionally, similarity between two objects is determined using unsupervised learning techniques. These conventional techniques identity features of the objects, transform the features into a high-dimensional feature space, and use a clustering or K-nearest similarity algorithm to identify similarity of the objects.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified diagram of a computing device for implementing a similarity framework, according to some embodiments.

FIGS. 2A-C are simplified diagrams of a similarity framework, according to some embodiments.

FIGS. 3A-B are simplified diagrams of an ensemble of trees, according to some embodiments.

FIGS. 4A-J are simplified diagrams of a pseudo-code for training and using the similarity framework, according to some embodiments.

FIG. 5 is a simplified diagram of a method for determining an ensemble of trees in a similarity framework, according to some embodiments.

FIG. 6 is a simplified diagram of a method for determining similarity between objects, according to some embodiments.

FIG. 7 is a block diagram of a computing system for identifying securities with similar characteristics, according to some embodiments.

FIG. 8 is a block diagram illustrating a computing system for identifying substitute securities, according to some embodiments.

FIG. 9 is a block diagram illustrating a computing system for generating and pricing clusters, according to some embodiments.

FIG. 10 is a simplified diagram of a method for substituting a security with a similar security, according to some embodiments.

FIG. 11 is a simplified diagram of a method for determining clusters of securities, according to some embodiments.

In one or more implementations, not all of the depicted components in each figure may be required, and one or more implementations may include additional components not shown in a figure. Variations in the arrangement and type of the components may be made without departing from the scope of the subject disclosure. Additional components, different components, or fewer components may be utilized within the scope of the subject disclosure.

DETAILED DESCRIPTION

A similarity framework can be used to identify relationships between objects and evaluate the strength between them. The output of the similarity framework may be a similarity matrix. The similarity matrix is a symmetric matrix of n×n rows and columns representing objects. An element of the matrix is a similarity score between two objects identified by the row and column of the matrix. The similarity score identifies the strength of a relationship between the two objects.

The similarity framework, such as the one described in the embodiments below, may be used to identify similarity between different types of objects. When objects are images, and the similarity framework identifies similar images, one image may replace another to be used in, e.g., image recognition systems. When objects are articles, similar articles may be identified to determine current trends. When objects are documents, similar documents may identify plagiarism. When objects are transactions, similar or dissimilar transactions may identify fraud. The similarity framework may also be used in various natural language processing tasks including text summarization, translation, etc. The similarity framework may be used to identify similar securities and substitute a security of interest with another security with similar characteristics. This has applications in trading and liquidity when, for example, a bond cannot be sourced from the market or, in another example, in portfolio construction where one or more securities may be replaced with other securities that are mostly similar but with more desirable properties or characteristics.

The similarity framework may include a supervised machine learning algorithm, such as a Gradient Boosting Machines (GBM) algorithm. The GBM algorithm may train an ensemble of decision trees using a training dataset that includes features of different objects. Once the similarity framework is trained, the similarity framework receives objects. The objects are propagated through each tree in the ensemble of decision trees until the objects reach the leaf nodes. The GBM algorithm may compute the leaf node of every tree in the ensemble that corresponds to the object. Thereafter, the similarity between two objects is defined as the percentage of trees in the ensemble where the two objects fall into the same leaf node. For example, the similarity framework may assign a similarity score of one when two objects share the same leaf node, otherwise the similarity framework may assign a score of zero. In another example, instead of assigning a score that is zero or one, the score between the two objects in the same tree may vary from zero to one based on the height of the deepest node in the tree that the objects share and the height of the tree. This means that if the two objects share a leaf node, the score may be one, if the objects split at the root, the score may be zero, or if the objects split elsewhere in the tree, the score may be a number between zero and one, based on d_c/d where d_cis the depth of the deepest node that the objects have in common and d is the depth of the entire tree.

In some embodiments, the similarity framework may assign different weights to the scores from different trees. The weights may be assigned based on the importance of the tree in the ensemble of trees compared to other trees in the ensemble of trees. The weight associated with each tree may be based on a reduction in the training error contributed by that tree to the ensemble of trees.

The output of the similarity framework may be a similarity matrix. The similarity matrix may include object similarity scores for pairs of objects determined from the ensemble of trees. Each object similarity score may be a combination of similarity scores generated by each tree in the ensemble of trees.

FIG. 1 is a simplified diagram of a computing device 100, according to some embodiments. As shown in FIG. 1, computing device 100 includes a processor 110 coupled to a memory 120. Operation of computing device 100 is controlled by processor 110. And although computing device 100 is shown with only one processor 110, it is understood that processor 110 may be representative of one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs) and/or the like in computing device 100. Computing device 100 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.

Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.

Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities. Although illustrated as a single processor 110 and a single memory 120, the embodiments may be executed on multiple processors and stored in multiple memories.

In some examples, memory 120 may include a non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. In some embodiments, memory 120 may store a similarity framework 130. Similarity framework 130 may be trained using machine learning to identify similarity between objects. Similarity between objects may include a same or similar characteristics or a set of same or similar characteristics that satisfy an objective. Similarity may be quantified by an object similarity score. Similarity framework 130 may receive objects 140 as input. Using the objects 140, similarity framework 130 may generate a similarity matrix 150 that includes similarity scores for the objects. A similarity score between a pair of objects in the similarity matrix 150 may identify similarity between the pair of objects.

FIG. 2A is a block diagram 200A of similarity framework 130, according to some embodiments. Similarity framework 130 may include an ensemble of trees 202. Trees in the ensemble of trees 202 may be decision trees, where the decision at each node is based on one or more properties associated with the node. The similarity framework 130 may be trained to identify trees in the ensemble of trees 202 as well as the properties that may be included in decision nodes of each tree. Each tree in the ensemble of trees may be a binary tree.

In some embodiments, the trees are trained using training loss. For example, if there are k number of trees, the set of training loss (TL) at each step may be defined as follows:

TL=(TL₀, TL₁, . . . , TL_K−1) Equation (1)

In some embodiments, the training loss may be a monotonically decreasing set of numbers that reflects that the training loss decreases with every step or tree added to the ensemble of trees 202. The training loss at each step may be a result of the performance of all the trees that preceded that step.

Similarity framework 130 may be trained to capture an importance of each tree in the ensemble of trees 202. The importance of a tree in the ensemble of trees 202 may be captured using an importance vector. To compute the importance vector, an absolute difference in the training loss is computed as follows:

s₀=|TL₁−TL₀| Equation (2)

s_i=|TL_i−TL_i−1|∀ i ∈ {1,2, . . . , K−1} Equation (3)

Using the absolute difference in the training loss, the final importance weight for a tree may then be determined as follows:

$\begin{matrix} w_{i} = \frac{s_{i}}{\sum_{i = 0}^{K - 1} s_{i}} & Equation (4) \end{matrix}$

Once trees in ensemble of trees 202 are identified and trained and the corresponding weights are determined, similarity framework 130 enters an inference stage. In the inference stage, similarity scores for different objects may be determined. For example, for a given ensemble of trees 202 (also referred to as an ensemble f), similarity framework 130 may determine similarity between two objects X₁and X₂as follows. First, similarity framework 130 may propagate the two objects X₁and X₂down all the trees within ensemble f by comparing features of the objects X₁and X₂to properties of the tree nodes of the trees until objects X₁and X₂reach the leaf nodes. Next, the terminal node position of object X₁and object X₂in each of the leaf nodes of the trees is recorded. Let Z₁=(Z₁₁, Z₁₂, . . . , Z_1K) be the tree node positions for object X₁and Z₂=(Z₂₁, Z₂₂, . . . , Z_2K) be tree positions of the leaf nodes for object X₂. Then, the similarity S between objects X₁and X₂in a tree may be determined as follows:

S(X₁, X₂)=Σ_i=0^K−1I(Z_1i==Z_2i)w_i Equation (5)

where I is the indicator function. The similarity score between objects X₁and X₂in a tree may then defined as:

D(X₁, X₂)=1−S(X₁,X₂) Equation (6)

By construction, D is a number that may range from 0 to 1. Similarity framework 130 repeats this process to determine similarity scores for multiple objectives in different trees in ensemble of trees 202, which results in multiple distances or tree scores given by D_OBJ1(X₁, X₂), D_OBJ2(X₁, X₂) and D_OBJ3(X₁, X₂). Similarity framework 130 may combine these distances into a single distance, e.g. similarity score which may be a weighted Euclidean distance which is an overall object similarity score 206, as follows:

$\begin{matrix} D_{Combined} (X_{1}, X_{2}) = \sqrt[2]{\frac{{D_{OBJ 1} (X_{1}, X_{2})}^{2} + {D_{OBJ 2} (X_{1}, X_{2})}^{2} + {D_{OBJ 3} (X_{1}, X_{2})}^{2}}{3}} & Equation (7) \end{matrix}$

Similarity framework 130, may determine similarity between both structured and unstructured objects. When objects X₁and X₂are structured objects, e.g. objects with features that may be found in a particular field in an object or quantified, similarity framework 130 may determine similarity score 206 as discussed above. When objects X₁and X₂are unstructured objects, e.g., objects with features that are qualitative, such as features included in objects that are text, images, etc., similarity framework 130 may first encode the features of unstructured objects X₁and X₂using encoder 204 into encodings. Ensemble of trees 202 may be trained on the encodings and use the encoding to determine similarity score 206.

In some embodiments, similarity framework 130 may determine similarity scores for multiple objects. FIG. 2B is a block diagram 200B of similarity framework 130, according to some embodiments. As illustrated in FIG. 2B, similarity framework 130 receives objects 140 as input. Objects X₁and X₂discussed in FIG. 2A may be two objects in objects 140. For every two objects in objects 140, similarity framework 130 determines object similarity score 206. Using object similarity scores 206, similarity framework 130 then builds similarity matrix 150. The similarity matrix 150 is a n×n matrix where n is an integer that corresponds to a number of objects 140. The elements of the similarity matrix 150 are similarity scores 206 for objects 140. For example, each element in the similarity matrix 150 corresponds to similarity score 206 for a pair of objects in objects 140.

FIG. 2C is a block diagram 200C of similarity framework 130 trained to determine an ensemble of trees, according to some embodiments. During a training phase, similarity framework 130 generates trees for inclusion into ensemble of trees 202. Similarity framework 130 identifies trees using one or more functions. The functions may be determined using a function estimation method based on a dataset (x, y)_i=1^Nthat comprises input variables x and y where x=(x₁, x₂, . . . , x_d) and corresponding labels are y. In some embodiments, the input variables may be features from different objects. The ensemble of trees 202 may be constructed to minimize the variance of the input features. The goal is to reconstruct the unknown functional dependence

$x \overset{f}{\to} y$

with an estimate {circumflex over (f)}(x), such that some specified loss function Ψ(y, f) is minimized, as follows:

$\begin{matrix} \hat{f} (x) = y & Equation (8) \end{matrix}$ $\begin{matrix} \hat{f} (x) = \arg \min_{f (x)} Ψ (y, f) & Equation (9) \end{matrix}$

The function estimation problem may be re-written in terms of expectations where an equivalent formulation would be to minimize the expected loss function over a response variable E_y(Ψ(y, f(x)), conditioned on the observed explanatory data x:

$\begin{matrix} \hat{f} (x) = \arg \min_{f (x)} E_{x} [E_{y} (Ψ (y, f (x)) ❘ x] & Equation (10) \end{matrix}$

The response variable y may come from different distributions. This leads to specification of different loss functions Ψ. In particular, if the response variable is binary, i.e., y ∈ {0,1}, the binomial loss function may be considered. If the response variable is continuous, i.e., y ∈ R, the L₂squared loss function or the robust regression Huber loss function may be used. For other response distributions, specific loss functions may be designed. To make the problem of function estimating tractable, the function search space may be restricted to a parametric family of functions f(x, θ). This may change the function optimization problem into the parameter estimation problem:

$\begin{matrix} \hat{f} (x) = f (x, \hat{θ}) & Equation (11) \end{matrix}$ $\begin{matrix} \hat{θ} = \arg \min_{θ} E_{x} [E_{y} (Ψ (y, f (x, θ)) ❘ x] & Equation (12) \end{matrix}$

Similarity framework 130 may use iterative numerical procedures to perform parameter estimation. In some embodiments, given M iteration steps, where M is an integer, the parameter estimates may be written in an incremental form as follows:

{circumflex over (θ)}=Σ_i=1^M{circumflex over (θ)}_i Equation (13)

In some embodiments, the steepest gradient descent may be used to estimate parameters. In the steepest gradient descent, given N data points (x, y)_i=1^N, the empirical loss function J(θ) is decreased over this observed data, as follows:

J(θ)=Σ_i=1^NΨ(y_i, f(x_i, {circumflex over (θ)}) Equation (14)

The steepest descent optimization procedure may be based on consecutive improvements along the direction of the gradient of the loss function ∇J(θ). As the parameter estimates {circumflex over (θ)} are presented in an incremental way, the estimate notation is distinguished. By the subscript index of the estimates {circumflex over (θ)}_t, the t-th incremental step of the estimate {circumflex over (θ)} is considered. The superscript {circumflex over (θ)}_tcorresponds to the collapsed estimate of the whole ensemble, i.e., sum of all the estimate increments from step 1 to step t. The steepest descent optimization procedure may be organized as follows.

- First, the parameter estimates {circumflex over (θ)}₀are initialized. Then steps two through five are repeated for each iteration t.
- Second, a compiled parameter estimate {circumflex over (θ)}^tis obtained from all of the previous iterations, as follows:

{circumflex over (θ)}^t=Σ_i=1^t−1{circumflex over (θ)}_i Equation (15)

- Third, the gradient of the loss function ∇J(θ) is evaluated, given the obtained parameter estimates of the ensemble:

$\begin{matrix} \nabla J (θ) = {\nabla J (θ_{i})} = {[\frac{\partial J (θ)}{\partial J (θ_{i})}]}_{θ = {\hat{θ}}^{t}} & Equation (16) \end{matrix}$

- Fourth, the new incremental parameter estimate {circumflex over (θ)}_tis determined as follows:

{circumflex over (θ)}_t←∇J(θ) Equation (17)

- Fifth, the new estimate {circumflex over (θ)}_tis added to the ensemble.

In some embodiments, similarity framework 130 may perform optimization that may occur in a function space. In this case, the function estimate {circumflex over (f)} is parameterized in the additive functional form:

{circumflex over (f)}(x)={circumflex over (f)}^M(x)=Σ_i=0^M{circumflex over (f)}_i(x) Equation (18)

where M is the number of iterations, {circumflex over (f)}₀is the initial guess and {{circumflex over (f)}_i}_i=1^Mare the function increments, also referred to as “boosts”.

In some embodiments, the parameterized “base-learner” functions h(x, θ) may be distinguished from the overall ensemble function estimates {circumflex over (f)}(x). Different families of base-learners functions such as decision trees or splines functions may be selected.

In a “greedy stagewise” approach for incrementing the function with the base-learners, the optimal step-size p may be specified at each iteration. For the function estimate at the t-th iteration, the optimization rule may be defined as follows:

$\begin{matrix} {\hat{f}}_{t} \leftarrow {\hat{f}}_{t - 1} + ρ_{t} h (x, θ_{t}) & Equation (19) \end{matrix}$ $\begin{matrix} (ρ_{t}, θ_{t}) = \arg \min_{ρ, θ} \sum_{i = 1}^{N} Ψ (y_{i}, {\hat{f}}_{t - 1}) + ρ h (x_{i}, θ) & Equation (20) \end{matrix}$

In some embodiments, similarity framework 130 may arbitrarily specify both the loss function Ψ(y, f) and the base-learner functions h(x, θ), on demand. In some embodiments, a new function h(x, θ_t) may be the most parallel to the negative gradient {g_t(x_i)}_i=1^Nalong the observed data:

$\begin{matrix} ℊ_{t} (x) = {E_{y} [\frac{\partial Ψ (y, f (x))}{\partial f (x)} ❘ x]}_{f (x) = {\hat{f}}^{t - 1} (x)} & Equation (21) \end{matrix}$

In this way, instead of looking for a general solution for the boost increment in the function space, the new function increment may be correlated with −g_t(x). This simplifies the optimization task with the least-squares minimization task:

$\begin{matrix} (ρ_{t}, θ_{t}) = \arg \min_{ρ, θ} \sum_{i = 1}^{N} {[- ℊ_{t} (x_{i}) + ρ h (x_{i}, θ)]}^{2} & Equation (22) \end{matrix}$

In some embodiments, the GBM algorithm may be trained using Python or another programming language known in the art. The loss function ψ(y, f) may be the L₂loss. The GBM algorithm may train trees on residual vectors or sign vectors.

In some embodiments, the base learner function h(x, θ) that may be used is a decision tree stump and may restrict the total number of leaf nodes to a configurable number, e.g., sixteen leaf nodes.

FIGS. 3A and 3B are diagrams 300A and 300B of an ensemble of trees, according to some embodiments. FIG. 3A illustrates ensemble of trees 202 that includes trees 302A, 302B, and 302C generated using similarity framework 130. Each one of trees 302A-C includes nodes associated with at least one property to which one or more input features of objects 140 may be compared. For illustrative purposes only, in tree 302A, a property associated with node 304A may be a>3, a property associated with node 304B may be b>150, a property associated with node 304C may be c>5 and a property associated with node 304D may be d>50. When objects 140, such as C1, C2, and C3 pass through trees 302A-C, the features or characteristics of objects C1-C3 are compared to the properties of tree nodes 304A-C until the objects C1-C3 reach the leaf nodes. Once the objects C1-C3 reach the leaf nodes, similarity framework 130 determines the similarity scores 206, such as a similarity score S_c₁₂₂between objects C1 and C2 and a similarity score S_c₁₂₃between objects C1 and C3. As discussed above, in some embodiments, the functions for determining a similarity score s may be as follows:

$\begin{matrix} similarity, s = \frac{number of trees with same leaf node}{total number of trees in the ensemble} & Equation (23) \end{matrix}$

As illustrated in FIG. 3A, objects C1 and C2 share a leaf node in tree 302A, and objects C1, C2, and C3 share a leaf node in tree 302C. In the case where similarity framework 130 assigns a score=1 when the objects 140 share a leaf node, and a score=0 when the objects do not share a leaf node, a similarity score

$S_{C_{1} C_{2}} = \frac{2}{3}$

and similarity score

$S_{C_{1} C_{3}} = \frac{1}{3} .$

FIG. 3B illustrates a similarity score that may be determined when objects 140 in different leaf nodes are given non-zero scores. Instead, the similarity score varies with the distances between leaf nodes, such that the similarity score increases when the leaf nodes with objects 140 are closer to each other and decreases when the leaf nodes are further away. FIG. 3B illustrates object A in a leaf node 304E, object B in a leaf node 304F, and object C in a leaf node 304G of tree 302A. Notably, the leaf nodes 304E and 304F that store respective object A and object B are closer to each other than the leaf node 304G that stores object C. Similarity score 206 for objects A, B, and C in trees 302A-C may be determined based on the depth of the deepest common node shared between the objects A and B, B and C, and A and C divided by the depth of the trees 302A-C. With respect to FIG. 3B, objects A and B share the common node 304C (depth=3) in tree 302A, with tree 302A having depth=4. Accordingly, the similarity score between objects A and B in tree 302A is

$\frac{3}{4},$

the similarity score between objects A and C in tree 302A is

$\frac{1}{4},$

and similarity score between objects B and C in tree 302A is

$\frac{1}{4} .$

FIGS. 4A-J are diagrams of a pseudo-code for training the similarity framework and generating a similarity matrix, according to some embodiments. Notably, FIGS. 4A-4E pertain to the training stage that trains similarity framework 130 to generate ensemble of trees 202 and FIGS. 4F-4J pertain to an inference stage that determines similarity matrix 150 that includes similarity scores 206 for pairs of objects in objects 140.

FIG. 4A illustrates an example pseudo-code 400A that generates ensemble of trees 202 in similarity framework 130 and similarity matrix 150, according to some embodiments. The functions illustrated in the pseudo-code 400A of FIG. 4A are further described in FIGS. 4B-4J.

FIG. 4B illustrates an example pseudo-code 400B that identifies features, according to some embodiments. In FIG. 4B, pseudo-code 400B determines multiple features or characteristics that are included in different objects 140. The features may depend on the object type. For unstructured data, the features may be transformed into structured data using an encoder which encodes features into feature encodings. Features may be words, word types, etc., for objects 140 that are documents. Features may be pixels, pixel colors, image colors, pixel positions, image types, etc. for objects 140 that are images. Features may be sectors, issues, coupons, industry sectors, ratings, spreads, yields, dates to maturity, ages, markets, countries, etc., for objects 140 that are securities or bonds. Notably, there may be other types of features for other types of objects 140. Features may also be static, dynamic, or engineered. Static features may remain the same over a period of time, while dynamic features may change over a period of time. Engineered features may be features that are created from static or dynamic features to be included in the training data.

FIG. 4C illustrates an example pseudo-code 400C that determines an optimal hyperparameter, according to some embodiments. In FIG. 4C, pseudo-code 400C determines trees using features identified in FIG. 4B and one or more hyperparameters. The hyperparameters may be randomly generated or included in a configuration file. Example hyperparameters may include learning rate, number of leaves in each tree, minimum data or features in each leaf, and a number of trees in ensemble of trees 202.

To determine an optimal hyperparameter, multiple trees may be generated for each hyperparameter and scored. For example, for each hyperparameter, the features may be divided into a training dataset and a validation dataset. The trees, including properties and values of the properties at each node, may be generated with the GBM algorithm using the hyperparameter and the features in the training dataset. The trees may be validated with the features in the validation dataset that validates that objects in the dataset meet a particular objective. The trees may also be scored. After the trees based on the hyperparameter are generated, the hyperparameter may be scored by averaging the scores from the trees. An optimal hyperparameter may be determined using an “argmin” function, or another function, based on the scores associated with different hyperparameters. The “argmin” function, for example, identifies the hyperparameter associated with the lowest hyperparameter score from the scored hyperparameters. The lowest hyper-parameter score simulates the minimal loss discussed above.

As illustrated in FIG. 4C, the trees may be determined using a GBM algorithm. Trees may also be determined using another algorithm. As discussed above, the GBM algorithm may learn a base learner function and a loss function, which may correlate to a negative gradient to minimize loss through multiple iterations. The hyperparameter may control the step of the iterations during which the base function is learned.

The trees associated with different hyperparameters may include, but not limited to, anywhere from five to three-hundred trees and may have tree depth anywhere from five to sixteen nodes. During training, the features and the properties of the features that are associated with in each node are also determined.

FIG. 4D illustrates an example pseudo-code 400D that generates an ensemble of trees. In FIG. 4D, pseudo-code 400D determines an ensemble of trees 202 using features determined in FIG. 4B and the optimal hyperparameter determined in FIG. 4C. Similar to FIG. 4C, the ensemble of trees 202 may be determined using the GBM algorithm.

FIG. 4E illustrates an example pseudo-code 400E that determines an importance of a tree in the ensemble of trees, according to some embodiments. As discussed above, pseudo-code 400D determines ensemble of trees 202. Pseudo-code 400E determines an importance of each tree in ensemble of trees 202. For example, as each tree is added to the ensemble of trees 202, pseudo-code 400E determines the importance of the newly added tree. The importance of the newly added tree may be determined by calculating a performance of the ensemble of trees 202 before and after adding the new tree to the ensemble of trees 202, and determining by how much an error generated by the ensemble of trees 202 was reduced by adding the new tree.

FIG. 4F illustrates an example pseudo-code 400F that determines a similarity matrix, according to some embodiments. In FIG. 4F, the pseudo-code 400F receives objects 140 and generates similarity matrix 150 that includes similarity scores 206. As discussed above, each object is passed through each tree in ensemble of trees 202 where the features of the object are compared to the one or more properties of each node in the tree, until the object reaches a leaf node. Pseudo-code 400F then determines the similarity score 206 between the object and the other objects, by first determining a similarity score associated with each tree and then computing the object similarity score 206 by combining the similarity scores from the individual trees. The similarity score for each tree may be a distance between the objects in the leaf nodes of the corresponding tree, as discussed above. If the objects are in the same node, the similarity score may be one, otherwise the similarity score may be zero. Pseudo-code 400F also illustrates that the similarity score for an individual tree may be adjusted by the importance of the tree in the ensemble of trees 202. To compute the similarity score 206 for all trees, pseudo-code 400F combines the scores from the individual trees. The similarity score 206 for all trees is included into similarity matrix 150.

FIGS. 4G, 4H and 4I illustrate example pseudo-codes 400G, 400H and 400I for traversing the trees in ensemble of trees to determine leaf nodes, the distance between the leaf nodes, and the depth of the trees, according to some embodiments.

FIG. 4J illustrates an example pseudo-code 400J that determines a similarity matrix, according to some embodiments. In FIG. 4J, the pseudo-code 400J receives objects 140 and generates similarity matrix 150 that includes similarity scores 206. Unlike pseudo-code 400F, pseudo-code 400J generates similarity scores based on the distance between objects in a single tree, such that the distance is a measure of the depth of the deepest common node shared between the objects divided by the depth of the tree. The similarity scores from each tree may then be adjusted to accommodate for the importance of the tree and then combined to generate similarity score 206. The deepest common node shared between the objects and the depth of the tree may be determined using pseudo-code 400G, 400H and 400I.

FIG. 5 is a simplified diagram of a method 500 for determining an ensemble of trees in a similarity framework, according to some embodiments. One or more of the processes 502-506 of method 500 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 502-506.

At process 502, features are determined. For example, similarity framework 130 is trained on input data, which may be a training dataset of features. The features may be specific to objects of a particular type and may be extracted from an object. Features may be static, dynamic, or engineered. Static features may be features that do not change with time over a period of time. Dynamic feature may be features that change over a period of time. Engineered features may be created using static and dynamic features. In some embodiments, when objects include unstructured data, static, dynamic, and engineered features may be encoded into structured features using encoder 204.

At process 504, an ensemble of trees is generated. For example, similarity framework 130 may generate an ensemble of trees 202 using features and the GBM algorithm. The trees in the ensemble of trees 202 may be constructed to minimize a variance the features. Specifically, the similarity framework 130 constructs and reconstructs trees using a base function that receives features as input and generates labels, such that the function loss during the reconstruction is minimized Each tree in the ensemble of trees 202 may include one or more properties at each node of each tree with exception of the leaf nodes. As illustrated in FIGS. 3A and 3B, the nodes may include one or more properties and corresponding values of the one or more properties that may be compared against features of the objects 140.

At process 506, tree importance for each tree in the ensemble of trees is determined. For example, similarity framework 130 may determine an importance of each tree in the ensemble of trees 202 by determining the accuracy of the ensemble of trees 202 before and after each tree is added to ensemble of trees 202. The tree importance may correspond to how important the tree is to determining similarity between objects 140. The measure of the importance may be a weight having a value between zero and one.

Once method 500 completes, the similarity framework 130 has generated ensemble of trees 202 and determined the measure of importance of each tree in ensemble of trees 202. At this point, similarity framework 130 may enter an inference stage where the similarity framework 130 determines similarity between objects 140.

FIG. 6 is a simplified diagram of a method 600 for determining similarity between objects, according to some embodiments. One or more of the processes 602-608 of method 600 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 602-608.

At process 602, objects are received. For example, similarity framework 130 receives objects 140. The objects 140 may be the same type of objects that were used to train similarity framework 130 to generate the ensemble of trees 202.

At process 604, objects are propagated through trees in the ensemble of trees. For example, similarity framework 130 propagates objects 140 received in process 602 through each tree in ensemble of trees 202 until objects 140 reach the leaf nodes of the trees. Typically, each object 140 may be propagated through each tree in the ensemble of trees 202. As the objects are propagated, the similarity framework 130 compares the features of the object 140 to properties of the nodes of the tree in the object's path to the leaf node.

At process 606, a similarity score for pairs of objects is determined. For example, similarity framework 130 may determine a similarity score 206 for every two objects. First, similarity framework determines a similarity score for pairs of objects in each tree in ensemble of trees 202. In one instance, the similarity score for a pair of objects in the same tree may be one if the objects share the same leaf node and zero otherwise. In another instance, the similarity score may be a measure of a distance between leaf node(s) of the tree that store the pair of objects. The similarity score for the pair of objects in each tree may be determined based on the tree distance and the tree height. For example, the similarity score may be a measure of the distance from the root node to the last node that the two objects share that is divided by the depth of the tree. In some embodiments, the similarity score is further adjusted based on the tree importance. The object similarity score 206 may then be determined by combining the similarity scores for the pair of objects from each tree in the ensemble of trees 202. Process 606 repeats until similarity framework 130 determines similarity scores 206 for all pairs of objects in objects 140.

At process 608, a similarity matrix is generated. For example, the similarity score 206 for all pairs of objects determined in process 606 is stored in the similarity matrix 150.

Going back to FIG. 1, similarity framework 130 determines similarity matrix 150. The similarity scores between objects in similarity matrix 150 may identify similarity between objects 140. As discussed above, objects 140 may include structured or unstructured data. If objects 140 include unstructured data, similarity framework 130 may pass the objects through an encoder that may encode the unstructured data into encodings which are structured data. In this way, similarity framework 130 may identify similarities between objects 140 that store unstructured data, such as similar images, similar articles, similar documents, etc., as well as structured data, such as securities or transactions.

Objects 140 may be transactions. When objects 140 are transactions, similarity framework 130 may be trained on the training data that includes transaction features for a predefined objective. Once trained, similarity framework 130 may identify, e.g., for a fraud objective, similar and different transactions based on the transaction features. The transactions or a cluster of transactions that similarity framework 130 determines as different, may be considered outlier. An outlier transaction is a transaction that has different features from other transactions, or a transaction that is not included in the training dataset. Outlier transactions may be indicative of fraud. In another example, outlier transactions may be indicative of data errors elsewhere in a transaction processing system. For example, suppose similarity framework 130 is trained on a training dataset that includes previous transactions that passed through a transaction system and that are known to be genuine or include valid data. During an inference stage, similarity framework 130 may identifier outlier transactions that are indicative of transactions that are different from transactions that have previously passed through the transaction system and were included in the training data. Different transactions may be transactions that have similarity score 150 below a similarity threshold for one or more entries in similarity matrix 150. These transactions may be indicative of fraudulent transactions or transactions that included erroneous data.

In some embodiments, similarity framework 130 may quantify uncertainty in the data. For example, similarity framework 130 may be trained on a training dataset that includes objects. Once trained, similarity framework 130 may receive objects and determine how similar an object in objects 140 is to the data in the training dataset. An object that is not similar may be considered an outlier or be out of training distribution. In some instances, similarity framework 130 may also include a classifier. The classifier may indicate how similar object 140 may be to data in the training dataset.

In some embodiments, similarity framework 130 may identify securities with similar characteristics. FIG. 7 is a block diagram 700 of a similarity framework for identifying securities with similar characteristics, according to some embodiments. Similarity framework 130 includes ensemble of trees 202 trained to identify similar securities. As illustrated in FIG. 7, similarity framework 130 receives securities 740 and uses ensemble of trees 202 to generate similarity matrix 150 from which similar securities may be identified. In some instances, securities 740 may be structured data. As discussed above, ensemble of trees 202 may be generated using supervised machine learning, such as gradient boosting machines (GBM) algorithm. Similarity matrix 150 may include similarity scores for or between every two securities in securities 740.

In some embodiments, trees in ensemble of trees 202 may be trained using various similarity objectives. Example similarity objectives may include option adjusted spreads (OAS), yield, BAS, and various yield returns, such as 1-week, 2-week, 1-month, 3-month, and 6-month yield returns. There may be one tree in ensemble of trees 202 for each objective or each tree may model multiple similarity objectives, such as a combination of OAS and yield, a combination of OAS, yield, and BAS, or a combination of multiple yield returns. In some embodiments, the trees 202 may include a combination of objectives. An example combination may be 50% OAS and 50% yield. Notably, the combination of different objectives may be flexible and trees 202 may be trained and retrained using different objectives or a combination of objectives.

As discussed above, each tree in ensemble of tree 202 may be trained using various features. To determine similarity between securities 740, similarity framework 130 may train trees in ensemble of trees 202 using features, such as ticker (e.g. an issue of the bond), dtm (days to mature), an industry group that the security belongs to, rating of the security, age of the security, market where the security is originally issued, and/or currency. Notably, the list of features above is not limiting and each tree may be trained on a combination of one or more features.

Similarity matrix 150 may be an n by n matrix that has securities 740 (or a numerical representation of securities 740) as rows and columns. Each entry in similarity matrix 150 identifies a similarity between a pair of securities in securities 740. The similarity entry may have a value between zero and one. A value of one may indicate that the pair of securities are perfectly similar, a value between 0.8 and 1.00 may indicate that the pair of securities are strongly similar, a value between 0.6 and 0.8 may indicate that the pair of securities are moderately similar, a value between 0.8 and 0.6 may indicate that the pair of securities are fairly similar, a value between 0.0 and 0.3 may indicate that the pair of securities are poorly similar, and a value that is zero may indicate that the pair of securities are not similar.

In some embodiments, similarity matrix 150 may be used to generate clusters of similar securities. As illustrated in FIG. 7, a clustering module 704 may receive similarity matrix 150 and generate clusters 706. Each cluster in clusters 706 may include similar securities. In some instances, clustering module 704 may use a hierarchical clustering algorithm that generates a predefined number of clusters 706, or another clustering algorithm.

Identifying similar securities has numerous benefits. For example, if one security is a bond that cannot be sourced or obtained from a bond market, similarity framework 130 may identify a second security that is a bond with similar characteristics to the first one, but that can be sourced from a bond market. In another example, when a trading portfolio is constructed, similarity framework 130 may identify securities that are similar to securities in the portfolio that that have more desirable properties, such as better tradability and pricing than the securities in the original portfolio. Better tradability may be considered to be greater liquidity of the security, and better pricing may be considered to be a better bid or offer price for the security. The securities identified using similarity framework 130 may then replace the similar securities in the portfolio that cannot be easily sourced or traded. An example security may belong to any asset class, such as equities, mortgage securities, municipal bonds, etc. The similar securities may be selected from similarity matrix 150 or from clusters 706. Similar securities may be securities that have a similarity score above a predefined similarity threshold as determined by similarity framework 130. For example, based on one of more of the factors, a higher similarity score (e.g., greater than 0.8 out of 1.0) may be needed to replace a security with a similar security.

Notably, although examples herein discuss similarity between securities, the embodiments may also be directed to other instruments, such as fund(s), exchange traded securities (EFTs), bonds, etc., and/or a combination of instruments, such as a combination that includes one or more of EFTs, portfolios, single securities and/or funds.

FIG. 8 is a block diagram 800 illustrating a similarity framework identifying substitute securities, according to some embodiments. As illustrated in FIG. 8, a pre-trade analytics engine or module 804 receives an investment 802. Investment 802 may be a monetary amount that may be invested according to an investment strategy. The pre-trade analytics engine 804 may use one or more analytics algorithms to determine an investment strategy. The investment strategy may include one or more securities that may be purchased using investment 802. Alternatively, the investment strategy for investment 802 may be set using an investment manager, and pre-trade analytics engine 804 may optimize the investment strategy as discussed below. The securities may be bought or sold individually or as part of a portfolio. The securities may be liquid securities 806 and low liquidity securities 808. Liquid securities 806 may be securities having high liquidity, such as a liquidity above a high liquidity threshold. Liquid securities 806 may be easily bought or sold using a trading system 810. A high liquidity may be liquidity above a predefined liquidity threshold. Low liquidity securities 808 may be securities that are illiquid or may not be easily bought or sold using a trading system 810. Low liquidity securities may be securities have liquidity below a predefined low liquidity threshold.

In some instances, to determine liquidity of securities, each security may be scored using a liquidity algorithm. Notably, the liquidity algorithm may be instrument specific, and the liquidity algorithm used to determine the liquidity of securities may be different from the liquidity algorithm used to determine liquidity of funds, EFTs or other types of instruments. The liquidity algorithm may assign a liquidity score to a security that identifies the security as a liquid or a low liquid security. For example, the score may be between 0 and 100, in some embodiments, where the score that is greater than e.g., 70, identifies a security as liquid security 806 and a score that is less than e.g., 30, identifies a security as low liquid security 808. To determine a score for the security, the liquidity algorithm may use one or more rules which are associated with a score. Example rule may be whether a security may or may not be traded, whether security may be traded, but a trade is more expensive as compared to other securities, or whether a buyer pays a premium for liquidity of a security. The liquidity score may be a combination of scores from the one or more rules.

In some embodiments, the low liquidity securities may be passed to similarity framework 130. Similarity framework 130 may be trained on various securities accessible to a trading system 810. Similarity framework 130 may store similarity matrix 150, described above, that has already been generated. Similarity framework 130 may generate similarity matrix 150 at predefined intervals, such as every week, bi-monthly, monthly, etc., on demand, or at specific market conditions. In some embodiments, similarity framework 130 may use similarity matrix 150 to identify liquid securities 806A that are similar (have similarity score that is greater than a predefined similarity threshold) to low liquidity securities 808. Liquid securities 806A may be more liquid than low liquidity securities 808, but have the same or similar objectives and/or the same or similar purchase or sale price, in some embodiments. As discussed above, low liquidity securities 808 may include illiquid bonds that are difficult to trade, while liquid securities 806A may be liquid bonds that are easier to trade, but with the same or similar objective as liquid securities 806A.

Trading system 810 may receive, purchase, and/or sell easily tradeable securities 806 and 806A. Trading system 810 may trade liquid securities 806 and 806A internally, using trading systems, including electronic trading systems of other vendors, financial institutions, broker/dealers, or using external exchanges. In some embodiments, the trading system 810 trading liquid securities 806 and 806A, rather than liquid securities 806 and low liquidity securities 808 may increase execution of orders requested using pre-trade analytics module 804. For example, suppose pre-trading analytics module 804 generates an order that includes securities 806 and 808. Trading system 810 may execute 60% of the order using various exchanges. However, when similarity framework 130 replaces securities 808 in the order with securities 808A, trading system 810 may be able to execute 80% of the order. In another example, investing investment 802 using securities 806 and 806A rather than securities 806 and 808 may allow for an improved transaction cost while maintaining a similar risk for investment 802.

In the embodiments discussed above, the pre-trade analytics module 804 may generate or optimize an investment strategy that may include liquid securities 806 and illiquid securities 808, which may then be substituted using similarity framework 130. Notably, those embodiments are exemplary only, and the embodiments discussed above may also apply to generating and/or optimizing portfolios or portions of portfolios that include multiple securities 806 and 808, where the some or all securities 808 may be substituted for securities 806A. The embodiments may also be applied to optimizing funds or EFTs. Further, although embodiments above describe substituting securities 808 with securities 806A based on liquidity, the embodiments may also apply to substituting securities 808 with similar securities based on a different target. An example target may be an improved transaction cost as compared to securities 808, decreased risk as compared to securities 808, etc.

FIG. 9 is a block diagram 900 illustrating similarity framework generating and pricing clusters, according to some embodiments. In FIG. 9, similarity framework 130 may be trained on input features 910 and one or more objectives 912. Input features 910 may be features included in various securities 940. A non-exhaustive example of input features 910 may be features that define a security, such as, an industry, coupon, ticker, coupon type, coupon frequency, country, market, flag 144a maturity, age, rating, etc. An example objective 912 may be a spread between two or more securities, at various times. Different objectives 912 may be spreads covering different time intervals.

Using a machine learning algorithm, such as a GBM algorithm, input features 910, and objective(s) 912, similarity framework 130 may be trained to generate ensemble of trees 202. Ensemble of trees 202 may include trees 914A, 914B, 914C, and through 914N. The number of trees 914A-N may be determined using a hyperparameter. Each one of trees 914A-N in ensemble of trees 202 may be trained on different parameters for objective(s) 912, such as different change in spread. For example, tree 914A may be trained on a spread range at time T1, tree 914B may be trained on a spread range at time T2, etc. In some embodiments, similarity framework 130 may generate ensemble of trees 202 as discussed in FIGS. 4A-4E.

Once similarity framework 130 generates ensemble of trees 202, securities 940 may be passed through each tree 914A-N in ensemble of trees 202. Similarity framework 130 may score pairs of securities in securities 940 by determining a similarity score between the two securities in each tree in trees 914A-N. For example, similarity framework 130 may determine a score between a pair of securities for tree 914A, tree 914B, etc. Next, similarity framework 130 may combine the scores for the pair of securities from all trees 914A-N in ensemble of trees 202 into a combined security similarity score and store the combined similarity score in similarity matrix 150. The similarity matrix 150 may store combined scores for different pairs of securities in securities 940. Different ways to determine the security similarity scores are described with reference to FIGS. 3A-3B and 4F-4J.

A clustering module 904 may use a clustering algorithm to generate clusters 916A-M of similar securities from the similarity matrix 150. The clusters 916A-M may be configured within clustering module 904. Clusters 916A-M may also include liquid securities, such as securities 806 and exclude low liquidity securities, such as securities 808, in some embodiments. A price prediction module 918 may predict a cluster pricing from each cluster of securities in clusters 916A-M, such that each cluster of securities in clusters 916A-M may be traded at the predicted cluster price 920A-M.

FIG. 10 is a simplified diagram of a method 1000 for substituting a low liquidity security with a similar security, according to some embodiments. One or more of the processes 1002-1010 of method 1000 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 1002-1010.

At process 1002, an ensemble of trees in a similarity framework is determined. For example, similarity framework 130 receives input features 910 from securities 940 and one or more trading objectives 912. Based on the input features 910 and training objective(s) 912, similarity framework 130 uses a machine learning algorithm, such as a GBM algorithm, to generate ensemble of trees 202, where each node in each tree stores properties and values that correspond to the properties.

At process 1004, a similarity matrix is determined. For example, similarity framework 130 propagates securities 940 through each one of trees 914A-N in ensemble of trees 202 until securities 940 reach the leaf nodes of the trees 914A-N. Typically, each one of securities 940 may be propagated through each tree in the ensemble of trees 202. As the securities 940 are propagated, the similarity framework 130 compares the features of the security to a corresponding property and the property's value stored in a node of the tree in the security's path. A score for pairs of securities in each tree is determined based on a distance between the leaf node(s) that store the securities in each tree 914A-N. A combined distance for each pair of securities is determined by combining the scores for the each pair of securities from individual trees 914A-N in ensemble of trees 202 into a security similarity score. In some instances, the scores from each one of trees 914A-N may be adjusted by an importance of the corresponding tree as compared to other trees in ensemble of trees 202. Similarity matrix 150 is generated using the security similarity scores from every two securities.

At process 1006, a trading strategy is generated. For example, a pre-trade analytics module 804 may receive investment 802, which may be monetary investment. Using investment 802, pre-trade analytics module 804 may generate an investment strategy that includes one or more securities or a basket of securities. The securities in the investment strategy may be liquid securities 806, low liquidity securities 808, or a combination of both or any other type of security.

At process 1008, low liquid securities are substituted with liquid securities. For example, similarity framework 130 may use similarity matrix 150 to identity liquid securities 806A that are similar, that is, may have the same price and objective as low liquid securities 808, but have higher liquidity. The liquid securities 806A are substituted for low liquidity securities 808 in the investment strategy.

At process 1010, the investment strategy is executed. For example, trading system 810 may execute the investment strategy using liquid securities 806 and liquid securities 806A.

FIG. 11 is a simplified diagram of a method 1100 for determining clusters of similar securities, according to some embodiments. One or more of the processes 1102-1108 of method 1100 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 1102-1108.

At process 1102, an ensemble of trees in a similarity framework is determined. For example, similarity framework 130 receives input features 910 from securities 940 and one or more trading objective(s) 912. Based on the input features 910 and training objective(s) 912, similarity framework uses a machine learning algorithm, such as a GBM algorithm, to generate ensemble of trees 202, where each node in each tree stores values of at least one input features in input feature 910 that direct a path or a security as the security is propagated through the tree. In some embodiments, leaf nodes may not store the values of at least one input features.

At process 1104, a similarity matrix is determined. For example, similarity framework 130 propagates securities 940 through each one of trees 914A-N in ensemble of trees 202 until securities 940 reach the leaf nodes of the trees 914A-N. Typically, each security in securities 940 may be propagated through each one of trees 914A-N in the ensemble of trees 202. As the securities 940 are propagated, the similarity framework 130 compares the features of the securities 940 to corresponding properties and values of properties stored at each node. A score for the pairs of securities at the leaf nodes in each tree is determined based on the distance between the one or more leaf nodes that store the two securities in each tree. A security similarity score for the pairs of securities is determined by combining the individual scores for the every two securities from individual trees 914A-N in ensemble of trees 202. In some instances, the scores from each one of trees 914A-N may be adjusted by an importance of the trees. Similarity matrix 150 is generated using the security similarity scores from every two securities in securities 940.

At process 1106, clusters of similar securities are determined. For example, clustering module 904 may determine clusters 916A-M from similarity matrix 150. In some embodiments, the low liquidity securities, e.g., securities 808 may be excluded from clusters 916A-M, such that clusters 916A-M may include similar securities that are liquid securities, e.g., securities 806.

At process 1108, clusters of similar securities are priced. For example, price prediction module 918 may determine a cluster price 920A-M for clusters 916A-M.

Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of methods 500, 600, 1000, and 1100. Some common forms of machine readable media that may include the processes of methods 500, 600, 1000, and 1100 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.

This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.

In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.

Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.

Claims

1. A method for determining similarity, the method comprising:

generating, using a machine learning similarity framework, an ensemble of trees using features corresponding to first securities and at least one objective; and

determining a similarity matrix, the similarity matrix storing security similarity scores indicating similarities between pairs of securities in second securities, wherein determining the similarity matrix comprises: for each tree in the ensemble of trees: propagating the second securities through the each tree until the second securities reach leaf nodes of the each tree, wherein the propagating compares at least one feature associated with each second security in the second securities to at least one property of at least one node associated with the each tree; and determining a similarity score for each pair of securities by calculating a distance between the each pair of securities in the each tree; for the each pair of securities, combining similarity scores from the each tree in the ensemble of trees into a security similarity score; and storing the security similarity scores from the pairs of securities in the similarity matrix.

2. The method of claim 1, further comprising:

generating a strategy, the strategy including at least one low liquid security, wherein the low liquid security has liquidity below a low liquidity threshold;

determining, by comparing security similarity scores between the at least one low liquidity security and the second securities in the similarity matrix, at least one liquid security, wherein the security similarity scores between the at least one liquid security and the at least one low liquid security are above a similarity threshold; and

substituting the at least one low liquidity security with the at least one liquid security in the strategy.

3. The method of claim 1, further comprising:

generating, using the similarity matrix, a cluster of similar securities, wherein the similar securities in the cluster have security similarity scores above a similarity threshold.

4. The method of claim 3, further comprising:

determining liquid securities in the cluster; and

determining at least one cluster price for the cluster based on the liquid securities.

5. The method of claim 1, further comprising:

generating a portfolio of securities, the portfolio including at least one low liquid security, wherein the low liquid security has liquidity below a liquidity threshold;

identifying, using the similarity matrix, a second security in the second securities having a liquidity greater than the low liquid security, wherein the second security and the low liquid security have a security similarity score above a similarity threshold; and

substituting the low liquid security in the portfolio with the second security.

6. The method of claim 1, wherein determining the similarity score for the each pair of securities further comprises:

assigning the similarity score of one when the each pair of securities are associated with a same leaf node in the leaf nodes of the each tree; or

assigning the similarity score of zero when the each pair of securities is associated with different leaf nodes in the leaf nodes of the each tree.

7. The method of claim 1, wherein determining the similarity score for the each pair of securities further comprises:

determining a deepest common node between the each pair of securities in the each tree; and

determining a depth of the each tree, wherein determining the similarity score is further based on the deepest common node and the depth of the each tree.

8. The method of claim 1, wherein determining the similarity score for the each pair of securities further comprises:

determining a tree importance weight for the each tree in the ensemble of trees; and

adjusting the similarity score for the each pair of securities by the tree importance weight.

9. The method of claim 1, further comprising:

generating a tree to be included in the ensemble of trees;

adding the tree in the ensemble of trees; and

determining a tree importance weight of the tree in the ensemble of trees by: calculating an error of the ensemble of trees before and after adding the tree to the ensemble of trees, wherein determining the tree importance weight is further based on the error.

10. The method of claim 1, wherein the machine learning similarity framework uses a base function, a loss function, and at least one hyperparameter to generate the ensemble of trees.

11. The method of claim 10, further comprising:

determining, using the machine learning similarity framework, a steepest gradient descent of the loss function; and

estimating, based at least in part on the steepest gradient descent, the at least one property of at least one node in the each tree.

12. A system comprising:

a memory configured to store instructions for a machine learning similarity framework;

a processor coupled to the memory and configured to read the instructions from the memory to cause the system to perform operations, the operations comprising: generating, using the machine learning similarity framework, an ensemble of trees using features corresponding to first securities, an objective, a base function, and a loss function; and determining a similarity matrix, the similarity matrix storing security similarity scores indicating similarities between pairs of securities in second securities, wherein determining the similarity matrix comprises: for each tree in the ensemble of trees: propagating the second securities through the each tree until the second securities reach leaf nodes of the each tree, wherein the propagating compares at least one feature associated with each second security in the second securities to at least one property of at least one node associated with the each tree; and determining a similarity score for each pair of securities by calculating a distance between at least one leaf node storing the each pair of securities in the each tree; for the each pair of securities, combining the similarity score from the each tree in the ensemble of trees into a security similarity score; and storing the security similarity scores for the pairs of securities in the similarity matrix.

13. The system of claim 12, wherein the operations further comprise:

generating a strategy, the strategy including at least one low liquidity security;

determining, by comparing the at least one low liquid security to liquid securities in the similarity matrix, at least one liquid security having a security similarity score with the at least one low liquid security above a predefined threshold; and

substituting the at least one low liquid security with the at least one second liquid security in the strategy.

14. The system of claim 12, wherein the operations further comprise:

generating, using the similarity matrix, a cluster of similar securities, securities in the cluster having security similarity scores above a similarity threshold;

determining liquid securities in the cluster, wherein the liquid securities have a liquidity above a liquidity threshold; and

determining at least one cluster price for the cluster based on the liquid securities.

15. The system of claim 12, wherein the operations further comprise:

generating a portfolio of securities, the portfolio including at least one low liquid security;

identifying, using the similarity matrix, a second security that has a liquidity greater than the low liquid security, wherein the second security and the low liquid security have a security similarity score above a threshold; and

substituting the low liquid security in the portfolio with the second security.

16. The system of claim 12, wherein the operations further comprise:

determining a deepest common node in the each tree between the each pair of securities; and

determining a depth of the each tree, wherein determining the similarity score is further based on the deepest common node and the depth of the tree.

17. The system of claim 12, wherein the operations further comprise:

determining a tree importance weight for the each tree in the ensemble of trees; and

adjusting the similarity score for the each pair of securities by the tree importance weight.

18. The system of claim 12, wherein the operations further comprise:

generating a tree to be included in the ensemble of trees;

adding the tree in the ensemble of trees; and

determining a tree importance weight of the tree in the ensemble of trees by: calculating an error of the ensemble of trees before and after adding the tree to the ensemble of trees; wherein determining the tree importance weight is further based on the error.

19. A non-transitory computer-readable medium having instructions thereon, that when executed by a processor, cause the processor to perform operations for determining similarity, the operations comprising:

generating, using a machine learning similarity framework, an ensemble of trees using features corresponding to first securities and a hyperparameter; and

determining a similarity matrix, the similarity matrix storing security similarity scores indicating similarities between pairs of securities in second securities, wherein determining the similarity matrix comprises: for each tree in the ensemble of trees: propagating the second securities through the each tree until the second securities reach leaf nodes of the each tree, wherein the propagating compares at least one feature associated with each second security in the second securities to at least one property of at least one node associated with the each tree; and determining a similarity score for each pair of securities by calculating a distance between the each pair of securities in the each tree; for the each pair the securities, combining the similarity score from the each tree in the ensemble of trees into a security similarity score; and storing the security similarity scores for the pairs of securities in the similarity matrix.

20. The non-transitory computer-readable medium of claim 19, wherein rows and columns of the similarity matrix identify the second securities, and entries in the similarity matrix identify the security similarity scores associated with the pairs of securities.