MACHINE LEARNING ENGINE FOR DETERMINING DATA SIMILARITY
A system and method for training and using a machine-learning similarity framework are provided. During training, the similarity framework generates an ensemble of trees. The trees have different properties at each node. The similarity framework uses the ensemble of trees to determine similarity between objects. The objects are propagated through nodes of each tree in the ensemble of trees until the objects reach leaf nodes. The objects are propagated by comparing the properties at each node of the tree to the features of the objects until the objects reach the leaf nodes. The similarity framework determines a similarity score for a pair of objects in each tree and adjusts the similarity score by tree importance. The object similarity score is determined by combining the similarity scores from multiple trees in the ensemble of trees. The similarity framework generates a similarity matrix that stores object similarity scores for multiple pairs of objects.
This application claims priority under 35 U.S.C. 119 to U.S. Provisional Application No. 63/256,129, filed Oct. 15, 2021, which is hereby incorporated by reference herein in its entirety.
TECHNICAL FIELD
The embodiments are directed to machine learning, and more particularly to a machine learning system for identifying object similarity.
BACKGROUND
Conventionally, similarity between two objects is determined using unsupervised learning techniques. These conventional techniques identify features of the objects, transform the features into a high-dimensional feature space, and use a clustering or K-nearest-neighbors algorithm to identify similarity of the objects.
In one or more implementations, not all of the depicted components in each figure may be required, and one or more implementations may include additional components not shown in a figure. Variations in the arrangement and type of the components may be made without departing from the scope of the subject disclosure. Additional components, different components, or fewer components may be utilized within the scope of the subject disclosure.
DETAILED DESCRIPTION
A similarity framework can be used to identify relationships between objects and evaluate the strength of those relationships. The output of the similarity framework may be a similarity matrix. The similarity matrix is a symmetric n×n matrix whose rows and columns represent objects. An element of the matrix is a similarity score between the two objects identified by the element's row and column. The similarity score identifies the strength of the relationship between the two objects.
The similarity framework, such as the one described in the embodiments below, may be used to identify similarity between different types of objects. When objects are images, and the similarity framework identifies similar images, one image may replace another to be used in, e.g., image recognition systems. When objects are articles, similar articles may be identified to determine current trends. When objects are documents, similar documents may identify plagiarism. When objects are transactions, similar or dissimilar transactions may identify fraud. The similarity framework may also be used in various natural language processing tasks including text summarization, translation, etc. The similarity framework may be used to identify similar securities and substitute a security of interest with another security with similar characteristics. This has applications in trading and liquidity when, for example, a bond cannot be sourced from the market or, in another example, in portfolio construction where one or more securities may be replaced with other securities that are mostly similar but with more desirable properties or characteristics.
The similarity framework may include a supervised machine learning algorithm, such as a Gradient Boosting Machines (GBM) algorithm. The GBM algorithm may train an ensemble of decision trees using a training dataset that includes features of different objects. Once the similarity framework is trained, the similarity framework receives objects. The objects are propagated through each tree in the ensemble of decision trees until the objects reach the leaf nodes. The GBM algorithm may compute the leaf node of every tree in the ensemble that corresponds to the object. Thereafter, the similarity between two objects is defined as the percentage of trees in the ensemble where the two objects fall into the same leaf node. For example, the similarity framework may assign a similarity score of one when two objects share the same leaf node; otherwise, the similarity framework may assign a score of zero. In another example, instead of assigning a score that is zero or one, the score between the two objects in the same tree may vary from zero to one based on the depth of the deepest node in the tree that the objects share and the depth of the tree. This means that if the two objects share a leaf node, the score may be one; if the objects split at the root, the score may be zero; and if the objects split elsewhere in the tree, the score may be a number between zero and one, computed as d_c/d, where d_c is the depth of the deepest node that the objects have in common and d is the depth of the entire tree.
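The depth-based scoring rule described above can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the dictionary-based tree representation and the helper functions are assumptions made for the example.

```python
def path_to_leaf(tree, obj):
    """Return the list of node ids visited by obj, root first.

    `tree` is a dict: node id -> (feature, threshold, left, right);
    leaf nodes are stored as ("leaf", None, None, None).
    """
    path, node = [], 0
    while True:
        path.append(node)
        feature, threshold, left, right = tree[node]
        if feature == "leaf":
            return path
        node = left if obj[feature] <= threshold else right

def per_tree_similarity(tree, obj_a, obj_b, tree_depth):
    """Score = d_c / d: depth of the deepest shared node over the tree depth."""
    path_a, path_b = path_to_leaf(tree, obj_a), path_to_leaf(tree, obj_b)
    shared = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        shared += 1
    d_c = shared - 1  # depth of the deepest common node (the root has depth 0)
    return d_c / tree_depth

# A depth-2 toy tree that splits on feature "x" at the root and "y" below it.
tree = {
    0: ("x", 0.5, 1, 2),
    1: ("y", 0.5, 3, 4),
    2: ("leaf", None, None, None),
    3: ("leaf", None, None, None),
    4: ("leaf", None, None, None),
}
# The two objects agree at the root split but separate at node 1: score 1/2.
print(per_tree_similarity(tree, {"x": 0.1, "y": 0.2}, {"x": 0.2, "y": 0.9}, 2))  # 0.5
```

Objects sharing a leaf score 1.0, objects splitting at the root score 0.0, matching the two boundary cases described above.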
In some embodiments, the similarity framework may assign different weights to the scores from different trees. The weights may be assigned based on the importance of the tree in the ensemble of trees compared to other trees in the ensemble of trees. The weight associated with each tree may be based on a reduction in the training error contributed by that tree to the ensemble of trees.
The output of the similarity framework may be a similarity matrix. The similarity matrix may include object similarity scores for pairs of objects determined from the ensemble of trees. Each object similarity score may be a combination of similarity scores generated by each tree in the ensemble of trees.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities. Although illustrated as a single processor 110 and a single memory 120, the embodiments may be executed on multiple processors and stored in multiple memories.
In some examples, memory 120 may include a non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. In some embodiments, memory 120 may store a similarity framework 130. Similarity framework 130 may be trained using machine learning to identify similarity between objects. Similarity between objects may include a same or similar characteristic, or a set of same or similar characteristics, that satisfies an objective. Similarity may be quantified by an object similarity score. Similarity framework 130 may receive objects 140 as input. Using the objects 140, similarity framework 130 may generate a similarity matrix 150 that includes similarity scores for the objects. A similarity score between a pair of objects in the similarity matrix 150 may identify similarity between the pair of objects.
In some embodiments, the trees are trained using training loss. For example, if there are k number of trees, the set of training loss (TL) at each step may be defined as follows:
TL = (TL_0, TL_1, . . . , TL_{K−1})   Equation (1)
In some embodiments, the training loss may be a monotonically decreasing set of numbers that reflects that the training loss decreases with every step or tree added to the ensemble of trees 202. The training loss at each step may be a result of the performance of all the trees that preceded that step.
Similarity framework 130 may be trained to capture an importance of each tree in the ensemble of trees 202. The importance of a tree in the ensemble of trees 202 may be captured using an importance vector. To compute the importance vector, an absolute difference in the training loss is computed as follows:
s_0 = |TL_1 − TL_0|   Equation (2)
s_i = |TL_i − TL_{i−1}| ∀ i ∈ {1, 2, . . . , K−1}   Equation (3)
Using the absolute difference in the training loss, the final importance weight for a tree may then be determined by normalizing each difference by the sum of all differences, so that the weights sum to one:
w_i = s_i / Σ_{j=0}^{K−1} s_j   Equation (4)
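A minimal sketch of this importance-weight computation is given below, following Equations (2) and (3); the normalization of the weights so that they sum to one is an assumption, made so that the bounded scores described later stay within [0, 1].

```python
def importance_weights(training_losses):
    """training_losses: the monotonically decreasing set (TL_0, ..., TL_{K-1})."""
    K = len(training_losses)
    s = [abs(training_losses[1] - training_losses[0])]           # Equation (2)
    s += [abs(training_losses[i] - training_losses[i - 1])       # Equation (3)
          for i in range(1, K)]
    total = sum(s)  # assumed normalization so the weights sum to one
    return [s_i / total for s_i in s]

# Example: the training loss falls 1.0 -> 0.5 -> 0.3 -> 0.25 as trees are added.
weights = importance_weights([1.0, 0.5, 0.3, 0.25])
```

Trees that reduce the training loss more receive proportionally larger weights.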
Once trees in ensemble of trees 202 are identified and trained and the corresponding weights are determined, similarity framework 130 enters an inference stage. In the inference stage, similarity scores for different objects may be determined. For example, for a given ensemble of trees 202 (also referred to as an ensemble f), similarity framework 130 may determine similarity between two objects X_1 and X_2 as follows. First, similarity framework 130 may propagate the two objects X_1 and X_2 down all the trees within ensemble f by comparing features of objects X_1 and X_2 to properties of the tree nodes until objects X_1 and X_2 reach the leaf nodes. Next, the terminal (leaf) node position of each object in each tree is recorded. Let Z_1 = (Z_{11}, Z_{12}, . . . , Z_{1K}) be the leaf node positions for object X_1 and Z_2 = (Z_{21}, Z_{22}, . . . , Z_{2K}) be the leaf node positions for object X_2. Then, the similarity S between objects X_1 and X_2 across the trees may be determined as follows:
S(X_1, X_2) = Σ_{i=0}^{K−1} I(Z_{1i} == Z_{2i}) w_i   Equation (5)
where I is the indicator function. The corresponding distance between objects X_1 and X_2 may then be defined as:
D(X_1, X_2) = 1 − S(X_1, X_2)   Equation (6)
By construction, D is a number that may range from 0 to 1. Similarity framework 130 repeats this process to determine scores for multiple objectives across the trees in ensemble of trees 202, which results in multiple distances, or tree scores, given by D_{OBJ1}(X_1, X_2), D_{OBJ2}(X_1, X_2), and D_{OBJ3}(X_1, X_2). Similarity framework 130 may combine these distances into a single distance, e.g., a weighted Euclidean distance, which is the overall object similarity score 206, as follows:
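The weighted leaf-node comparison of Equations (5) and (6) can be sketched as follows; the leaf positions and tree weights in the example are illustrative assumptions, not values produced by a real trained ensemble.

```python
def similarity(z1, z2, weights):
    """S(X1, X2) = sum_i I(Z1_i == Z2_i) * w_i (Equation (5))."""
    return sum(w for a, b, w in zip(z1, z2, weights) if a == b)

def distance(z1, z2, weights):
    """D(X1, X2) = 1 - S(X1, X2) (Equation (6))."""
    return 1.0 - similarity(z1, z2, weights)

# Leaf positions of two objects across a four-tree ensemble; the weights are
# chosen to sum to one so that D stays within [0, 1].
z1, z2 = [3, 5, 2, 7], [3, 5, 4, 7]
w = [0.5, 0.25, 0.125, 0.125]
print(similarity(z1, z2, w))  # 0.875 (the objects share leaves in trees 1, 2, and 4)
print(distance(z1, z2, w))    # 0.125
```

Trees where the two objects land in the same leaf contribute their full weight; trees where they diverge contribute nothing.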
Similarity framework 130 may determine similarity between both structured and unstructured objects. When objects X_1 and X_2 are structured objects, e.g., objects with features that may be found in a particular field of an object or quantified, similarity framework 130 may determine similarity score 206 as discussed above. When objects X_1 and X_2 are unstructured objects, e.g., objects with features that are qualitative, such as features included in objects that are text, images, etc., similarity framework 130 may first encode the features of unstructured objects X_1 and X_2 into encodings using encoder 204. Ensemble of trees 202 may be trained on the encodings and use the encodings to determine similarity score 206.
In some embodiments, similarity framework 130 may determine similarity scores for multiple objects.
In some embodiments, the GBM algorithm may be formulated as a function estimation problem that approximates an unknown functional dependence between explanatory data x and a response variable y with an estimate f̂(x), such that some specified loss function Ψ(y, f) is minimized, as follows:
The function estimation problem may be re-written in terms of expectations, where an equivalent formulation is to minimize the expected loss function E_y(Ψ(y, f(x))) over the response variable, conditioned on the observed explanatory data x:
The response variable y may come from different distributions, which leads to the specification of different loss functions Ψ. In particular, if the response variable is binary, i.e., y ∈ {0, 1}, the binomial loss function may be considered. If the response variable is continuous, i.e., y ∈ R, the L2 squared loss function or the robust regression Huber loss function may be used. For other response distributions, specific loss functions may be designed. To make the problem of function estimation tractable, the function search space may be restricted to a parametric family of functions f(x, θ). This changes the function optimization problem into a parameter estimation problem:
Similarity framework 130 may use iterative numerical procedures to perform parameter estimation. In some embodiments, given M iteration steps, where M is an integer, the parameter estimates may be written in an incremental form as follows:
θ̂ = Σ_{i=1}^{M} θ̂_i   Equation (13)
In some embodiments, steepest gradient descent may be used to estimate the parameters. In steepest gradient descent, given N data points {(x_i, y_i)}_{i=1}^{N}, the empirical loss function J(θ) is decreased over the observed data, as follows:
J(θ) = Σ_{i=1}^{N} Ψ(y_i, f(x_i, θ̂))   Equation (14)
The steepest descent optimization procedure may be based on consecutive improvements along the direction of the gradient of the loss function ∇J(θ). Because the parameter estimates θ̂ are built incrementally, the notation distinguishes two forms: the subscript θ̂_t denotes the t-th incremental step of the estimate θ̂, while the superscript θ̂^t denotes the collapsed estimate of the whole ensemble, i.e., the sum of all the estimate increments from step 1 to step t. The steepest descent optimization procedure may be organized as follows.
- First, the parameter estimates θ̂_0 are initialized. Then steps two through five are repeated for each iteration t.
- Second, a compiled parameter estimate θ̂^t is obtained from all of the previous iterations, as follows:
θ̂^t = Σ_{i=1}^{t−1} θ̂_i   Equation (15)
- Third, the gradient of the loss function ∇J(θ) is evaluated, given the obtained parameter estimates of the ensemble:
- Fourth, the new incremental parameter estimate θ̂_t is determined along the negative gradient, as follows:
θ̂_t ← −∇J(θ̂^t)   Equation (17)
- Fifth, the new estimate {circumflex over (θ)}t is added to the ensemble.
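The five steps above can be sketched on a simple least-squares problem. The linear model, the data, the step size, and the iteration count below are illustrative assumptions; the point is only to show increments accumulating into an ensemble estimate.

```python
def steepest_descent(xs, ys, iterations=500, step=0.05):
    """Fit y ~ a + b*x by accumulating increments along the negative gradient."""
    ensemble = [(0.0, 0.0)]                            # Step 1: initialize theta_0
    for _ in range(iterations):
        a = sum(inc[0] for inc in ensemble)            # Step 2: compiled estimate
        b = sum(inc[1] for inc in ensemble)
        n = len(xs)
        # Step 3: gradient of the empirical L2 loss J(theta)
        grad_a = sum(-2.0 * (y - (a + b * x)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2.0 * (y - (a + b * x)) * x for x, y in zip(xs, ys)) / n
        # Steps 4-5: new increment along the negative gradient, added to the ensemble
        ensemble.append((-step * grad_a, -step * grad_b))
    return sum(inc[0] for inc in ensemble), sum(inc[1] for inc in ensemble)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x for x in xs]  # the true parameters are a=1, b=2
a, b = steepest_descent(xs, ys)
```

The final estimate is the sum of all increments, mirroring Equation (13).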
In some embodiments, similarity framework 130 may perform the optimization in a function space. In this case, the function estimate f̂ is parameterized in the additive functional form:
f̂(x) = f̂^M(x) = Σ_{i=0}^{M} f̂_i(x)   Equation (18)
where M is the number of iterations, f̂_0 is the initial guess, and {f̂_i}_{i=1}^{M} are the function increments, also referred to as "boosts".
In some embodiments, the parameterized "base-learner" functions h(x, θ) may be distinguished from the overall ensemble function estimate f̂(x). Different families of base-learner functions, such as decision trees or spline functions, may be selected.
In a "greedy stagewise" approach for incrementing the function with the base-learners, the optimal step-size ρ may be specified at each iteration. For the function estimate at the t-th iteration, the optimization rule may be defined as follows:
In some embodiments, similarity framework 130 may specify both the loss function Ψ(y, f) and the base-learner functions h(x, θ) arbitrarily, on demand. In some embodiments, a new function h(x, θ_t) may be chosen to be the most parallel to the negative gradient {g_t(x_i)}_{i=1}^{N} along the observed data:
In this way, instead of looking for a general solution for the boost increment in the function space, the new function increment is chosen to be the most correlated with −g_t(x). This replaces the general optimization task with a least-squares minimization task:
In some embodiments, the GBM algorithm may be implemented using Python or another programming language known in the art. The loss function Ψ(y, f) may be the L2 loss. The GBM algorithm may train trees on residual vectors or sign vectors.
In some embodiments, the base-learner function h(x, θ) that may be used is a decision tree stump, and the total number of leaf nodes may be restricted to a configurable number, e.g., sixteen leaf nodes.
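A from-scratch sketch of this setup, combining the L2 loss, trees fit to residual vectors, and a one-split stump base learner, is given below. The data, learning rate, and number of boosting rounds are illustrative assumptions, and a real implementation would use deeper trees with a leaf-node cap.

```python
def fit_stump(xs, residuals):
    """Fit a one-split decision tree stump to the residual vector."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def predict(x, base, stumps, learning_rate=0.5):
    return base + learning_rate * sum(stump(x) for stump in stumps)

def fit_gbm(xs, ys, rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # initial guess: the mean response
    stumps = []
    for _ in range(rounds):
        preds = [predict(x, base, stumps, learning_rate) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]  # negative L2 gradient
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.0, 1.0]
base, stumps = fit_gbm(xs, ys)
```

Each round fits a stump to the current residuals, which are exactly the negative gradient of the L2 loss, so the ensemble prediction converges toward the training targets.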
As illustrated in the corresponding figure, the per-tree similarity scores between pairs of the objects, e.g., between objects A and C in tree 302A and between objects B and C in tree 302A, may be computed in this manner; the specific score values are shown in the figure.
To determine an optimal hyperparameter, multiple trees may be generated for each hyperparameter and scored. For example, for each hyperparameter, the features may be divided into a training dataset and a validation dataset. The trees, including properties and values of the properties at each node, may be generated with the GBM algorithm using the hyperparameter and the features in the training dataset. The trees may be validated with the features in the validation dataset, which validates that objects in the dataset meet a particular objective. The trees may also be scored. After the trees based on the hyperparameter are generated, the hyperparameter may be scored by averaging the scores from the trees. An optimal hyperparameter may be determined using an "argmin" function, or another function, based on the scores associated with different hyperparameters. The "argmin" function, for example, identifies the hyperparameter associated with the lowest hyperparameter score from the scored hyperparameters. The lowest hyperparameter score corresponds to the minimal loss discussed above.
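The selection procedure above can be sketched as follows. The scoring function, the candidate values, and the number of trees per candidate are illustrative assumptions; a real system would train and validate actual trees at each candidate.

```python
def best_hyperparameter(candidates, score_tree, trees_per_candidate=5):
    """Average per-tree validation scores per candidate, then take the argmin."""
    hyper_scores = {}
    for hp in candidates:
        tree_scores = [score_tree(hp, seed) for seed in range(trees_per_candidate)]
        hyper_scores[hp] = sum(tree_scores) / len(tree_scores)
    # The "argmin" step: the candidate with the lowest averaged score wins.
    return min(hyper_scores, key=hyper_scores.get)

# Toy scoring function whose validation loss is minimized when the depth is 8.
print(best_hyperparameter([4, 8, 12], lambda hp, seed: (hp - 8) ** 2 + seed))  # 8
```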
As illustrated in the figures, the trees associated with different hyperparameters may include, but are not limited to, anywhere from five to three hundred trees and may have tree depths anywhere from five to sixteen nodes. During training, the features and the values of the properties associated with each node are also determined.
At process 502, features are determined. For example, similarity framework 130 is trained on input data, which may be a training dataset of features. The features may be specific to objects of a particular type and may be extracted from an object. Features may be static, dynamic, or engineered. Static features may be features that do not change over a period of time. Dynamic features may be features that change over a period of time. Engineered features may be created using static and dynamic features. In some embodiments, when objects include unstructured data, static, dynamic, and engineered features may be encoded into structured features using encoder 204.
At process 504, an ensemble of trees is generated. For example, similarity framework 130 may generate an ensemble of trees 202 using the features and the GBM algorithm. The trees in the ensemble of trees 202 may be constructed to minimize a variance of the features. Specifically, similarity framework 130 constructs and reconstructs trees using a base function that receives features as input and generates labels, such that the function loss during the reconstruction is minimized. Each tree in the ensemble of trees 202 may include one or more properties at each node of the tree, with the exception of the leaf nodes, as illustrated in the figures.
At process 506, tree importance for each tree in the ensemble of trees is determined. For example, similarity framework 130 may determine an importance of each tree in the ensemble of trees 202 by determining the accuracy of the ensemble of trees 202 before and after each tree is added to ensemble of trees 202. The tree importance may correspond to how important the tree is to determining similarity between objects 140. The measure of the importance may be a weight having a value between zero and one.
Once method 500 completes, the similarity framework 130 has generated ensemble of trees 202 and determined the measure of importance of each tree in ensemble of trees 202. At this point, similarity framework 130 may enter an inference stage where the similarity framework 130 determines similarity between objects 140.
At process 602, objects are received. For example, similarity framework 130 receives objects 140. The objects 140 may be the same type of objects that were used to train similarity framework 130 to generate the ensemble of trees 202.
At process 604, objects are propagated through trees in the ensemble of trees. For example, similarity framework 130 propagates objects 140 received in process 602 through each tree in ensemble of trees 202 until objects 140 reach the leaf nodes of the trees. Typically, each object 140 may be propagated through each tree in the ensemble of trees 202. As the objects are propagated, the similarity framework 130 compares the features of the object 140 to properties of the nodes of the tree in the object's path to the leaf node.
At process 606, a similarity score for pairs of objects is determined. For example, similarity framework 130 may determine a similarity score 206 for every pair of objects. First, similarity framework 130 determines a similarity score for the pair of objects in each tree in ensemble of trees 202. In one instance, the similarity score for a pair of objects in the same tree may be one if the objects share the same leaf node and zero otherwise. In another instance, the similarity score may be a measure of a distance between the leaf node(s) of the tree that store the pair of objects. The similarity score for the pair of objects in each tree may be determined based on the tree distance and the tree height. For example, the similarity score may be the distance from the root node to the last node that the two objects share, divided by the depth of the tree. In some embodiments, the similarity score is further adjusted based on the tree importance. The object similarity score 206 may then be determined by combining the similarity scores for the pair of objects from each tree in the ensemble of trees 202. Process 606 repeats until similarity framework 130 determines similarity scores 206 for all pairs of objects in objects 140.
At process 608, a similarity matrix is generated. For example, the similarity score 206 for all pairs of objects determined in process 606 is stored in the similarity matrix 150.
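Processes 602 through 608 can be sketched end to end as follows. The callable "trees" and their leaf assignments are illustrative stand-ins for a trained ensemble, and the weights are assumed to sum to one.

```python
from itertools import combinations

def build_similarity_matrix(objects, trees, weights):
    """trees: list of callables mapping an object to the id of its leaf node."""
    n = len(objects)
    matrix = [[1.0] * n for _ in range(n)]  # an object is fully similar to itself
    for i, j in combinations(range(n), 2):
        leaves_i = [tree(objects[i]) for tree in trees]  # process 604: propagate
        leaves_j = [tree(objects[j]) for tree in trees]
        # process 606: weighted share of trees where the leaf nodes match
        score = sum(w for a, b, w in zip(leaves_i, leaves_j, weights) if a == b)
        matrix[i][j] = matrix[j][i] = score              # process 608: store
    return matrix

# Two toy "trees" that bucket a number by its sign and by its magnitude.
trees = [lambda x: x > 0, lambda x: abs(x) > 1]
weights = [0.5, 0.5]
m = build_similarity_matrix([0.5, 2.0, -0.5], trees, weights)
```

Because the score is symmetric, each pair is computed once and stored in both matrix positions.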
Going back to the embodiments above, objects 140 may be transactions. When objects 140 are transactions, similarity framework 130 may be trained on training data that includes transaction features for a predefined objective. Once trained, similarity framework 130 may identify, e.g., for a fraud objective, similar and different transactions based on the transaction features. The transactions, or a cluster of transactions, that similarity framework 130 determines as different may be considered outliers. An outlier transaction is a transaction that has different features from other transactions, or a transaction that is not represented in the training dataset. Outlier transactions may be indicative of fraud. In another example, outlier transactions may be indicative of data errors elsewhere in a transaction processing system. For example, suppose similarity framework 130 is trained on a training dataset that includes previous transactions that passed through a transaction system and that are known to be genuine or include valid data. During an inference stage, similarity framework 130 may identify outlier transactions, i.e., transactions that are different from the transactions that previously passed through the transaction system and were included in the training data. Different transactions may be transactions that have a similarity score below a similarity threshold for one or more entries in similarity matrix 150. These transactions may be indicative of fraudulent transactions or transactions that included erroneous data.
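One way to sketch this outlier flagging is below: a transaction whose best similarity to every other transaction falls below a threshold is flagged. The threshold value and the matrix entries are illustrative assumptions.

```python
def outlier_indices(similarity_matrix, threshold=0.3):
    """Flag rows whose best off-diagonal similarity is below the threshold."""
    outliers = []
    for i, row in enumerate(similarity_matrix):
        best = max(score for j, score in enumerate(row) if j != i)
        if best < threshold:
            outliers.append(i)
    return outliers

matrix = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],  # transaction 2 is dissimilar to the others
]
print(outlier_indices(matrix))  # [2]
```

Flagged indices would then be routed to fraud review or data-quality checks.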
In some embodiments, similarity framework 130 may quantify uncertainty in the data. For example, similarity framework 130 may be trained on a training dataset that includes objects. Once trained, similarity framework 130 may receive objects and determine how similar an object in objects 140 is to the data in the training dataset. An object that is not similar may be considered an outlier or be out of training distribution. In some instances, similarity framework 130 may also include a classifier. The classifier may indicate how similar object 140 may be to data in the training dataset.
In some embodiments, similarity framework 130 may identify securities with similar characteristics.
In some embodiments, trees in ensemble of trees 202 may be trained using various similarity objectives. Example similarity objectives may include option adjusted spreads (OAS), yield, BAS, and various yield returns, such as 1-week, 2-week, 1-month, 3-month, and 6-month yield returns. There may be one tree in ensemble of trees 202 for each objective or each tree may model multiple similarity objectives, such as a combination of OAS and yield, a combination of OAS, yield, and BAS, or a combination of multiple yield returns. In some embodiments, the trees 202 may include a combination of objectives. An example combination may be 50% OAS and 50% yield. Notably, the combination of different objectives may be flexible and trees 202 may be trained and retrained using different objectives or a combination of objectives.
As discussed above, each tree in ensemble of trees 202 may be trained using various features. To determine similarity between securities 740, similarity framework 130 may train trees in ensemble of trees 202 using features such as ticker (e.g., an issue of the bond), dtm (days to maturity), an industry group that the security belongs to, rating of the security, age of the security, market where the security was originally issued, and/or currency. Notably, the list of features above is not limiting, and each tree may be trained on a combination of one or more features.
Similarity matrix 150 may be an n by n matrix that has securities 740 (or a numerical representation of securities 740) as rows and columns. Each entry in similarity matrix 150 identifies a similarity between a pair of securities in securities 740. The similarity entry may have a value between zero and one. A value of one may indicate that the pair of securities are perfectly similar, a value between 0.8 and 1.0 may indicate that the pair of securities are strongly similar, a value between 0.6 and 0.8 may indicate that the pair of securities are moderately similar, a value between 0.3 and 0.6 may indicate that the pair of securities are fairly similar, a value between 0.0 and 0.3 may indicate that the pair of securities are poorly similar, and a value of zero may indicate that the pair of securities are not similar.
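Mapping a matrix entry to a qualitative band can be sketched as follows, assuming bands of 0.8-1.0 (strongly), 0.6-0.8 (moderately), 0.3-0.6 (fairly), and 0.0-0.3 (poorly); the handling of values exactly on a band edge is an assumption for the example.

```python
def similarity_band(score):
    """Translate a similarity-matrix entry into its qualitative band."""
    if score == 1.0:
        return "perfectly similar"
    if score >= 0.8:
        return "strongly similar"
    if score >= 0.6:
        return "moderately similar"
    if score >= 0.3:
        return "fairly similar"
    if score > 0.0:
        return "poorly similar"
    return "not similar"

print(similarity_band(0.85))  # strongly similar
```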
In some embodiments, similarity matrix 150 may be used to generate clusters of similar securities, such as clusters 706 illustrated in the figures.
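A minimal sketch of forming such clusters from the similarity matrix is below, using a simple single-link grouping at a similarity threshold; the threshold and the matrix entries are assumptions, and a production system might use a dedicated clustering algorithm instead.

```python
def cluster(similarity_matrix, threshold=0.8):
    """Group indices whose pairwise similarity meets the threshold (union-find)."""
    n = len(similarity_matrix)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity_matrix[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

m = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(cluster(m))  # [[0, 1], [2]]
```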
Identifying similar securities has numerous benefits. For example, if one security is a bond that cannot be sourced or obtained from a bond market, similarity framework 130 may identify a second security that is a bond with similar characteristics to the first one, but that can be sourced from a bond market. In another example, when a trading portfolio is constructed, similarity framework 130 may identify securities that are similar to securities in the portfolio but that have more desirable properties, such as better tradability and pricing than the securities in the original portfolio. Better tradability may be considered to be greater liquidity of the security, and better pricing may be considered to be a better bid or offer price for the security. The securities identified using similarity framework 130 may then replace the similar securities in the portfolio that cannot be easily sourced or traded. An example security may belong to any asset class, such as equities, mortgage securities, municipal bonds, etc. The similar securities may be selected from similarity matrix 150 or from clusters 706. Similar securities may be securities that have a similarity score above a predefined similarity threshold as determined by similarity framework 130. For example, based on one or more of the factors, a higher similarity score (e.g., greater than 0.8 out of 1.0) may be needed to replace a security with a similar security.
Notably, although examples herein discuss similarity between securities, the embodiments may also be directed to other instruments, such as fund(s), exchange-traded funds (ETFs), bonds, etc., and/or a combination of instruments, such as a combination that includes one or more of ETFs, portfolios, single securities, and/or funds.
In some instances, to determine liquidity of securities, each security may be scored using a liquidity algorithm. Notably, the liquidity algorithm may be instrument specific, and the liquidity algorithm used to determine the liquidity of securities may be different from the liquidity algorithm used to determine liquidity of funds, ETFs, or other types of instruments. The liquidity algorithm may assign a liquidity score to a security that identifies the security as a liquid or a low liquidity security. For example, the score may be between 0 and 100, in some embodiments, where a score that is greater than, e.g., 70 identifies a security as liquid security 806 and a score that is less than, e.g., 30 identifies a security as low liquidity security 808. To determine a score for the security, the liquidity algorithm may use one or more rules, each associated with a score. Example rules may include whether a security may or may not be traded, whether a security may be traded but a trade is more expensive as compared to other securities, or whether a buyer pays a premium for liquidity of a security. The liquidity score may be a combination of scores from the one or more rules.
In some embodiments, the low liquidity securities may be passed to similarity framework 130. Similarity framework 130 may be trained on various securities accessible to a trading system 810. Similarity framework 130 may store similarity matrix 150, described above, that has already been generated. Similarity framework 130 may generate similarity matrix 150 at predefined intervals, such as every week, bi-monthly, monthly, etc., on demand, or under specific market conditions. In some embodiments, similarity framework 130 may use similarity matrix 150 to identify liquid securities 806A that are similar (have a similarity score that is greater than a predefined similarity threshold) to low liquidity securities 808. Liquid securities 806A may be more liquid than low liquidity securities 808, but have the same or similar objectives and/or the same or similar purchase or sale price, in some embodiments. As discussed above, low liquidity securities 808 may include illiquid bonds that are difficult to trade, while liquid securities 806A may be liquid bonds that are easier to trade, but with the same or similar objective as low liquidity securities 808.
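The substitution lookup can be sketched as follows; the security names, the dictionary-based view of similarity matrix 150, and the threshold value are hypothetical, introduced only for the example.

```python
def substitute(low_liquidity, liquid, similarity, threshold=0.8):
    """Map each low-liquidity security to its most similar liquid peer.

    similarity: dict mapping (security_a, security_b) -> similarity score,
    standing in for entries of a similarity matrix.
    """
    replacements = {}
    for sec in low_liquidity:
        candidates = [(similarity.get((sec, other), 0.0), other) for other in liquid]
        best_score, best = max(candidates)
        if best_score >= threshold:  # only substitute above the threshold
            replacements[sec] = best
    return replacements

# Hypothetical matrix entries for one hard-to-source bond and two liquid peers.
sim = {("BOND_X", "BOND_A"): 0.85, ("BOND_X", "BOND_B"): 0.4}
print(substitute(["BOND_X"], ["BOND_A", "BOND_B"], sim))  # {'BOND_X': 'BOND_A'}
```

Securities with no sufficiently similar liquid peer are simply left unmapped.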
Trading system 810 may receive, purchase, and/or sell easily tradeable securities 806 and 806A. Trading system 810 may trade liquid securities 806 and 806A internally, using trading systems, including electronic trading systems of other vendors, financial institutions, broker/dealers, or using external exchanges. In some embodiments, trading liquid securities 806 and 806A, rather than liquid securities 806 and low liquidity securities 808, may increase execution of orders requested using pre-trade analytics module 804. For example, suppose pre-trade analytics module 804 generates an order that includes securities 806 and 808. Trading system 810 may execute 60% of the order using various exchanges. However, when similarity framework 130 replaces securities 808 in the order with securities 806A, trading system 810 may be able to execute 80% of the order. In another example, investing investment 802 using securities 806 and 806A rather than securities 806 and 808 may allow for an improved transaction cost while maintaining a similar risk for investment 802.
In the embodiments discussed above, the pre-trade analytics module 804 may generate or optimize an investment strategy that may include liquid securities 806 and illiquid securities 808, which may then be substituted using similarity framework 130. Notably, those embodiments are exemplary only, and the embodiments discussed above may also apply to generating and/or optimizing portfolios or portions of portfolios that include multiple securities 806 and 808, where some or all of securities 808 may be replaced with securities 806A. The embodiments may also be applied to optimizing funds or ETFs. Further, although the embodiments above describe substituting securities 808 with securities 806A based on liquidity, the embodiments may also apply to substituting securities 808 with similar securities based on a different target. Example targets include an improved transaction cost as compared to securities 808, a decreased risk as compared to securities 808, etc.
Using a machine learning algorithm, such as a GBM algorithm, input features 910, and objective(s) 912, similarity framework 130 may be trained to generate ensemble of trees 202. Ensemble of trees 202 may include trees 914A, 914B, 914C, through 914N. The number of trees 914A-N may be determined using a hyperparameter. Each one of trees 914A-N in ensemble of trees 202 may be trained on different parameters for objective(s) 912, such as a different change in spread. For example, tree 914A may be trained on a spread range at time T1, tree 914B may be trained on a spread range at time T2, etc. In some embodiments, similarity framework 130 may generate ensemble of trees 202 as discussed in
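The description above implies a tree structure in which each internal node stores a property and a corresponding value, and the number of trees is set by a hyperparameter. The following is a minimal sketch of one such representation; the dictionary layout, function names, and toy thresholds are assumptions for illustration, not the patent's actual GBM training procedure.

```python
# Illustrative sketch: each internal node stores a property (a feature index)
# and a value (a split threshold); leaves carry an identifier. The number of
# trees in the ensemble is controlled by a hyperparameter.
N_TREES = 3  # hyperparameter: size of the ensemble (trees 914A-N)

def make_node(feature, threshold, left, right):
    """Internal node: compare feature `feature` of an object to `threshold`."""
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right}

def make_leaf(leaf_id):
    """Leaf node: stores only an identifier, no split values."""
    return {"leaf": leaf_id}

# A toy ensemble in which each tree splits on a different feature, standing
# in for trees trained on different objective parameters (e.g., different
# spread ranges at times T1, T2, ...).
ensemble = [
    make_node(feature=t, threshold=0.5,
              left=make_leaf(0), right=make_leaf(1))
    for t in range(N_TREES)
]
print(len(ensemble))  # 3
```

In a real GBM, the per-node feature and threshold would be chosen during training; here they are fixed by hand so the structure is easy to inspect.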
Once similarity framework 130 generates ensemble of trees 202, securities 940 may be passed through each tree 914A-N in ensemble of trees 202. Similarity framework 130 may score pairs of securities in securities 940 by determining a similarity score between the two securities in each tree in trees 914A-N. For example, similarity framework 130 may determine a score between a pair of securities for tree 914A, tree 914B, etc. Next, similarity framework 130 may combine the scores for the pair of securities from all trees 914A-N in ensemble of trees 202 into a combined security similarity score and store the combined similarity score in similarity matrix 150. The similarity matrix 150 may store combined scores for different pairs of securities in securities 940. Different ways to determine the security similarity scores are described with reference to
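The propagation and scoring steps above can be sketched as follows. This is an illustrative example under assumed conventions: each node holds a (feature, threshold) pair, a pair of securities scores 1 in a tree when both reach the same leaf and 0 otherwise (one of the scoring rules recited in the claims), and the per-tree scores are combined by an equal-weight average.

```python
# Illustrative sketch: route each object through a tree by comparing its
# features to each node's property/value, then score a pair of objects per
# tree and combine the per-tree scores into one similarity score.
def propagate(tree, x):
    """Route feature vector x to a leaf id by comparing features to thresholds."""
    node = tree
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

def pair_similarity(ensemble, x_a, x_b):
    """Per tree: 1 if both objects reach the same leaf, else 0; then average."""
    scores = [1.0 if propagate(t, x_a) == propagate(t, x_b) else 0.0
              for t in ensemble]
    return sum(scores) / len(scores)

# Two toy trees splitting on different features.
tree_1 = {"feature": 0, "threshold": 0.5,
          "left": {"leaf": 0}, "right": {"leaf": 1}}
tree_2 = {"feature": 1, "threshold": 0.0,
          "left": {"leaf": 0}, "right": {"leaf": 1}}
ens = [tree_1, tree_2]

# Same leaf in tree_1, different leaves in tree_2 -> combined score 0.5.
print(pair_similarity(ens, [0.2, -1.0], [0.4, 1.0]))  # 0.5
```

The combined score would then be stored as one entry of the similarity matrix for that pair of securities.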
A clustering module 904 may use a clustering algorithm to generate clusters 916A-M of similar securities from the similarity matrix 150. The clusters 916A-M may be configured within clustering module 904. Clusters 916A-M may also include liquid securities, such as securities 806 and exclude low liquidity securities, such as securities 808, in some embodiments. A price prediction module 918 may predict a cluster pricing from each cluster of securities in clusters 916A-M, such that each cluster of securities in clusters 916A-M may be traded at the predicted cluster price 920A-M.
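The specific clustering algorithm used by clustering module 904 is not named above. As one illustrative possibility only, the sketch below forms clusters by linking every pair of securities whose similarity exceeds a threshold and taking the connected components via union-find; the function names and threshold are assumptions.

```python
# Illustrative sketch: threshold-based clustering from a similarity matrix.
# Any pair with similarity above `threshold` is linked; clusters are the
# connected components, computed with a small union-find structure.
def cluster_by_similarity(sim, threshold=0.8):
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)  # union similar pair

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Securities 0 and 1 are highly similar; security 2 stands alone.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(cluster_by_similarity(sim))  # [[0, 1], [2]]
```

Each resulting cluster could then be handed to a price prediction module so that all securities in the cluster may be traded at one predicted cluster price.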
At process 1002, an ensemble of trees in a similarity framework is determined. For example, similarity framework 130 receives input features 910 from securities 940 and one or more objectives 912. Based on the input features 910 and objective(s) 912, similarity framework 130 uses a machine learning algorithm, such as a GBM algorithm, to generate ensemble of trees 202, where each node in each tree stores properties and values that correspond to the properties.
At process 1004, a similarity matrix is determined. For example, similarity framework 130 propagates securities 940 through each one of trees 914A-N in ensemble of trees 202 until securities 940 reach the leaf nodes of the trees 914A-N. Typically, each one of securities 940 may be propagated through each tree in the ensemble of trees 202. As the securities 940 are propagated, the similarity framework 130 compares the features of each security to a corresponding property and the property's value stored in a node of the tree in the security's path. A score for each pair of securities in each tree is determined based on a distance between the leaf node(s) that store the securities in each tree 914A-N. A combined score for each pair of securities is determined by combining the scores for that pair from individual trees 914A-N in ensemble of trees 202 into a security similarity score. In some instances, the scores from each one of trees 914A-N may be adjusted by an importance of the corresponding tree as compared to other trees in ensemble of trees 202. Similarity matrix 150 is generated using the security similarity scores for every pair of securities.
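The importance adjustment mentioned above can be sketched as a weighted combination of per-tree scores. The normalized weighted average below is an assumed combination scheme for illustration; the source does not fix the exact formula.

```python
# Illustrative sketch: adjust per-tree pair scores by tree importance
# weights, then combine them into a single security similarity score.
def combined_score(per_tree_scores, importances):
    """Normalized importance-weighted average of per-tree similarity scores."""
    weighted = sum(s * w for s, w in zip(per_tree_scores, importances))
    return weighted / sum(importances)

# Three trees scored a pair of securities; the first tree is twice as
# important as the other two.
scores = [1.0, 0.0, 1.0]
weights = [2.0, 1.0, 1.0]
print(combined_score(scores, weights))  # 0.75
```

The resulting value would be stored as the pair's entry in the similarity matrix; with equal weights the formula reduces to the plain average of the per-tree scores.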
At process 1006, a trading strategy is generated. For example, a pre-trade analytics module 804 may receive investment 802, which may be a monetary investment. Using investment 802, pre-trade analytics module 804 may generate an investment strategy that includes one or more securities or a basket of securities. The securities in the investment strategy may be liquid securities 806, low liquidity securities 808, a combination of both, or any other type of security.
At process 1008, low liquidity securities are substituted with liquid securities. For example, similarity framework 130 may use similarity matrix 150 to identify liquid securities 806A that are similar, that is, may have the same or similar price and objective as low liquidity securities 808, but have higher liquidity. The liquid securities 806A are substituted for low liquidity securities 808 in the investment strategy.
At process 1010, the investment strategy is executed. For example, trading system 810 may execute the investment strategy using liquid securities 806 and liquid securities 806A.
At process 1102, an ensemble of trees in a similarity framework is determined. For example, similarity framework 130 receives input features 910 from securities 940 and one or more objectives 912. Based on the input features 910 and objective(s) 912, similarity framework 130 uses a machine learning algorithm, such as a GBM algorithm, to generate ensemble of trees 202, where each node in each tree stores values of at least one input feature in input features 910 that direct the path of a security as the security is propagated through the tree. In some embodiments, leaf nodes may not store such values.
At process 1104, a similarity matrix is determined. For example, similarity framework 130 propagates securities 940 through each one of trees 914A-N in ensemble of trees 202 until securities 940 reach the leaf nodes of the trees 914A-N. Typically, each security in securities 940 may be propagated through each one of trees 914A-N in the ensemble of trees 202. As the securities 940 are propagated, the similarity framework 130 compares the features of the securities 940 to corresponding properties and values of properties stored at each node. A score for each pair of securities at the leaf nodes in each tree is determined based on the distance between the one or more leaf nodes that store the two securities in each tree. A security similarity score for each pair of securities is determined by combining the individual scores for that pair from individual trees 914A-N in ensemble of trees 202. In some instances, the scores from each one of trees 914A-N may be adjusted by an importance of the trees. Similarity matrix 150 is generated using the security similarity scores for every pair of securities in securities 940.
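Besides the same-leaf rule, the claims also recite a depth-based score derived from the deepest common node between a pair's root-to-leaf paths and the depth of the tree. The sketch below illustrates one way to compute such a score; normalizing by tree depth is an assumed choice, and the node representation matches the toy trees used earlier.

```python
# Illustrative sketch: per-tree pair score based on the deepest common node.
# Two objects that share more of their root-to-leaf path score higher.
def path_to_leaf(tree, x):
    """Return the sequence of node ids visited from the root to a leaf."""
    path, node = [], tree
    while "leaf" not in node:
        path.append(id(node))
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    path.append(id(node))
    return path

def depth_score(tree, depth, x_a, x_b):
    """Depth of the deepest common node (root = 0), normalized by tree depth."""
    pa, pb = path_to_leaf(tree, x_a), path_to_leaf(tree, x_b)
    common = 0
    for a, b in zip(pa, pb):
        if a != b:
            break
        common += 1
    return (common - 1) / depth

# A depth-2 toy tree: the left subtree splits again on feature 1.
tree = {"feature": 0, "threshold": 0.5,
        "left": {"feature": 1, "threshold": 0.0,
                 "left": {"leaf": 0}, "right": {"leaf": 1}},
        "right": {"leaf": 2}}

# Both objects go left at the root, then diverge -> deepest common depth 1 of 2.
print(depth_score(tree, 2, [0.2, -1.0], [0.2, 1.0]))  # 0.5
```

A pair that reaches the same leaf would score 1.0 under this rule, and a pair that diverges at the root would score 0.0, so it generalizes the binary same-leaf rule.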
At process 1106, clusters of similar securities are determined. For example, clustering module 904 may determine clusters 916A-M from similarity matrix 150. In some embodiments, the low liquidity securities, e.g., securities 808, may be excluded from clusters 916A-M, such that clusters 916A-M may include similar securities that are liquid securities, e.g., securities 806.
At process 1108, clusters of similar securities are priced. For example, price prediction module 918 may determine a cluster price 920A-M for clusters 916A-M.
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of methods 500, 600, 1000, and 1100. Some common forms of machine readable media that may include the processes of methods 500, 600, 1000, and 1100 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
Claims
1. A method for determining similarity, the method comprising:
- generating, using a machine learning similarity framework, an ensemble of trees using features corresponding to first securities and at least one objective; and
- determining a similarity matrix, the similarity matrix storing security similarity scores indicating similarities between pairs of securities in second securities, wherein determining the similarity matrix comprises: for each tree in the ensemble of trees: propagating the second securities through the each tree until the second securities reach leaf nodes of the each tree, wherein the propagating compares at least one feature associated with each second security in the second securities to at least one property of at least one node associated with the each tree; and determining a similarity score for each pair of securities by calculating a distance between the each pair of securities in the each tree; for the each pair of securities, combining similarity scores from the each tree in the ensemble of trees into a security similarity score; and storing the security similarity scores from the pairs of securities in the similarity matrix.
2. The method of claim 1, further comprising:
- generating a strategy, the strategy including at least one low liquidity security, wherein the at least one low liquidity security has liquidity below a low liquidity threshold;
- determining, by comparing security similarity scores between the at least one low liquidity security and the second securities in the similarity matrix, at least one liquid security, wherein the security similarity scores between the at least one liquid security and the at least one low liquidity security are above a similarity threshold; and
- substituting the at least one low liquidity security with the at least one liquid security in the strategy.
3. The method of claim 1, further comprising:
- generating, using the similarity matrix, a cluster of similar securities, wherein the similar securities in the cluster have security similarity scores above a similarity threshold.
4. The method of claim 3, further comprising:
- determining liquid securities in the cluster; and
- determining at least one cluster price for the cluster based on the liquid securities.
5. The method of claim 1, further comprising:
- generating a portfolio of securities, the portfolio including at least one low liquidity security, wherein the low liquidity security has liquidity below a liquidity threshold;
- identifying, using the similarity matrix, a second security in the second securities having a liquidity greater than the low liquidity security, wherein the second security and the low liquidity security have a security similarity score above a similarity threshold; and
- substituting the low liquidity security in the portfolio with the second security.
6. The method of claim 1, wherein determining the similarity score for the each pair of securities further comprises:
- assigning the similarity score of one when the each pair of securities is associated with a same leaf node in the leaf nodes of the each tree; or
- assigning the similarity score of zero when the each pair of securities is associated with different leaf nodes in the leaf nodes of the each tree.
7. The method of claim 1, wherein determining the similarity score for the each pair of securities further comprises:
- determining a deepest common node between the each pair of securities in the each tree; and
- determining a depth of the each tree, wherein determining the similarity score is further based on the deepest common node and the depth of the each tree.
8. The method of claim 1, wherein determining the similarity score for the each pair of securities further comprises:
- determining a tree importance weight for the each tree in the ensemble of trees; and
- adjusting the similarity score for the each pair of securities by the tree importance weight.
9. The method of claim 1, further comprising:
- generating a tree to be included in the ensemble of trees;
- adding the tree in the ensemble of trees; and
- determining a tree importance weight of the tree in the ensemble of trees by: calculating an error of the ensemble of trees before and after adding the tree to the ensemble of trees, wherein determining the tree importance weight is further based on the error.
10. The method of claim 1, wherein the machine learning similarity framework uses a base function, a loss function, and at least one hyperparameter to generate the ensemble of trees.
11. The method of claim 10, further comprising:
- determining, using the machine learning similarity framework, a steepest gradient descent of the loss function; and
- estimating, based at least in part on the steepest gradient descent, the at least one property of at least one node in the each tree.
12. A system comprising:
- a memory configured to store instructions for a machine learning similarity framework;
- a processor coupled to the memory and configured to read the instructions from the memory to cause the system to perform operations, the operations comprising: generating, using the machine learning similarity framework, an ensemble of trees using features corresponding to first securities, an objective, a base function, and a loss function; and determining a similarity matrix, the similarity matrix storing security similarity scores indicating similarities between pairs of securities in second securities, wherein determining the similarity matrix comprises: for each tree in the ensemble of trees: propagating the second securities through the each tree until the second securities reach leaf nodes of the each tree, wherein the propagating compares at least one feature associated with each second security in the second securities to at least one property of at least one node associated with the each tree; and determining a similarity score for each pair of securities by calculating a distance between at least one leaf node storing the each pair of securities in the each tree; for the each pair of securities, combining the similarity score from the each tree in the ensemble of trees into a security similarity score; and storing the security similarity scores for the pairs of securities in the similarity matrix.
13. The system of claim 12, wherein the operations further comprise:
- generating a strategy, the strategy including at least one low liquidity security;
- determining, by comparing the at least one low liquidity security to liquid securities in the similarity matrix, at least one liquid security having a security similarity score with the at least one low liquidity security above a predefined threshold; and
- substituting the at least one low liquidity security with the at least one liquid security in the strategy.
14. The system of claim 12, wherein the operations further comprise:
- generating, using the similarity matrix, a cluster of similar securities, securities in the cluster having security similarity scores above a similarity threshold;
- determining liquid securities in the cluster, wherein the liquid securities have a liquidity above a liquidity threshold; and
- determining at least one cluster price for the cluster based on the liquid securities.
15. The system of claim 12, wherein the operations further comprise:
- generating a portfolio of securities, the portfolio including at least one low liquidity security;
- identifying, using the similarity matrix, a second security that has a liquidity greater than the low liquidity security, wherein the second security and the low liquidity security have a security similarity score above a threshold; and
- substituting the low liquidity security in the portfolio with the second security.
16. The system of claim 12, wherein the operations further comprise:
- determining a deepest common node in the each tree between the each pair of securities; and
- determining a depth of the each tree, wherein determining the similarity score is further based on the deepest common node and the depth of the tree.
17. The system of claim 12, wherein the operations further comprise:
- determining a tree importance weight for the each tree in the ensemble of trees; and
- adjusting the similarity score for the each pair of securities by the tree importance weight.
18. The system of claim 12, wherein the operations further comprise:
- generating a tree to be included in the ensemble of trees;
- adding the tree in the ensemble of trees; and
- determining a tree importance weight of the tree in the ensemble of trees by: calculating an error of the ensemble of trees before and after adding the tree to the ensemble of trees; wherein determining the tree importance weight is further based on the error.
19. A non-transitory computer-readable medium having instructions thereon, that when executed by a processor, cause the processor to perform operations for determining similarity, the operations comprising:
- generating, using a machine learning similarity framework, an ensemble of trees using features corresponding to first securities and a hyperparameter; and
- determining a similarity matrix, the similarity matrix storing security similarity scores indicating similarities between pairs of securities in second securities, wherein determining the similarity matrix comprises: for each tree in the ensemble of trees: propagating the second securities through the each tree until the second securities reach leaf nodes of the each tree, wherein the propagating compares at least one feature associated with each second security in the second securities to at least one property of at least one node associated with the each tree; and determining a similarity score for each pair of securities by calculating a distance between the each pair of securities in the each tree; for the each pair of securities, combining the similarity score from the each tree in the ensemble of trees into a security similarity score; and storing the security similarity scores for the pairs of securities in the similarity matrix.
20. The non-transitory computer-readable medium of claim 19, wherein rows and columns of the similarity matrix identify the second securities, and entries in the similarity matrix identify the security similarity scores associated with the pairs of securities.
Type: Application
Filed: Jan 25, 2022
Publication Date: Apr 20, 2023
Inventors: Philip Frederik Sommer (New York, NY), Stefano Pasquali (New York, NY), Jerinsh Jeyapaulraj (Harrison, NJ), Yu-Li Chu (New York, NY)
Application Number: 17/583,917