METHODS AND APPARATUS TO IMPROVE DECISION TREE EXECUTION
Methods, apparatus, systems and articles of manufacture are disclosed to improve decision tree execution. An example method includes retrieving, with a processor, a decision tree logic expression in a sum-of-products (SOP) form, the decision tree logic expression consuming a first duration to evaluate a data set, eliminating, with the processor, redundant variables of the decision tree logic expression by transforming the decision tree logic expression into a product-of-sums (POS) form, and evaluating, with the processor, the data set with the decision tree logic expression in the POS form, the decision tree logic expression in the POS form consuming a second duration to evaluate the data set that is less than the first duration.
This patent claims the benefit of, and priority to, U.S. Provisional Application Ser. No. 62/140,005, entitled “Surrogate Splitting Optimization” and filed on Mar. 30, 2015, which is hereby incorporated herein by reference in its entirety.
FIELD OF THE DISCLOSURE

This disclosure relates generally to data analysis, and, more particularly, to methods and apparatus to improve decision tree execution.
BACKGROUND

In recent years, social media applications and panelist media consumption behavior applications have grown to produce relatively large quantities of data. Attempts to analyze the data from such applications are aided by decision trees, which include test nodes that analyze a particular variable from the data for a conditional result. For example, if an analyst attempts to obtain a subset of the data related to males less than twenty-five years old, then a first node may analyze the data with a male/female test node to test an inequality expression.
When the male/female test node evaluation results in “TRUE,” then the corresponding data points are associated with male panelists. Additionally, the resulting subset of data associated with the male panelists may further proceed to an age test node that tests data points for an inequality of “Less Than 25.” When the age test node results in a true statement, then that resulting subset of data points is associated with individuals (e.g., panelists) that are both male and less than twenty-five years old.
Decision trees may be used with data analysis related to any number and/or type of application, such as set-top-box data analysis, online campaign ratings data analysis and/or social media data analysis. Decision trees include methodologies, techniques and/or algorithms to create nested binary splits on one variable at a time, in which such methodologies repeat for two or more branches until a stopping criterion is met. Decision trees include nodes that, when evaluated with test criteria, result in leaf nodes (sometimes referred to herein as leaves or sub-nodes). Tests at the decision nodes may include a comparison to a threshold value to determine a result of an inequality (e.g., greater than, less than, equal to, etc.). Each sub-node reflects a result of the tree split after the test, including an associated quantity of data and/or subset of data points. For example, an originating data set having a first (originating) node to test for male or female individuals may split with approximately half of the original data points residing in each of two leaf nodes (e.g., one resulting leaf node for males and one resulting leaf node for females).
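The nested binary splits described above can be sketched as follows. This is a minimal illustration; the panelist records and test nodes are invented for this example and are not part of the disclosure:

```python
# Minimal sketch of nested binary decision tree splits (hypothetical data).
panelists = [
    {"gender": "male", "age": 22},
    {"gender": "male", "age": 30},
    {"gender": "female", "age": 19},
]

# First (originating) test node: male/female split.
males = [p for p in panelists if p["gender"] == "male"]

# Subsequent test node on the male leaf: inequality test "Less Than 25".
young_males = [p for p in males if p["age"] < 25]

print(len(young_males))  # one panelist is both male and under twenty-five
```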
The data tested by decision trees may be associated with individuals, such as users of an application, members of a demographically designed panelist group, subscribers to a service, etc. In some examples, an individual user may be associated with any number of variables, such as an age, a gender, a marital status, a number of children, an income value, etc. As such, a decision tree may reference one or more variables when propagating from an originating node through one or more sub-nodes to develop conclusions associated with an originating data set. In some examples, the decision tree includes an objective, such as determining an age or average age of a group of people that have viewed an advertisement. Particular variables associated with each user are evaluated in the decision tree such that a final node reflects the average age of people who have viewed the advertisement. However, in some examples the originating data set includes missing data for one or more variables. In the event a decision tree cannot continue to propagate beyond missing data and/or data that cannot be tested as either greater than, less than or equal to a threshold value, then a regression towards a mean-value may occur, resulting in erroneous conclusions and/or results.
To avoid problems related to tree propagation when missing and/or non-testable variables occur in the originating data set, traditional decision tree techniques provide a surrogate splitting functionality. In circumstances where a splitting variable (e.g., a primary splitting variable) is not provided, not testable, NULL, and/or otherwise unavailable, a surrogate splitting variable is used to mimic a behavior of the primary splitting variable. During tree propagation, the traditional decision tree technique evaluates the surrogate splitting variable to determine a particular surrogate value to be used during propagation, thereby allowing one or more subsequent nodes to be tested during evaluation of an originating data set. The surrogate splitting variable used by the traditional decision tree technique provides an improved propagation robustness, but increases an amount of time required to propagate through the decision tree. In examples where a relatively large number of users is included in the originating data set (e.g., analyzing Facebook® users, analyzing Tencent® instant messaging service users, etc.), delays in tree propagation may become problematic when attempting to produce real-time or near real-time results/conclusions related to the originating data set. In other words, as each surrogate value is encountered by the decision tree, a corresponding processing burden occurs that can, in the aggregate, dramatically increase tree propagation time.
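The surrogate splitting behavior described above can be sketched as follows. The variable names and the choice of surrogate are invented for illustration; a real first pass tree engine would select the surrogate that best mimics the primary split:

```python
# Sketch of surrogate splitting: when the primary splitting variable is
# unavailable, a surrogate variable mimics the primary split (names invented).
def split_left(record):
    age = record.get("age")  # primary splitting variable
    if age is not None:
        return age < 25
    # Surrogate: in this hypothetical data set, school enrollment correlates
    # with being under twenty-five, so it stands in when "age" is missing.
    return record.get("in_school", False)

print(split_left({"age": 22}))          # primary test propagates normally
print(split_left({"in_school": True}))  # surrogate test allows propagation
```

Note that the surrogate lookup occurs at propagation time, which is the per-record processing burden the examples disclosed herein seek to move ahead of runtime.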
As described above, traditional decision tree analysis/propagation techniques evaluate NaN variables at runtime as they occur in a decision tree sequence. As such, the NaN variable must be evaluated to determine a substitute value suitable to allow tree propagation to continue. However, even after a first NaN variable occurs during tree propagation, that same NaN variable may need to be re-computed in a separate and/or subsequent portion of the same or one or more different trees. To illustrate,
Examples disclosed herein increase a computation speed of a decision tree by modifying tree logic in a manner that reduces a number of computations while maintaining the truth outcome of the original tree. As described in further detail below, examples disclosed herein (a) simplify decision tree Boolean algebra, (b) precompute NaN variables to reduce in-line processing during tree propagation, (c) rearrange branch execution order, and (d) verify completeness of non-NaN variables.
In operation, the example decision tree interface 204 retrieves, receives and/or otherwise obtains one or more code blocks associated with decision tree logic, such as the example original code block 100 of
Continuing the example above, in the event an analyst associated with the market service also wants to know an average age of only those subscribers that observed the advertisement, then further subsequent branches of decision tree logic may perform inequality tests for each subscriber to determine which resulting node they belong in. For instance, one test node may identify subgroups of subscribers based on a test of the age thirty-four, which splits the group into a subgroup of subscribers younger than thirty-four years of age, and a subgroup of subscribers equal to or older than thirty-four years of age. Additionally, for the subgroup younger than thirty-four years of age, a further subsequent test node may check for those subscribers younger than thirteen years of age, which results in one subgroup of subscribers younger than thirteen and one subgroup of subscribers older than or equal to thirteen (but less than thirty-four years of age because the source data for that respective branch only included subscribers less than thirty-four years of age).
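The nested age splits above can be sketched as follows, using invented subscriber ages:

```python
# Sketch of the nested age splits described above (ages invented).
ages = [8, 15, 22, 40, 55]

# Test node at age thirty-four splits the group into two subgroups.
under_34 = [a for a in ages if a < 34]
at_least_34 = [a for a in ages if a >= 34]

# A further test node splits the under-34 subgroup at age thirteen.
under_13 = [a for a in under_34 if a < 13]
from_13_to_33 = [a for a in under_34 if a >= 13]

print(under_13, from_13_to_33)  # [8] [15, 22]
```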
As the example decision tree propagates, such as the trees associated with the example original code block 100 of
While traditional decision trees and corresponding native code blocks (e.g., the example original code block 100 of
To illustrate,
One benefit of expressing decision trees in the POS form is that terms can be Boolean factorized to yield a single evaluation of any inequality/test for one or more trees. Code blocks transformed into the POS form are retrieved and/or otherwise received by the example tree factorization engine 208 to identify one or more variables that can be factored out of respective decision trees. The example tree factorization engine 208 identifies, for example, a variable that is common to two or more sub-expressions of a POS transformed decision tree. Because the POS transformed decision tree includes sub-expressions that are logically ANDed with other sub-expressions, any common variables identified by the example tree factorization engine 208 may be factored out to simplify the overall decision tree expression. As such, the example tree factorization engine 208 generates a simplified version of the expression.
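The factorization above can be sketched as follows. The variables A1, A2 and A3 are invented for illustration; the sketch verifies that the factored form preserves the truth outcome of the original expression while evaluating the common variable once:

```python
from itertools import product

# SOP form: (A1 AND A2) OR (A1 AND A3) evaluates A1 in both product terms.
def sop(a1, a2, a3):
    return (a1 and a2) or (a1 and a3)

# Factored POS form: A1 AND (A2 OR A3) evaluates the common variable A1 once.
def pos(a1, a2, a3):
    return a1 and (a2 or a3)

# Both forms produce the same truth outcome for every input combination.
assert all(sop(*bits) == pos(*bits) for bits in product([False, True], repeat=3))
print("truth outcome preserved")
```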
As described above, evaluation of NaN variables during tree propagation imposes a corresponding computational burden and slows a speed at which the tree can propagate. The example NaN computation engine 212 evaluates retrieved and/or otherwise received code blocks (e.g., code blocks that have been transformed from SOP to POS format, code blocks that have been factorized, original code blocks) prior to tree propagation to identify surrogate/NaN variables that could cause such computational burdens. If the example NaN computation engine 212 identifies a NaN variable, that NaN variable is corrected based on the context of the variable type. In some examples, a missing variable value is replaced with a default value, while in other examples a floating point value is rounded up or down when the variable type is integer (e.g., 12.3 is rounded down to 12).
Additionally, the example NaN computation engine 212 evaluates the context in which the NaN variable is used so that it can be converted into a binary value rather than a value to be used in a relatively more computationally demanding inequality test. For example, if the example NaN variable having a corresponding value of 12 (e.g., after rounding down from 12.3 to 12) is to be used in a decision node that tests for an age greater than twenty-one, the example NaN computation engine 212 converts the NaN variable to a Boolean FALSE value. As such, during tree propagation a Boolean test may occur, which is computationally faster than an inequality test on the value 12.
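The correction and Boolean conversion described above can be sketched as follows, using the example's threshold of twenty-one and a floating point value of 12.3 (both drawn from the text):

```python
# Sketch of precomputing a surrogate (NaN) variable into a Boolean value
# before tree propagation, per the example above.
THRESHOLD = 21            # decision node context: tests for age > 21

raw_value = 12.3          # floating point value where an integer is required
corrected = int(raw_value)  # rounded down to 12 per the variable type

# Convert once, ahead of propagation: the inequality collapses to a Boolean,
# so propagation performs a cheap truth test instead of an inequality test.
precomputed = corrected > THRESHOLD
print(precomputed)  # False
```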
The example branch rearrangement engine 214 evaluates code blocks to identify an optimized order of decision tree propagation. In operation, the example branch rearrangement engine 214 invokes the example factorization engine 208 to select a code block of interest, such as a code block that has been transformed and factorized. The example branch counter 216 determines occurrence rates for each branch within the selected code block, and the example rank engine 218 ranks the branches (e.g., branch_1, branch_2, etc.) based on a frequency of occurrence. In some examples, decision tree propagation orders are prioritized based on corresponding rank values for branches containing NaN variables first, and then branches containing non-NaN variables (e.g., “regular” variables) thereafter. In other words, because the NaN variables have been converted into Boolean values, branches of decision trees that evaluate the NaN variables may execute relatively faster than regular variables that may require an inequality test. As such, overall code block execution speed may be improved by rearranging the branches associated with NaN variables first.
In some examples, NaN variable prioritization is not invoked and, instead, branches associated with the most frequently occurring variables (either NaN variables or regular variables) are prioritized to be evaluated prior to branches associated with relatively lower frequency of occurrence. For example, assuming that variable A2 is identified to occur more frequently than variable A1, then rearranging/swapping decision tree execution to force branches having variable A2 to be evaluated before branches having variable A1 results in fewer processing clock cycles overall.
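The frequency-based ranking above can be sketched as follows. The branch names and variable assignments are invented; the sketch counts variable occurrences and reorders branches so that the most frequent variable is evaluated first:

```python
from collections import Counter

# Sketch of branch rearrangement: each branch tests one variable (invented).
branches = [("branch_1", "A1"), ("branch_2", "A2"), ("branch_3", "A2")]

# Count how often each variable occurs across the branches.
counts = Counter(var for _, var in branches)

# Rank branches so those testing the most frequent variable run first.
ranked = sorted(branches, key=lambda b: counts[b[1]], reverse=True)

print([name for name, _ in ranked])  # branches testing A2 come first
```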
While an original code block received and/or otherwise retrieved by the example decision tree interface 204 includes identified instances of NaN variables, in some examples the first pass tree engine 222 of the example market service 220 may not have identified all such instances of NaN variables. To verify variable completeness, the example tree factorization engine 208 selects a code block of interest, and the example variable evaluator 210 identifies a regular variable (e.g., not designated as NaN) to determine whether the variable contains a valid value. In some examples, a valid value is determined based on whether the variable value is missing or is NULL. In some examples, a valid value is determined by performing a type check of the variable, such as reading a variable parameter that identifies a type (e.g., text, integer, float, etc.), and checking the value to verify it is consistent with the variable type. For example, if a variable type parameter reflects type INT and the value includes text (e.g., “green”), then the example variable evaluator 210 reclassifies the regular variable as a NaN. When one or more variables are reclassified as a NaN variable, the example NaN computation engine 212 applies a correction to the associated value, and redefines the variable as a Boolean type to improve execution speed during tree propagation, as described above.
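The completeness check above can be sketched as follows. The variable record structure is invented for illustration, including the “green” text value stored in an INT-typed variable from the example:

```python
# Sketch of the variable completeness check: a "regular" variable whose
# value is missing, NULL, or inconsistent with its declared type is
# reclassified as NaN (record structure invented for illustration).
def is_valid(variable):
    value, declared_type = variable["value"], variable["type"]
    if value is None:                        # missing or NULL value
        return False
    return isinstance(value, declared_type)  # type consistency check

age = {"value": "green", "type": int}        # text stored in an INT variable
if not is_valid(age):
    age["nan"] = True                        # reclassify as NaN for correction

print(age["nan"])  # True
```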
While an example manner of implementing the segmentation optimizer 202 of
Flowcharts representative of example machine readable instructions for implementing the segmentation optimizer 202 of
As mentioned above, the example processes of
The program 500 of
The example tree factorization engine 208 factorizes the example transformed code block to factor out duplicative variables that may be present in the decision tree expression (block 506). Results of factorization performed by the example tree factorization engine 208 may be stored for later access in the example optimizer storage 226. As described above, factoring out duplicative variables enables a corresponding reduction in computational operations that need to occur during propagation of the factorized expression. The example NaN computation engine 212 pre-computes NaN variables included in decision trees from the first pass tree engine 222 to enable further computational benefits and tree propagation speed improvements (block 508). Additional tree propagation speed improvements result when the example branch rearrangement engine 214 rearranges an order in which one or more trees is evaluated (block 510). While the example first pass tree engine 222 identified NaN variables present in the code blocks, the example variable evaluator 210 evaluates all remaining original variables to ensure that future tree propagation efforts are not slowed down and/or otherwise halted by an errant variable value (block 512). Code blocks that (a) have been transformed, (b) have been factorized, (c) have had their NaN variables pre-computed, and (d) have had all original variables verified for completeness, are released back to the example market service 220 so that analysis of data to be used with the code blocks can be performed.
In some examples, the decision tree interface 204 transmits and/or otherwise makes available the optimized code blocks to the example decision tree storage 224 of the example market service 220 (block 514). Because source data from the example market service 220 may be dynamic and/or otherwise frequently changing, the example segmentation optimizer 202 determines whether to re-analyze the code blocks and/or receives a request from the market service 220 to re-analyze the code blocks (block 516). Control then returns to block 502.
When the example NaN computation engine 212 identifies a NaN variable in the obtained code block of interest (block 604), the NaN computation engine 212 evaluates a value associated with the NaN variable to apply a correction value (block 608). As described above, the example NaN computation engine 212 evaluates the context in which the NaN variable is used to apply the correction value, which may include a rounding operation applied to a floating point value when an integer is required. In other examples, a correction value may include a default value, while in still other examples the correction value may include a replacement of text data with an integer, or vice-versa. Additionally, to further increase a speed of decision tree propagation, the corrected NaN value is evaluated in the context of an associated decision tree node to convert the NaN variable value to a Boolean statement (e.g., TRUE or FALSE, 1 or 0). If the context of the associated decision tree node is, for example, to establish whether an age value is above or below a threshold value (e.g., twenty-one years old), then the example NaN computation engine 212 replaces the integer value of the NaN variable with the Boolean statement based on whether the previous NaN variable value satisfies that threshold (block 610). Continuing with the example threshold value of twenty-one years of age, if the original NaN variable value was twelve, then the example NaN computation engine 212 assigns a Boolean statement of FALSE to the NaN variable. As such, when tree propagation occurs at a later time, fewer computational resources are required to resolve a truth test of that NaN variable because simple binary logic tests of TRUE or FALSE consume fewer computational resources as compared with logical tests of inequalities. In still other examples, in the event a decision tree does not include surrogate variables, optimization analysis can be avoided and the original decision trees may be used.
For example, if a regular variable includes a NULL value or is missing, then any future attempts to evaluate that variable during tree propagation will fail. If the example variable evaluator 210 identifies a variable value as invalid (block 806), then the variable is reclassified as NaN, and control returns to block 608 of
The processor platform 900 of the illustrated example includes a processor 912. The processor 912 of the illustrated example is hardware. For example, the processor 912 can be implemented by one or more integrated circuits, logic circuits, microprocessors or controllers from any desired family or manufacturer.
The processor 912 of the illustrated example includes a local memory 913 (e.g., a cache). The processor 912 of the illustrated example is in communication with a main memory including a volatile memory 914 and a non-volatile memory 916 via a bus 918. The volatile memory 914 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory device. The non-volatile memory 916 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 914, 916 is controlled by a memory controller.
The processor platform 900 of the illustrated example also includes an interface circuit 920. The interface circuit 920 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), and/or a PCI express interface.
In the illustrated example, one or more input devices 922 are connected to the interface circuit 920. The input device(s) 922 permit(s) a user to enter data and commands into the processor 912. The input device(s) can be implemented by, for example, a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.
One or more output devices 924 are also connected to the interface circuit 920 of the illustrated example. The output devices 924 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display, a cathode ray tube display (CRT), a touchscreen, a tactile output device, a printer and/or speakers). The interface circuit 920 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip or a graphics driver processor.
The interface circuit 920 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem and/or network interface card to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 926 (e.g., an Ethernet connection, a digital subscriber line (DSL), a telephone line, coaxial cable, a cellular telephone system, etc.).
The processor platform 900 of the illustrated example also includes one or more mass storage devices 928 for storing software and/or data. Examples of such mass storage devices 928 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, RAID systems, and digital versatile disk (DVD) drives.
The coded instructions 932 of
From the foregoing, it will be appreciated that the above disclosed methods, apparatus and articles of manufacture reduce computational burdens associated with decision tree evaluation, particularly when relatively large data sets are used with the decision trees. Examples disclosed herein identify decision tree expressions that lack an optimized form and, when identified, convert such expressions into a form that permits reduction of duplicative variable computation. Additionally, examples disclosed herein permit expression factorization to eliminate additional duplicative variables that may occur in a conjunctive normal form. While traditional decision trees may identify surrogate variables when values associated with those variables would, if uncorrected, halt further propagation of decision trees, examples disclosed herein further analyze the surrogate variables to increase tree propagation speed by converting them to binary true/false values. Moreover, examples disclosed herein apply such conversions to the surrogate variables prior to decision tree propagation so that propagation runtime is not halted and/or otherwise slowed down to process computationally intensive inequalities on the fly.
Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.
Claims
1.-27. (canceled)
28. A system to reduce decision tree evaluation time, the system comprising:
- means for retrieving a decision tree logic expression in a sum-of-products (SOP) form, the SOP form including first subexpressions having a logical OR relationship, the decision tree logic expression consuming a first duration to evaluate a data set; and
- means for transforming to eliminate redundant variables of the decision tree logic expression by transforming the decision tree logic expression into a product-of-sums (POS) form, the POS form including second subexpressions having a logical AND relationship, the transforming means to enable the evaluation of the data set with the decision tree logic expression in the POS form, the decision tree logic expression in the POS form consuming a second duration to evaluate the data set that is less than the first duration.
29. The system as defined in claim 28, further including first means for identifying to identify a first variable that is common to at least two of the first subexpressions.
30. The system as defined in claim 29, wherein the first identifying means is to generate a simplified version of the transformed decision tree logic expression that factors out the first variable common to the at least two of the first subexpressions.
31. The system as defined in claim 28, further including second means for identifying to identify surrogate variables associated with the transformed decision tree logic expression, the surrogate variables including a non-Boolean value.
32. The system as defined in claim 31, wherein the second identifying means is to replace the non-Boolean value with a correction value when the non-Boolean value cannot be evaluated with an inequality test.
33. The system as defined in claim 32, wherein the inequality test includes at least one of a greater-than inequality test, a less-than inequality test, or an equal test.
34. The system as defined in claim 32, wherein the inequality test is associated with a threshold value.
35. The system as defined in claim 32, wherein the non-Boolean value is at least one of NULL or missing.
36. The system as defined in claim 32, wherein the second identifying means is to reduce an evaluation computation burden on a processor by assigning the correction value as a binary value.
37. The system as defined in claim 28, further including means for verifying to verify non-surrogate variables based on examining a value corresponding to a respective variable associated with the transformed decision tree logic expression.
38. The system as defined in claim 37, wherein the verifying means is to examine a value corresponding to a respective variable associated with the transformed decision tree logic expression by determining whether a value associated with a variable is at least one of NULL or missing.
39. The system as defined in claim 28, further including means for rearranging to rearrange branches in the decision tree logic expression using a ranking of respective branches in the decision tree logic expression.
40. The system as defined in claim 39, further including means for ranking to rank branches in the decision tree logic expression using a frequency of occurrence for respective branches in the decision tree logic expression.
41. The system as defined in claim 40, wherein the ranking means is to prioritize respective branches containing surrogate variables.
42. The system as defined in claim 39, wherein the ranking means is to rank branches in the decision tree logic expression using a frequency of occurrence for respective variables in the decision tree logic expression.
43. The system as defined in claim 42, wherein the ranking means is to prioritize respective branches with variables that have a higher frequency of occurrence in the decision tree logic expression.
Type: Application
Filed: Aug 30, 2019
Publication Date: Dec 19, 2019
Inventors: Jonathan Sullivan (Natick, MA), Michael Sheppard (Brooklyn, NY), Peter Lipa (Tucson, AZ)
Application Number: 16/557,541