ENSEMBLE CLASSIFICATION ALGORITHMS HAVING SUBCLASS RESOLUTION
Ensemble classification algorithms having subclass resolution are disclosed. An example disclosed apparatus includes a fingerprint generator to generate a fingerprint of class probabilities of each of a plurality of samples, a distribution creator to create a distribution of the samples based on the generated fingerprints, and a distribution applicator to apply the distribution to a population to predict sub-class probabilities of each of the population.
This patent arises from a continuation of U.S. patent application Ser. No. 15/387,302, (now U.S. Pat. No. 11,250,339) which was filed on Dec. 21, 2016, which claims the benefit of U.S. Provisional Application 62/353,341, which was filed Jun. 22, 2016. U.S. patent application Ser. No. 15/387,302 and U.S. Provisional Patent Application No. 62/353,341 are hereby incorporated herein by reference in their entireties. Priority to U.S. patent application Ser. No. 15/387,302 and U.S. Provisional Patent Application No. 62/353,341 is hereby claimed.
FIELD OF THE DISCLOSURE
This disclosure relates generally to classification algorithms, and, more particularly, to ensemble classification algorithms having subclass resolution.
BACKGROUND
In recent years, classifying subjects/samples (e.g., persons) to determine their demographics and/or sub-classifications (e.g., integer ages) has typically been accomplished using binary decision trees with tree nodes that end in terminal nodes. In particular, these tree nodes are used to sub-classify groups of people into demographic categories and/or sub-groups. However, these tree nodes can separate and/or isolate characteristics incorrectly, thereby leading to inaccurate sub-classifications. Further, binary decision trees can require significant computational resources and/or time to generate.
The figures are not to scale. Instead, to clarify multiple layers and regions, the thickness of the layers may be enlarged in the drawings. Wherever possible, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.
DETAILED DESCRIPTION
Ensemble classification algorithms having subclass resolution are disclosed. Typically, in known examples, to classify a group of subjects (e.g., persons) by sub-categories (e.g., integer age, interests, affiliations, etc.), binary decision trees are used to determine the sub-categories of primary class categories (e.g., an age range, an age group, etc.). In such known examples, the group of subjects represented as input data is recursively split until a terminal node is reached to provide subclass determinations. These binary decision trees are susceptible to inaccuracies resulting from divisions at their branching nodes. In particular, for known binary decision trees, primary class (e.g., an age group) prediction has been shown to be effective. However, accurate subclass (e.g., an integer age) determination and/or subclass probabilities have been difficult to achieve. Additionally, utilization of branching decision trees/nodes may require significant computational capabilities as well as significant computing time.
Methods and apparatus disclosed herein leverage ensemble methods while enabling very accurate and computationally efficient determinations and/or probabilistic determinations of category subclasses of a population (e.g., an unknown population). In particular, the examples disclosed herein utilize fingerprint generation, distribution creation and distribution application to the population to predict probabilities of attributes and/or subclass categories of the population. According to examples disclosed herein, the fingerprint generation may be performed by calculating primary subclass probabilities of primary classes (e.g., an age group) of a group of samples. According to the examples disclosed herein, a distribution and/or a probabilistic distribution of sub-classes is created based on the generated fingerprint. In some examples, distribution creation occurs by arranging each of the samples and removing samples whose predicted primary class (e.g., age range) does not match their actual primary class. In some examples, once a distribution has been created, the distribution is compared to calculated fingerprints of individuals of a population to predict subclass probabilities and/or subclass probability distributions for at least a portion (e.g., all) of the population.
To determine sub-class probability distributions (e.g., integer age distributions, interest probabilities, affiliation probabilities, etc.) of a group/population of entities (e.g., persons, groups of people, an unknown group of entities, etc.), the data collector 102 of the illustrated example receives data from the devices 114 and/or the collection facility 116 via the network 112. For example, the data collector 102 may extract and/or be provided with data pertaining to users (e.g., age information, interest information, affiliation and/or other demographic information) of the respective devices 114. In this example, this data, which may include known samples and/or control data samples, is provided to the fingerprint generator 106, which generates a primary class fingerprint for each sample of the data received via the network 112. The primary class fingerprint may be a number (e.g., an integer) representing a characteristic and/or category of each sample. In some examples, the primary class fingerprint may represent an interest or affiliation.
In turn, the example distribution creator 108 generates a probability distribution of sub-classes based on the data. The example distribution applicator 110 then compares the generated probability distribution to the group/population to determine a probability distribution for each member of the group/population.
In some examples, the distribution is generated or created by removing samples that have different actual primary classes from predicted primary classes to create a more accurate fingerprint and/or collection of fingerprints and/or resultant probability distributions. In some examples, the samples are arranged and/or sorted in an array to generate the distribution. In some examples, the array is pivoted to define a sub-class probability distribution based on attributes and/or classifications of the samples.
To generate a fingerprint for each sample of a collection of samples that are based on representative sample data (e.g., a known collection/population of samples, a verified population of samples, a sample group, etc.), the fingerprint generator 106 of the illustrated example calculates primary class probabilities based on samples (e.g., a fingerprint trained ensemble) received at the fingerprint generator 106 from the data collector 102.
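The text does not specify how the ensemble turns its members' outputs into primary class probabilities. The sketch below assumes one common approach, in which each ensemble member casts a vote for a class number and the probability of a class is its share of the votes. The function name `primary_class_probabilities`, the vote representation, and the 1-based class numbering are all illustrative assumptions rather than details taken from the disclosure.

```python
from collections import Counter


def primary_class_probabilities(votes, n_classes):
    """Estimate primary class probabilities for one sample.

    ASSUMPTION: `votes` holds one predicted class number (1..n_classes)
    per ensemble member, and a class's probability is its vote share.
    """
    if not votes:
        raise ValueError("at least one ensemble vote is required")
    counts = Counter(votes)
    # Position k-1 of the result holds the probability of class number k.
    return [counts.get(k, 0) / len(votes) for k in range(1, n_classes + 1)]


if __name__ == "__main__":
    # Five hypothetical ensemble members voting among three age groups.
    print(primary_class_probabilities([2, 2, 3, 2, 1], n_classes=3))
```

Any other method that yields one probability per primary class (e.g., averaging per-member class probabilities) could be substituted without changing the later steps.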
According to the illustrated example, the distribution creator 108 creates a distribution based on each sample of the representative sample data. In particular, the example distribution creator 108 and/or the probability calculator 202 generates the calculated distribution 214, which is represented by a table and/or array that relates each of the samples to its fingerprint and to subclass probabilities. To determine and/or characterize the distribution of an unknown sample (e.g., an unknown sample of individuals, an unknown population group, etc.), the example distribution applicator 110 uses the aforementioned calculated distribution 214 to generate a probability distribution for each of the unknown samples. In some examples, the distribution organizer/pivoter 204 pivots the table to generate the probability distribution. In this example, once the distribution applicator 110 calculates a probability distribution for each of the unknown samples, the distribution applicator 110 forwards the calculated probability distributions to the network 112 and/or the collection facility 116 shown in
While an example manner of implementing the example population data analyzer 104 is illustrated in
Flowcharts representative of example machine readable instructions for implementing the population data analyzer 104 of
As mentioned above, the example processes of
The example method 300 of
According to the illustrated example of
In this example, a fingerprint of primary class probabilities is generated (block 304). The fingerprint generator 106 of the illustrated example utilizes the known samples to generate primary class probabilities which, in turn, are used to generate a fingerprint corresponding to each sample of the known samples. In particular, primary class (e.g., age range group) probabilities, p, are determined for each sample. As a result, a fingerprint, f, corresponding to each sample is calculated by Equation 1 below:
$$f = \sum_{i=1}^{c} r_i \cdot n^{(c-i)} \qquad (1)$$
where r is the class number (e.g., numbered based on a ranked order of the respective probabilities, p), n is a total number of classes and c is a configurable parameter to allow for a bias variance trade-off. In particular, the configurable parameter, c, can be used to reduce and/or eliminate errors related to statistical bias and/or variance. In some examples, the probabilities, p, for each class number, r, are sorted (e.g., r_1 is the class number of the class with the highest probability while r_n is the class number of the class with the lowest probability). In this example, a fingerprint and/or a fingerprint value is calculated for each of the samples. However, in some examples, a fingerprint is calculated for a portion of the samples (e.g., based on sample reliability, relevance, significance and/or lack of correlation of the samples).
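As a minimal sketch of Equation 1, the following function ranks the class numbers by descending probability and sums the top c of them, each scaled by a power of n. Several details are assumptions made only for illustration: class numbers are taken to be 1-based positions in the probability list, ties are broken in favor of the lower class number, and c is restricted to the range 1 to n. The function name `fingerprint` is likewise hypothetical.

```python
def fingerprint(probabilities, c):
    """Compute f = sum_{i=1..c} r_i * n**(c - i) per Equation 1.

    ASSUMPTIONS: class numbers are 1-based list positions, ties go to the
    lower class number, and c must satisfy 1 <= c <= n.
    """
    n = len(probabilities)
    if not 1 <= c <= n:
        raise ValueError("c must be between 1 and the number of classes")
    # r_1 is the class number with the highest probability, r_n the lowest.
    ranked = sorted(range(1, n + 1), key=lambda k: (-probabilities[k - 1], k))
    return sum(r * n ** (c - i) for i, r in enumerate(ranked[:c], start=1))


if __name__ == "__main__":
    p = [0.1, 0.6, 0.3]  # probabilities for class numbers 1, 2 and 3
    # Ranked class numbers are 2, 3, 1, so with c = 2: 2*3**1 + 3*3**0 = 9.
    print(fingerprint(p, c=2))
```

In this sketch a smaller c keeps only the top-ranked classes, so more samples share a fingerprint, while a larger c distinguishes samples by their full ranking. This is one way to read the bias variance trade-off attributed to c above.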
Next, a distribution of samples based on the generated fingerprint is created (block 306). As mentioned above in connection with
A population is then retrieved (block 308). In particular, an unknown population of samples is retrieved and/or received by the distribution applicator 110 and/or the data collector 102 so that the unknown population can be characterized (e.g., characterized in regards to sub-class probabilities).
Next, the created distribution is applied to predict subclass probabilities of each of the population (block 310). As discussed in greater detail below in connection with
Next, it is determined whether additional unknown sample groups and/or individual samples are to be characterized using the predicted sub-class probabilities (block 312). This determination may be made based on the appropriateness of the predicted sub-class probabilities related to potential new unknown sample(s). For example, in the context of the potential new unknown sample(s), the known samples may be inappropriate, irrelevant and/or include different fingerprints and/or fingerprint values from the current unknown samples. If additional populations and/or unknown sample(s) are to be characterized (block 312), control of the process returns to block 308. Otherwise, control of the process proceeds to block 314.
It is then determined whether a new fingerprint is to be generated (block 314). This determination may occur based on a current relevance of the generated fingerprint and/or its corresponding sub-class probability distribution. Additionally or alternatively, this determination may be based on receiving a new unknown sample group with attributes that do not pertain to the current fingerprint and/or sub-class probability distribution. If a new fingerprint is to be generated, control of the process returns to block 302. Otherwise, the process ends.
Turning to
First, the sample data is parsed (block 402). For example, data related to the known samples is organized and/or revised, as appropriate, by the example distribution creator 108 so that this data can be processed and/or organized. In some examples, the sample data is re-organized and/or reformatted to be placed in an array.
Next, according to the illustrated example, the distribution creator 108, the probability calculator 202 and/or the distribution organizer/pivoter 204 organizes, reformats and/or arranges individual samples of the known samples in an array and/or table (block 404). To illustrate generating this table, referring briefly to
According to the illustrated example, the distribution creator 108 compares a true (e.g., known) primary class with a predicted primary class for each sample of the known samples (e.g., samples of a known and/or verified group) (block 406).
In some examples, a weighting factor is assigned to the samples of the array (block 407). For example, this weighting can be used to provide greater significance to some of the individual samples (e.g., higher reliability samples, etc.) relative to the others to be used in probability distribution generation.
In this example, it is then determined if there are rows of the array in which the true primary class does not match the corresponding predicted primary class (block 408). If there are rows with such a mismatch (block 408), control of the process proceeds to block 410. Otherwise, the process proceeds to block 412.
In some examples, if there are rows with mismatches, the rows are removed from the array (block 410) and the process proceeds to block 412. In particular, individual samples in which a corresponding true primary class does not match a predicted primary class are removed from the array or table. Briefly referring to
Next, according to the illustrated example, a distribution array is generated (block 412) and the process ends/returns. In this example, the distribution array is generated by the example distribution organizer/pivoter 204 and/or the probability calculator 202 pivoting the table and/or array. For example, as shown in
In some examples, this calculated distribution is stored in the example storage 212 as the calculated distribution 214. In some examples, at least a portion of the individual samples of the known samples are weighted (e.g., some samples have a higher weight in predicting the subclass probabilities, weighted samples) to generate the distribution array. In particular, the samples with higher reliabilities (e.g., samples from more reliable sources) may carry a higher weight in the probability calculations.
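The distribution creation steps described above (arranging samples in rows, removing rows whose true and predicted primary classes differ, optionally weighting samples, and pivoting by fingerprint) can be sketched as follows. The row layout and the field names `fingerprint`, `true_class`, `predicted_class`, `subclass` and `weight` are hypothetical, as is the choice to represent the pivoted table as a nested dictionary. Weights are assumed to be positive and default to 1.0.

```python
from collections import defaultdict


def create_distribution(rows):
    """Pivot known samples into {fingerprint: {subclass: probability}}.

    ASSUMPTION: each row is a dict with hypothetical keys 'fingerprint',
    'true_class', 'predicted_class', 'subclass' and optional 'weight' (> 0).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        # Remove rows in which the true and predicted primary classes differ.
        if row["true_class"] != row["predicted_class"]:
            continue
        # Accumulate the (weighted) count of each subclass per fingerprint.
        totals[row["fingerprint"]][row["subclass"]] += row.get("weight", 1.0)
    distribution = {}
    for fp, counts in totals.items():
        total = sum(counts.values())
        # Normalize so the subclass probabilities for a fingerprint sum to 1.
        distribution[fp] = {sub: w / total for sub, w in counts.items()}
    return distribution


if __name__ == "__main__":
    known = [
        {"fingerprint": 9, "true_class": 2, "predicted_class": 2, "subclass": 21},
        {"fingerprint": 9, "true_class": 2, "predicted_class": 2, "subclass": 22},
        {"fingerprint": 9, "true_class": 2, "predicted_class": 2, "subclass": 21},
        {"fingerprint": 9, "true_class": 2, "predicted_class": 3, "subclass": 23},
    ]
    print(create_distribution(known))
```

Here the fourth row is dropped because its predicted primary class differs from its true primary class, so subclass 23 never enters the distribution for fingerprint 9.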
Turning to
In this example, the distribution applicator 110 calculates primary class probabilities of the unknown population (block 502).
Next, the distribution applicator 110 uses the calculated primary class probabilities to create a fingerprint value for each member of the unknown population (block 504). In this example, the distribution applicator 110 and/or the fingerprint generator 106 calculates fingerprint values of each of the unknown samples. In particular, these fingerprint values may be calculated using Equation 1 described above in connection with
Next, the distribution applicator 110 compares the calculated fingerprints to the generated distribution array (block 506) and the process ends/returns. In particular, the fingerprints are compared to the calculated distribution 214 to determine sub-class probabilities of each of the unknown samples. In some examples, sub-classes and/or sub-class designations are assigned to each of the unknown samples based on the predicted sub-class probabilities. Additionally or alternatively, a subclass can be randomly assigned to individual sample(s) of the unknown samples based on the calculated distribution 214.
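The comparison and assignment steps of block 506 might look like the sketch below, which assumes the calculated distribution is held as a nested dictionary keyed by fingerprint value. The function names, the dictionary layout, and the handling of fingerprints that never appeared among the known samples (an empty result and no assignment) are illustrative assumptions; the text does not say how unseen fingerprints are treated.

```python
import random


def predict_subclass_probabilities(fingerprints, distribution):
    """Look up sub-class probabilities for each unknown sample's fingerprint.

    ASSUMPTION: `distribution` maps fingerprint -> {subclass: probability};
    fingerprints absent from it yield an empty dict.
    """
    return [distribution.get(fp, {}) for fp in fingerprints]


def assign_subclass(probabilities, rng=random):
    """Randomly assign a subclass in proportion to its predicted probability."""
    if not probabilities:
        return None  # nothing to draw from for an unseen fingerprint
    subclasses = list(probabilities)
    weights = [probabilities[s] for s in subclasses]
    return rng.choices(subclasses, weights=weights, k=1)[0]


if __name__ == "__main__":
    distribution = {9: {21: 0.75, 22: 0.25}, 5: {35: 1.0}}
    predicted = predict_subclass_probabilities([9, 5, 7], distribution)
    rng = random.Random(0)  # seeded only to make the demonstration repeatable
    print([assign_subclass(p, rng) for p in predicted])
```

Drawing a subclass at random, rather than always taking the most probable one, keeps the aggregate subclass mix of the assigned population close to the calculated distribution.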
The processor platform 700 of the illustrated example includes a processor 712. The processor 712 of the illustrated example is hardware. For example, the processor 712 can be implemented by one or more integrated circuits, logic circuits, microprocessors or controllers from any desired family or manufacturer.
The processor 712 of the illustrated example includes a local memory 713 (e.g., a cache). The example processor 712 also includes the example fingerprint generator 106, the example distribution creator 108, the example distribution applicator 110, the example probability calculator 202 and the example distribution organizer/pivoter 204. The processor 712 of the illustrated example is in communication with a main memory including a volatile memory 714 and a non-volatile memory 716 via a bus 718. The volatile memory 714 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory device. The non-volatile memory 716 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 714, 716 is controlled by a memory controller.
The processor platform 700 of the illustrated example also includes an interface circuit 720. The interface circuit 720 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), and/or a PCI express interface.
In the illustrated example, one or more input devices 722 are connected to the interface circuit 720. The input device(s) 722 permit(s) a user to enter data and commands into the processor 712. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.
One or more output devices 724 are also connected to the interface circuit 720 of the illustrated example. The output devices 724 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display, a cathode ray tube display (CRT), a touchscreen, a tactile output device, a printer and/or speakers). The interface circuit 720 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip or a graphics driver processor.
The interface circuit 720 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem and/or network interface card to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 726 (e.g., an Ethernet connection, a digital subscriber line (DSL), a telephone line, coaxial cable, a cellular telephone system, etc.).
The processor platform 700 of the illustrated example also includes one or more mass storage devices 728 for storing software and/or data. Examples of such mass storage devices 728 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, RAID systems, and digital versatile disk (DVD) drives.
The coded instructions 732 of
From the foregoing, it will be appreciated that the above disclosed methods and apparatus provide an effective and computationally efficient manner of calculating subclass probabilities. For example, the examples disclosed herein may utilize probability generation instead of binary trees, which require numerous levels of division before a terminal node is reached, thereby improving computational efficiency.
This application claims the benefit under 35 U.S.C. § 119(e) to U.S. Provisional Application 62/353,341 titled “Subclass Resolution for Ensemble Classification Algorithms,” filed Jun. 22, 2016, which is incorporated herein by this reference in its entirety.
Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.
Claims
1. An apparatus, comprising:
- at least one memory;
- instructions; and
- processor circuitry to execute the instructions to: generate fingerprints of primary class probabilities of known samples, the known samples associated with survey data; remove a first portion of the known samples in which a predicted primary class is different from a known primary class to define a second portion of the known samples in which a predicted primary class is identical to a known primary class; create a distribution of predicted subclass probabilities of the second portion of the known samples, the distribution of predicted subclass probabilities based on the generated fingerprints; and apply the distribution of the predicted subclass probabilities to unknown samples to determine a sub-class of ones of the unknown samples.
2. The apparatus of claim 1, wherein the processor circuitry is to execute the instructions to generate the fingerprints based on class numbers that are ranked according to the respective primary class probabilities.
3. The apparatus as defined in claim 2, wherein the processor circuitry is to execute the instructions to generate the fingerprints based on a configurable parameter that can reduce at least one of a statistical bias or variance of the fingerprint.
4. The apparatus as defined in claim 1, wherein ones of the fingerprints represent a characteristic, category, interest, or affiliation of a respective one of the known samples.
5. The apparatus as defined in claim 1, wherein the processor circuitry is to execute the instructions to remove a third portion of the known samples from the second portion of the known samples based on at least one of a reliability, a relevance, or a significance of the third portion of the known samples.
6. The apparatus as defined in claim 1, wherein the processor circuitry is to execute the instructions to create the distribution by pivoting a table based on at least one fingerprint value.
7. The apparatus as defined in claim 1, wherein the processor circuitry is to execute the instructions to generate respective ones of the fingerprints by arranging the known samples in an array.
8. The apparatus as defined in claim 7, wherein the processor circuitry is to execute the instructions to pivot the array based on attributes of the known samples.
9. The apparatus as defined in claim 7, wherein the processor circuitry is to execute the instructions to generate the distribution by weighting at least one of the known samples.
10. A method comprising:
- retrieving, by executing instructions with at least one processor, known samples associated with survey data from a database;
- generating, by executing instructions with the at least one processor, fingerprints of primary class probabilities of the known samples;
- removing, by executing instructions with the at least one processor, a first portion of the known samples in which a predicted primary class is different from a known primary class to define a second portion of the known samples in which a predicted primary class is identical to a known primary class;
- creating, by executing instructions with the at least one processor, a distribution of predicted subclass probabilities of the second portion of the known samples, the distribution of predicted subclass probabilities based on the generated fingerprints; and
- applying, by executing instructions with the at least one processor, the distribution of the predicted subclass probabilities to unknown samples to determine a sub-class of ones of the unknown samples.
11. The method as defined in claim 10, wherein the generating of the fingerprints is based on class numbers that are ranked according to the respective primary class probabilities.
12. The method as defined in claim 11, wherein the generating of the fingerprints is based on a configurable parameter that can reduce at least one of a statistical bias or variance of the fingerprint.
13. The method as defined in claim 10, wherein ones of the fingerprints represent a characteristic, category, interest, or affiliation of a respective one of the known samples.
14. The method as defined in claim 10, wherein the creating of the distribution includes pivoting a table based on at least one fingerprint.
15. The method as defined in claim 10, wherein the generating of the fingerprints includes arranging the known samples in an array.
16. The method as defined in claim 15, wherein the creating of the distribution of the subclass probabilities of the known samples includes weighting, by executing an instruction with the processor, at least one of the known samples.
17. A non-transitory machine readable medium comprising instructions, which when executed, cause at least one processor to at least:
- access known samples associated with survey data from a database via a network;
- generate fingerprints of primary class probabilities of the known samples;
- remove a first portion of the known samples in which a predicted primary class is different from a known primary class to define a second portion of the known samples in which a predicted primary class is identical to a known primary class;
- create a distribution of predicted subclass probabilities of the second portion of the known samples, the distribution of predicted subclass probabilities based on the generated fingerprints; and
- apply the distribution of the predicted subclass probabilities to unknown samples to determine a sub-class of ones of the unknown samples.
18. The machine readable medium as defined in claim 17, wherein the instructions, when executed, cause the at least one processor to generate the fingerprints based on class numbers that are ranked according to respective primary class probabilities.
19. The machine readable medium as defined in claim 17, wherein the instructions, when executed, cause the at least one processor to generate the fingerprints based on a configurable parameter that can reduce at least one of a statistical bias or variance of the fingerprint.
20. The machine readable medium as defined in claim 17, wherein ones of the fingerprints represent a characteristic, category, interest, or affiliation of a respective one of the known samples.
21. The machine readable medium as defined in claim 17, wherein the instructions, when executed, cause the at least one processor to create the distribution by pivoting a table based on at least one fingerprint.
22. The machine readable medium as defined in claim 17, wherein the instructions, when executed, cause the at least one processor to generate the fingerprints by arranging the known samples in an array.
23. The machine readable medium as defined in claim 22, wherein the instructions, when executed, cause the at least one processor to create the distribution based on weighted samples of the known samples.
Type: Application
Filed: Jan 31, 2022
Publication Date: May 19, 2022
Inventors: Jonathan Sullivan (Hurricane, UT), Evan Brydon (New York, NY)
Application Number: 17/589,691