Classifying Data Using Machine Learning
Techniques for data classification include receiving, at a local computing system, a query from a remote computing system, the query comprising data associated with a commodity, the data comprising one or more attributes of the commodity; matching the one or more attributes of the commodity with one or more terms of a plurality of terms in a word matrix that includes a plurality of nodes that each include a term of the plurality of terms and a plurality of links that each connect two or more nodes and define a similarity between the two or more nodes; generating, based on the matching, a numerical vector for the commodity; identifying, based on the numerical vector, one or more classification regions that each define a classification of the commodity; and preparing the classifications for display at the remote computing system.
This disclosure relates to classifying data and, more particularly, classifying data using an adaptive learning machine.
BACKGROUND
A piece of data can be classified by assigning the data into one or more categories of a given number of categories. For example, goods and services can be classified into categories that are represented by category codes by assigning data that represent the goods and services into one or more categories. To classify a good or service, a textual description of the good or service may be converted into a corresponding category code. In some instances, business enterprises may find it helpful to classify purchases by categories of goods and services. Moreover, some business enterprises may find it helpful for such goods and services to be rated, e.g., for quality or otherwise. Given the very large numbers of goods and services, and more so categories of such goods and services, determining which category a particular good or service falls into, if any, may be difficult.
SUMMARY
This disclosure describes systems, methods, apparatus, and computer-readable media for classifying data, such as data that represent commodities, using an adaptive learning machine including, for example, the features of receiving, at a local computing system, a query from a business enterprise computing system, the query including data associated with a business enterprise commodity, the data including one or more attributes of the business enterprise commodity; matching the one or more attributes of the business enterprise commodity with one or more terms of a plurality of terms in a word matrix, the word matrix including: a plurality of nodes that each include a term of the plurality of terms; and a plurality of links that each connect two or more nodes and define a similarity between the two or more nodes; generating, based on the matching, a numerical vector for the business enterprise commodity; identifying, based on the numerical vector, one or more classification regions that each define a classification of the business enterprise commodity; and preparing the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system.
A first aspect combinable with any of the general embodiments includes receiving a set of terms, each term of the set of terms is labeled with a correct classification; matching the one or more attributes of the business enterprise commodity with one or more terms of the set of terms; generating the numerical vector based on matching the one or more attributes of the business enterprise commodity with one or more terms of the plurality of terms in the word matrix and one or more terms of the set of terms; identifying one or more classifications of the business enterprise commodity based on the numerical vector; and preparing the one or more classifications of the business enterprise commodity for display at the business enterprise computing system.
A second aspect combinable with any of the previous aspects further includes prior to receiving the query from the business enterprise computing system, building the word matrix.
In a third aspect combinable with any of the previous aspects, building the word matrix includes searching for content associated with a plurality of business enterprise commodities; parsing the content into the plurality of terms to define the plurality of nodes; and applying a semantic proximity model to the plurality of terms to define the plurality of links, wherein a link that connects two or more nodes defines a semantic similarity between the two or more nodes.
In a fourth aspect combinable with any of the previous aspects, building the word matrix further includes applying a string similarity model to map a term of the plurality of terms into a similar term, wherein a link that connects two or more nodes defines a string similarity between the two or more nodes.
In a fifth aspect combinable with any of the previous aspects, the classification is defined by a first classification level in a plurality of classification levels defined in a classification hierarchy.
In a sixth aspect combinable with any of the previous aspects, the plurality of classification levels include a segment classification level, a family classification level, a class classification level, a commodity classification level, and a business function classification level.
In a seventh aspect combinable with any of the previous aspects, the classification includes the commodity classification level.
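As an illustration of the classification hierarchy in the fifth through seventh aspects, the following minimal sketch splits a hierarchical category code into one prefix per level. The level names come from the text above; the function name split_category_code, the two-digits-per-level layout, and the sample code value are assumptions made for illustration and are not specified in this disclosure.

```python
from typing import Dict

# Level names come from the text; the two-digits-per-level layout is an assumption.
LEVELS = ["segment", "family", "class", "commodity", "business_function"]


def split_category_code(code: str, width: int = 2) -> Dict[str, str]:
    """Split a hierarchical category code into one cumulative prefix per level."""
    if not code or len(code) % width != 0 or len(code) > width * len(LEVELS):
        raise ValueError(f"unexpected code length: {code!r}")
    n_levels = len(code) // width
    # Each level is identified by the prefix of the code up to that level.
    return {LEVELS[i]: code[: width * (i + 1)] for i in range(n_levels)}


# Illustrative code value only; it stops at the commodity level.
example = split_category_code("45121504")
```

Under these assumptions, a classification at the commodity classification level is simply the prefix that covers the first four levels of the code.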
An eighth aspect combinable with any of the previous aspects further includes transmitting the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system; receiving a selection of one of the classifications of the business enterprise commodity from the business enterprise computing system; and updating the word matrix based on the received selection.
In a ninth aspect combinable with any of the previous aspects, updating the word matrix based on the received selection includes creating a direct link between the nodes including terms matching the one or more attributes.
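The update described in the eighth and ninth aspects can be sketched as follows. This is a minimal illustration, assuming the word matrix is held as an adjacency mapping from each term to the terms it is directly linked to, and assuming that "a direct link between the nodes" means linking every pair of matched terms. The names WordMatrix and add_direct_links are hypothetical, and the sketch leaves out how the received selection itself is recorded.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set

# Assumed representation: term -> set of directly linked terms.
WordMatrix = Dict[str, Set[str]]


def add_direct_links(matrix: WordMatrix, matched_terms: Iterable[str]) -> None:
    """Create an undirected direct link between every pair of matched terms."""
    for a, b in combinations(sorted(set(matched_terms)), 2):
        matrix[a].add(b)
        matrix[b].add(a)


matrix: WordMatrix = defaultdict(set)
matrix["canon"].add("camera")
matrix["camera"].add("canon")

# Called once the client has confirmed one of the proposed classifications.
add_direct_links(matrix, ["canon", "eos", "slr"])
```

Existing links are preserved, so repeated confirmations only add links and never remove them in this sketch.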
A system of one or more computers can be configured to perform particular actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
Particular embodiments of the subject matter described in this disclosure can be implemented so as to realize none, one, or more of the following advantages. An adaptive learning machine for classifying data can convert textual descriptions into corresponding category codes where the textual descriptions are unstructured or contain very few terms, or the terms in the textual descriptions are abbreviated or misspelled. The adaptive learning machine may use an external corpus of text documents to augment the set of labeled data. The external corpus of text documents is used in an unsupervised way to derive string similarities and/or semantic similarities. These similarity models are then used by the adaptive learning machine together with labeled data to build classification models. The classification models can be extended for classifying data that is represented by textual descriptions in different languages by using a multi-language external corpus of text documents to derive word similarities. Moreover, the use of an un-labeled external corpus of text documents reduces the need for costly labeled data.
These general and specific aspects may be implemented using a device, system or method, or any combinations of devices, systems, or methods. The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
In a general embodiment of the present disclosure, an adaptive learning machine uses unsupervised learning and supervised learning to classify data. The data represents an entity, such as a person, place, thing, data record, word, or the like. For unsupervised learning, the adaptive learning machine uses an external corpus of text documents to build a word matrix. The word matrix is a term-to-term similarity matrix generated using semantic similarity and/or string similarity derived from an external corpus of text documents. The word matrix includes nodes and links. A node contains a term, and a link connects two or more nodes and defines a semantic proximity or a string similarity between the terms in the linked nodes. For supervised learning, the adaptive learning machine receives a set of terms that have been labeled with the correct classifications. The adaptive learning machine uses the word matrix, the set of labeled terms, and/or additional attributes associated with the data to classify the data.
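One possible in-memory form of such a word matrix is sketched below. It is an illustration only: the class name WordMatrix, its methods, and the numeric link weights are assumptions, and the sketch handles only links between exactly two nodes, whereas the disclosure allows a link to connect two or more nodes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WordMatrix:
    """Terms as nodes; weighted, undirected links between pairs of terms."""

    links: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def add_term(self, term: str) -> None:
        # A node exists as soon as its term has an entry, even with no links.
        self.links.setdefault(term, {})

    def add_link(self, a: str, b: str, similarity: float) -> None:
        # Store the link in both directions so lookups are symmetric.
        self.add_term(a)
        self.add_term(b)
        self.links[a][b] = similarity
        self.links[b][a] = similarity

    def neighbors(self, term: str) -> Dict[str, float]:
        return self.links.get(term, {})


wm = WordMatrix()
wm.add_link("image", "picture", 1.0)   # weights are illustrative
wm.add_link("picture", "camera", 0.5)
```

A nested dictionary keeps the term-to-term similarity lookup cheap while storing only the links that actually exist, which suits a sparse matrix over a large vocabulary.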
In some embodiments, the adaptive learning machine receives a query that includes data that describes, for example, attributes of a commodity, from a client and assigns the attributes of the commodity to one or more classifications using the word matrix, the set of labeled terms, and/or additional attributes associated with the commodity. The adaptive learning machine assigns the attributes of the commodity to one or more classifications by, for example, matching the attributes of the commodity with one or more terms in the word matrix or the set of labeled terms. Based on the matching, the adaptive learning machine determines a classification of the commodity. After the adaptive learning machine determines the classification, the adaptive learning machine prepares the classification of the commodity for display at the client.
Turning to the example implementation of
In general, the adaptive learning machine 102 may be a server that stores one or more hosted applications 114, where at least a portion of the hosted applications 114 are executed via requests and responses sent to users or clients within and communicably coupled to the illustrated environment 100 of
In some instances, the server 102 may store a plurality of various hosted applications 114, while in other instances, the server 102 may be a dedicated server meant to store and execute only a single hosted application 114. In some instances, the server 102 may include a web server, where the hosted applications 114 represent one or more web-based applications accessed and executed via network 132 by the clients 135 of the system to perform the programmed tasks or operations of the hosted application 114. At a high level, the server 102 includes an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the environment 100. Specifically, the server 102 illustrated in
In addition to requests from the external clients 135 illustrated in
In the present implementation, and as shown in
Generally, the network 132 facilitates wireless or wireline communications between the components of the environment 100 (i.e., between the server 102 and the clients 135), as well as with any other local or remote computer, such as additional clients, servers, or other devices communicably coupled to network 132 but not illustrated in
Further, all or a portion of the network 132 can include either a wireline or wireless link. Example wireless links may include 802.11 a/b/g/n, 802.20, WiMax, and/or any other appropriate wireless link. In other words, the network 132 encompasses any internal or external network, networks, sub-network, or combination thereof operable to facilitate communications between various computing components inside and outside the illustrated environment 100. The network 132 may communicate, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and other suitable information between network addresses. The network 132 may also include one or more local area networks (LANs), radio access networks (RANs), metropolitan area networks (MANs), wide area networks (WANs), all or a portion of the Internet, and/or any other communication system or systems at one or more locations.
As illustrated in
At a high level, each of the one or more hosted applications 114 is any application, program, module, process, or other software that may execute, change, delete, generate, or otherwise manage information according to the present disclosure, particularly in response to and in connection with one or more requests received from the illustrated clients 135 and their associated client applications 144. In certain cases, only one hosted application 114 may be located at a particular server 102. In others, a plurality of related and/or unrelated hosted applications 114 may be stored at a single server 102, or located across a plurality of other servers 102, as well. In certain cases, environment 100 may implement a composite hosted application 114. For example, portions of the composite application may be implemented as Enterprise Java Beans (EJBs) or design-time components may have the ability to generate run-time implementations into different platforms, such as J2EE (Java 2 Platform, Enterprise Edition), ABAP (Advanced Business Application Programming) objects, or Microsoft's .NET, among others. Additionally, the hosted applications 114 may represent web-based applications accessed and executed by remote clients 135 or client applications 144 via the network 132 (e.g., through the Internet). Further, while illustrated as internal to server 102, one or more processes associated with a particular hosted application 114 may be stored, referenced, or executed remotely. For example, a portion of a particular hosted application 114 may be a web service associated with the application that is remotely called, while another portion of the hosted application 114 may be an interface object or agent bundled for processing at a remote client 135. Moreover, any or all of the hosted applications 114 may be a child or sub-module of another software module or enterprise application (not illustrated) without departing from the scope of this disclosure. 
Still further, portions of the hosted application 114 may be executed by a user working directly at server 102, as well as remotely at client 135.
The illustrated server 102 also includes memory 117. Memory 117 may include any memory or database module and may take the form of volatile or non-volatile memory including, without limitation, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), removable media, or any other suitable local or remote memory component. Memory 117 may store various objects or data, including classes, frameworks, applications, backup data, business objects, jobs, web pages, web page templates, database tables, repositories storing business and/or dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto associated with the purposes of the server 102 and its one or more hosted applications 114. For example, memory 117 may store a word matrix 120 that includes nodes containing terms and links defining a semantic proximity and/or string similarity between terms in the linked nodes. The memory 117 may store a set of terms labeled with the correct classification that is received from a user of the server 102. Additionally, memory 117 may include any other appropriate data, such as VPN applications, firmware logs and policies, firewall policies, a security or access log, print or other reporting files, as well as others.
The illustrated environment of
Moreover, while each client 135 is described in terms of being used by a single user, this disclosure contemplates that many users may use one computer, or that one user may use multiple computers. As used in this disclosure, client 135 is intended to encompass a personal computer, touch screen terminal, workstation, network computer, kiosk, wireless data port, smart phone, personal data assistant (PDA), one or more processors within these or other devices, or any other suitable processing device. For example, each client 135 may include a computer that includes an input device, such as a keypad, touch screen, mouse, or other device that can accept user information, and an output device that conveys information associated with the operation of the server 102 (and hosted application 114) or the client 135 itself, including digital data, visual information, the client application 144, or the GUI 138. Both the input and output device may include fixed or removable storage media such as a magnetic storage media, CD-ROM, or other suitable media to both receive input from and provide output to users of the clients 135 through the display, namely, the GUI 138.
Further, the illustrated client 135 includes a GUI 138 including a graphical user interface operable to interface with at least a portion of environment 100 for any suitable purpose, including generating a visual representation of the client application 144 (in some instances, the client's web browser) and the interactions with the hosted application 114, including the responses received from the hosted application 114 received in response to the requests sent by the client application 144. Generally, through the GUI 138, the user is provided with an efficient and user-friendly presentation of data provided by or communicated within the system. The term “graphical user interface,” or GUI, may be used in the singular or the plural to describe one or more graphical user interfaces and each of the displays of a particular graphical user interface. Therefore, the GUI 138 can represent any graphical user interface, including but not limited to, a web browser, touch screen, or command line interface (CLI) that processes information in environment 100 and efficiently presents the information results to the user.
In general, the GUI 138 may include a plurality of user interface (UI) elements, some or all associated with the client application 144, such as interactive fields, pull-down lists, and buttons operable by the user at client 135. These and other UI elements may be related to or represent the functions of the client application 144, as well as other software applications executing at the client 135. In particular, the GUI 138 may be used to present the client-based perspective of the hosted application 114, and may be used (as a web browser or using the client application 144 as a web browser) to view and navigate the hosted application 114, as well as various web pages located both internal and external to the server, some of which may be associated with the hosted application 114. For purposes of the present location, the GUI 138 may be a part of or the entirety of the client application 144, while also merely a tool for displaying the visual representation of the client and hosted applications' 114 actions and interactions. In some instances, the GUI 138 and the client application 144 may be used interchangeably, particularly when the client application 144 represents a web browser associated with the hosted application 114.
While
To build a classification model for classifying commodities, the adaptive learning machine uses a classification system.
Referring again to
To determine the nodes 405 and the links 410 of the word matrix 400, the adaptive learning machine searches for content associated with commodities. For example, the adaptive learning machine uses a thesaurus to identify semantically related terms. From a thesaurus (for example), the adaptive learning machine determines that the words “image” and “picture” in
The adaptive learning machine, in some embodiments, applies a semantic proximity model to the terms to define the links 410 between the nodes 405. Because product descriptions are sometimes abbreviated or misspelled, the adaptive learning machine can use string similarity to map a term into a similar term. The adaptive learning machine can combine string similarity and semantic similarity techniques to generate the word matrix 400. The word matrix 400 can be generated for classifying commodities that are described using different languages. In this case, the links 410 connect nodes 405 containing semantically related terms in different languages.
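As a minimal sketch of mapping an abbreviated or misspelled term into a similar term, the following uses the Python standard library's difflib as a stand-in string similarity model. The disclosure does not name a particular string similarity model; the function name map_to_known_term, the sample vocabulary, and the 0.8 cutoff are assumptions.

```python
import difflib
from typing import List, Optional


def map_to_known_term(term: str, vocabulary: List[str],
                      cutoff: float = 0.8) -> Optional[str]:
    """Return the closest vocabulary term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical vocabulary of terms already present in the word matrix.
vocabulary = ["camera", "digital", "laptop", "computer"]
corrected = map_to_known_term("digtal", vocabulary)
```

A term that maps to a known term could then be linked to it in the word matrix, so that later queries containing the same misspelling resolve without a fresh comparison.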
In some instances, the adaptive learning machine determines similarity scores between nodes of the word matrix based on the number of links between nodes. For example, the adaptive learning machine determines similarity scores from a number of links between the nodes that contain terms associated with a commodity. In some instances, the adaptive learning machine determines similarity scores between nodes of the word matrix based on the number of unique paths. For example, in the word matrix of
Referring again to
By generating a semantic vector for each term, the adaptive learning machine can use the word matrix to convert a textual description of a commodity into a semantic vector. For example, the adaptive learning machine can use the similarity scores to convert the textual description of the commodity into the semantic vector. The semantic vector can be a high dimensional vector representation (e.g., a few hundred numbers) of the textual description of the commodity.
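The conversion from a textual description to a semantic vector might be sketched as follows. This is an illustration under stated assumptions, not the disclosed method: the similarity score is taken to be 1 / (1 + number of links on the shortest path), each vector dimension is the similarity to one chosen reference term, and a description vector is the average of its term vectors. The names hops, term_vector, description_vector, and the sample graph are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Set

Graph = Dict[str, Set[str]]  # term -> directly linked terms


def hops(graph: Graph, start: str, goal: str) -> float:
    """Fewest links between two terms (breadth-first search); inf if unconnected."""
    if start not in graph or goal not in graph:
        return float("inf")
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, set()) - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return float("inf")


def term_vector(graph: Graph, term: str, dims: Sequence[str]) -> List[float]:
    """One score per reference term: 1 / (1 + link count), 0.0 if unreachable."""
    return [1.0 / (1.0 + hops(graph, term, d)) for d in dims]


def description_vector(graph: Graph, description: str,
                       dims: Sequence[str]) -> List[float]:
    """Average the term vectors of the whitespace-separated terms."""
    vectors = [term_vector(graph, t, dims) for t in description.lower().split()]
    if not vectors:
        return [0.0] * len(dims)
    return [sum(col) / len(vectors) for col in zip(*vectors)]


graph: Graph = {
    "image": {"picture"},
    "picture": {"image", "camera"},
    "camera": {"picture"},
}
dims = ["image", "camera"]  # reference terms; a real vector would use hundreds
```

Counting unique paths instead of shortest paths, as the text also mentions, would change only the hops function.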
The adaptive learning machine receives a set of terms that have been labeled with the correct classifications from a user of the adaptive learning machine, at step 203. The adaptive learning machine can receive additional information related to commodities, such as vendor information, a name of a company that manufactures the commodity, an industry code, price, weight, or dimensions. For example, the adaptive learning machine may receive the set of terms “a Lenovo T510 laptop computer, 16 GB RAM, 200 GB disk” and the corresponding classification of “computing device.”
The adaptive learning machine generates a classification vector that represents the commodity description, at step 204. The adaptive learning machine combines the semantic vector for each term of the commodity description generated from the word matrix with additional dimensions that represent additional properties of the commodity (e.g., the set of labeled terms, and/or the additional information) to generate the classification vector that represents the commodity description. The classification vector is associated with the correct classification that was received from the user.
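A minimal sketch of this combination step is shown below, assuming the additional properties are numeric and are simply appended to the semantic vector in a fixed order. The function name classification_vector, the attribute names in EXTRA_ORDER, and the sample values are hypothetical; the label "computing device" is taken from the example above.

```python
from typing import Dict, List, Sequence, Tuple


def classification_vector(semantic: Sequence[float],
                          extras: Dict[str, float],
                          extra_order: Sequence[str]) -> List[float]:
    """Append the additional properties, in a fixed order, to the semantic vector."""
    # Missing properties default to 0.0 so every vector has the same length.
    return list(semantic) + [float(extras.get(name, 0.0)) for name in extra_order]


EXTRA_ORDER = ["price", "weight_kg"]  # hypothetical additional dimensions

vec = classification_vector([0.5, 0.25], {"price": 999.0}, EXTRA_ORDER)

# The vector is stored together with the label supplied by the user.
training_example: Tuple[List[float], str] = (vec, "computing device")
```

In practice the additional dimensions would likely need scaling so that a large value such as a price does not dominate the semantic dimensions; the sketch omits that step.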
The classification vectors of a classification model can be represented graphically.
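One simple way to picture classification regions is a nearest-centroid model, sketched below: each classification is represented by the mean of its labeled classification vectors, and a query vector falls in the region of the closest mean. This is an assumption chosen for illustration; the disclosure does not state how the regions are derived. The names centroids and nearest_regions and the sample vectors are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (classification vector, label)


def centroids(examples: Sequence[Example]) -> Dict[str, List[float]]:
    """Mean vector per label, computed from the labeled classification vectors."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for vec, label in examples:
        grouped[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def nearest_regions(query: Sequence[float], cents: Dict[str, List[float]],
                    k: int = 1) -> List[str]:
    """Labels of the k regions whose centroids are closest to the query vector."""
    ranked = sorted(cents, key=lambda label: math.dist(query, cents[label]))
    return ranked[:k]


examples: List[Example] = [
    ([1.0, 0.0], "Digital Cameras"),
    ([0.8, 0.2], "Digital Cameras"),
    ([0.0, 1.0], "computing device"),
]
cents = centroids(examples)
```

Asking for k greater than one returns several candidate classifications in order of distance, which matches the idea of preparing more than one classification for display.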
Referring again to
The adaptive learning machine receives a query from a client, e.g., a business enterprise computing system, at step 602. The query includes data associated with a commodity. The data includes one or more attributes of the commodity. For example, the adaptive learning machine receives a query for “Canon EOS 12.2MP CMOS Digital SLR.” The attributes of the commodity are the terms in the query, such as “Canon,” “EOS,” “12.2MP,” “CMOS,” “Digital,” and “SLR.”
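In this example the attributes are the whitespace-separated terms of the query, which can be sketched as follows. The case-insensitive lookup against a set of known terms is an assumption about how matching might begin, and the names query_attributes, match_terms, and known_terms are hypothetical.

```python
from typing import List, Set, Tuple


def query_attributes(query: str) -> List[str]:
    """Split a query into its whitespace-separated attribute terms."""
    return query.split()


def match_terms(attributes: List[str],
                known_terms: Set[str]) -> Tuple[List[str], List[str]]:
    """Partition attributes into those found among the known terms and the rest."""
    matched = [a for a in attributes if a.lower() in known_terms]
    unmatched = [a for a in attributes if a.lower() not in known_terms]
    return matched, unmatched


attributes = query_attributes("Canon EOS 12.2MP CMOS Digital SLR")

# Hypothetical set of lowercase terms already present in the word matrix.
known_terms = {"canon", "digital", "slr", "camera"}
matched, unmatched = match_terms(attributes, known_terms)
```

Attributes left unmatched by an exact lookup are the natural candidates for the string similarity mapping described earlier.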
At step 604, the adaptive learning machine matches one or more attributes of the commodity with one or more terms in a word matrix and/or one or more terms in a set of labeled terms. For the example query “Canon EOS 12.2MP CMOS Digital SLR,” the adaptive learning machine searches a word matrix, such as word matrix 400 shown in
Referring again to
The adaptive learning machine identifies one or more classifications of the commodity, at step 608. The adaptive learning machine identifies one or more classifications based on the numerical vector corresponding to the attributes of the commodity. For example, the adaptive learning machine identifies a classification region of a classification model, e.g., the classification model 500 of
In some instances, the adaptive learning machine can identify another possible classification of the commodity. The other classification can be in a different classification level, e.g., the commodity classification level. For example, the adaptive learning machine can identify the classification “Digital Cameras” in the commodity classification level 308 in
Referring again to
Referring again to
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, other methods described herein besides or in addition to that illustrated in
Claims
1. A method performed with a distributed computing system for classifying one or more commodities, the method comprising:
- receiving, at a local computing system, a query from a business enterprise computing system, the query comprising data associated with a business enterprise commodity, the data comprising one or more attributes of the business enterprise commodity;
- matching the one or more attributes of the business enterprise commodity with one or more terms of a plurality of terms in a word matrix, the word matrix comprising: a plurality of nodes that each comprise a term of the plurality of terms; and a plurality of links that each connect two or more nodes and define a similarity between the two or more nodes;
- generating, based on the matching, a numerical vector for the business enterprise commodity;
- identifying, based on the numerical vector, one or more classification regions that each define a classification of the business enterprise commodity; and
- preparing the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system.
2. The method of claim 1, further comprising:
- receiving a set of terms, each term of the set of terms is labeled with a correct classification;
- matching the one or more attributes of the business enterprise commodity with one or more terms of the set of terms;
- generating the numerical vector based on matching the one or more attributes of the business enterprise commodity with one or more terms of the plurality of terms in the word matrix and one or more terms of the set of terms;
- identifying one or more classifications of the business enterprise commodity based on the numerical vector; and
- preparing the one or more classifications of the business enterprise commodity for display at the business enterprise computing system.
3. The method of claim 1, further comprising:
- prior to receiving the query from the business enterprise computing system, building the word matrix.
4. The method of claim 3, wherein building the word matrix comprises:
- searching for content associated with a plurality of business enterprise commodities;
- parsing the content into the plurality of terms to define the plurality of nodes; and
- applying a semantic proximity model to the plurality of terms to define the plurality of links, wherein a link that connects two or more nodes defines a semantic similarity between the two or more nodes.
5. The method of claim 4, wherein building the word matrix further comprises:
- applying a string similarity model to map a term of the plurality of terms into a similar term, wherein a link that connects two or more nodes defines a string similarity between the two or more nodes.
6. The method of claim 1, wherein the classification is defined by a first classification level in a plurality of classification levels defined in a classification hierarchy.
7. The method of claim 6, wherein the plurality of classification levels comprise a segment classification level, a family classification level, a class classification level, a commodity classification level, and a business function classification level.
8. The method of claim 7, wherein the classification comprises the commodity classification level.
9. The method of claim 1, further comprising:
- transmitting the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system;
- receiving a selection of one of the classifications of the business enterprise commodity from the business enterprise computing system; and
- updating the word matrix based on the received selection.
10. The method of claim 9, wherein updating the word matrix based on the received selection comprises:
- creating a direct link between the nodes comprising terms matching the one or more attributes.
11. A computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving, at a local computing system, a query from a business enterprise computing system, the query comprising data associated with a business enterprise commodity, the data comprising one or more attributes of the business enterprise commodity;
- matching the one or more attributes of the business enterprise commodity with one or more terms of a plurality of terms in a word matrix, the word matrix comprising:
- a plurality of nodes that each comprise a term of the plurality of terms; and
- a plurality of links that each connect two or more nodes and define a similarity between the two or more nodes;
- generating, based on the matching, a numerical vector for the business enterprise commodity;
- identifying, based on the numerical vector, one or more classification regions that each define a classification of the business enterprise commodity; and
- preparing the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system.
12. The computer storage medium of claim 11, wherein the operations further comprise:
- receiving a set of terms, each term of the set of terms is labeled with a correct classification;
- matching the one or more attributes of the business enterprise commodity with one or more terms of the set of terms;
- generating the numerical vector based on matching the one or more attributes of the business enterprise commodity with one or more terms of the plurality of terms in the word matrix and one or more terms of the set of terms;
- identifying one or more classifications of the business enterprise commodity based on the numerical vector; and
- preparing the one or more classifications of the business enterprise commodity for display at the business enterprise computing system.
13. The computer storage medium of claim 11, wherein the operations further comprise:
- prior to receiving the query from the business enterprise computing system, building the word matrix.
14. The computer storage medium of claim 13, wherein building the word matrix comprises:
- searching for content associated with a plurality of business enterprise commodities;
- parsing the content into the plurality of terms to define the plurality of nodes; and
- applying a semantic proximity model to the plurality of terms to define the plurality of links, wherein a link that connects two or more nodes defines a semantic similarity between the two or more nodes.
15. The computer storage medium of claim 14, wherein building the word matrix further comprises:
- applying a string similarity model to map a term of the plurality of terms into a similar term, wherein a link that connects two or more nodes defines a string similarity between the two or more nodes.
16. The computer storage medium of claim 11, wherein the operations further comprise:
- transmitting the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system;
- receiving a selection of one of the classifications of the business enterprise commodity from the business enterprise computing system; and
- updating the word matrix based on the received selection.
17. A system of one or more computers configured to perform operations comprising:
- receiving, at a local computing system, a query from a business enterprise computing system, the query comprising data associated with a business enterprise commodity, the data comprising one or more attributes of the business enterprise commodity;
- matching the one or more attributes of the business enterprise commodity with one or more terms of a plurality of terms in a word matrix, the word matrix comprising: a plurality of nodes that each comprise a term of the plurality of terms; and a plurality of links that each connect two or more nodes and define a similarity between the two or more nodes;
- generating, based on the matching, a numerical vector for the business enterprise commodity;
- identifying, based on the numerical vector, one or more classification regions that each define a classification of the business enterprise commodity; and
- preparing the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system.
18. The system of claim 17, further comprising:
- receiving a set of terms, each term of the set of terms is labeled with a correct classification;
- matching the one or more attributes of the business enterprise commodity with one or more terms of the set of terms;
- generating the numerical vector based on matching the one or more attributes of the business enterprise commodity with one or more terms of the plurality of terms in the word matrix and one or more terms of the set of terms;
- identifying one or more classifications of the business enterprise commodity based on the numerical vector; and
- preparing the one or more classifications of the business enterprise commodity for display at the business enterprise computing system.
19. The system of claim 17, further comprising:
- prior to receiving the query from the business enterprise computing system, searching for content associated with a plurality of business enterprise commodities;
- parsing the content into the plurality of terms to define the plurality of nodes; and
- applying a semantic proximity model to the plurality of terms to define the plurality of links, wherein a link that connects two or more nodes defines a semantic similarity between the two or more nodes.
20. The system of claim 17, further comprising:
- transmitting the classifications of the business enterprise commodity of the one or more identified classification regions for display at the business enterprise computing system;
- receiving a selection of one of the classifications of the business enterprise commodity from the business enterprise computing system; and
- updating the word matrix based on the received selection.
Type: Application
Filed: Jan 31, 2012
Publication Date: Aug 1, 2013
Applicant: Business Objects Software Limited (Dublin)
Inventor: Sherif Botros (Palo Alto, CA)
Application Number: 13/362,598
International Classification: G06F 17/30 (20060101);