KEYWORD AND BUSINESS TAG EXTRACTION
A system to extract relevant keywords or business tags that describe a company's business is provided. The keyword extraction system utilizes a smart crawler to identify and crawl product pages from a company's website. These pages serve to provide textual descriptions of product offerings, solutions, or services that make up the company's business. The keyword extraction system combines these web documents with other textual descriptions of companies, e.g. from third party data vendors or other public data sources and company databases, to form a corpus of documents that describe companies. The corpus of documents and keywords are processed to segment the plurality of companies into subsets by applying a clustering technique and to provide visualization of the clusters with business tags.
This patent application is a continuation of U.S. application Ser. No. 15/689,942, filed Aug. 29, 2017, which claims priority from U.S. Provisional Application No. 62/380,908 filed on Aug. 29, 2016, which are incorporated by reference herein.
FIELD
Implementations disclosed herein relate, in general, to information management technology and specifically to semantic analytics technology.
BACKGROUND
Marketing strategies commonly involve dividing a broad market of prospects into subsets or segments of prospects that have characteristics in common, in the hope that they will have common needs, interests, or priorities. In the case that prospects are individual human consumers, such characteristics can include, but are not limited to, demographic information about the age, sex, race, religion, occupation, income, or education level, geographic information about the prospect's location within regions, countries, states, cities, neighborhoods, or other locales, and behavioral and psychographic information about the lifestyle, attitude towards and response to certain products or other stimuli. In the case that prospects are companies, e.g. in business-to-business (B2B) marketing, such characteristics commonly include firmographic information, such as the company size, revenue, industry, and location. Marketers can apply strategies that are specialized for each segment, e.g. by creating messaging content or advertisements that resonate with, or are more relevant to the target prospect, which lead to much better conversion rates.
In the same vein, sales development teams and account executives achieve better outcomes if they research the prospect's background or characteristics and personalize their outreach efforts. As an example, in B2B situations, providing a case study or success story about a current customer similar to the prospect company is a powerful strategy to convince the prospect to purchase a product or service because it provides evidence of previous success and reduces the perceived risk by the prospect. The ability to semantically describe, group, and identify similar companies can be viewed as a form of business micro-segmentation that is much more specific than segmenting using broad industry labels to describe prospect companies, and is in turn more powerful and actionable.
SUMMARY
A system to extract relevant keywords or business tags that describe a company's business is provided. The keyword extraction system utilizes a smart crawler to identify and crawl product pages from a company's website. These pages serve to provide textual descriptions of product offerings, solutions, or services that make up the company's business. The keyword extraction system combines these web documents with other textual descriptions of companies, e.g. from third party data vendors or other public data sources and company databases, to form a corpus of documents that describe companies. The corpus of documents and keywords are processed to segment the plurality of companies into subsets by applying a clustering technique and to provide visualization of the clusters with business tags.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Other features, details, utilities, and advantages of the claimed subject matter will be apparent from the following more particular written Detailed Description of various embodiments and implementations as further illustrated in the accompanying drawings and defined in the appended claims.
A further understanding of the nature and advantages of the present technology may be realized by reference to the figures, which are described in the remaining portion of the specification. In the figures, like reference numerals are used throughout several figures to refer to similar components. In some instances, a reference numeral may have an associated sub-label consisting of a lower-case letter to denote one of multiple similar components. When reference is made to a reference numeral without specification of a sub-label, the reference is intended to refer to all such multiple similar components.
Disclosed herein is an automated system and method to extract relevant keywords (i.e. business tags) that describe a company's business.
These pages of the companies' websites provide textual descriptions of product offerings, solutions, or services that make up the companies' business. For example, a web page of a target company that is in the business of selling footwear may provide information about what kind of footwear the target company is selling, the price point of the footwear, the target market for the footwear, etc. The operation 102 identifies a number of keywords related to various companies. The operation 102 performs smart crawling in that it determines which pages are appropriate for crawling, which keywords are appropriate, etc. For example, the operation 102 may determine that it is important to crawl a product page but that it is not necessary to crawl a terms and conditions page. Similarly, the operation 102 may determine that it does not need to extract words such as “the,” “best,” etc., as they do not necessarily describe products and services of the company.
In one implementation, the operation 102 outputs a list of keywords extracted from the web pages for a company and the frequency of each of such keywords. For example, for a company selling footwear, the keywords may be “shoes,” “sandals,” “running,” etc. The frequency at which each of these keywords is extracted from the web pages may also be tabulated. In one implementation, the operation 102 may output a matrix of a large number of companies and keywords for each of these companies.
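The keyword list and frequency tabulation described above can be sketched as follows. This is a minimal illustration, not the crawler itself: the stop-word list, the tokenizing regular expression, and the function name `keyword_counts` are assumptions introduced here, and the page texts are invented sample input.

```python
from collections import Counter
import re

# Hypothetical stop list; a production system would use a far larger one.
STOP_WORDS = {"the", "best", "a", "an", "and", "of", "for", "our", "in", "to"}

def keyword_counts(pages):
    """Count non-stop-word tokens across the crawled pages of one company."""
    counts = Counter()
    for text in pages:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Invented sample text standing in for crawled product pages.
pages = [
    "The best running shoes and sandals",
    "Running shoes for the trail",
]
counts = keyword_counts(pages)
```

Stacking one such counter per company gives the company-by-keyword matrix mentioned in the text.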
Subsequently, an operation 104 combines these web documents with other textual descriptions of companies, e.g. from third-party data vendors or other public data sources and company databases, to form a corpus of documents that describe companies. Thus, for example, the operation 104 may extract keywords from other sources, such as a news article, a LinkedIn™ page, a Wikipedia™ page about the company, a consumer product review website, AdWords purchased by the company, etc. Thus, in the example of a company selling footwear, the operation 104 may combine textual descriptions from such other sources, also referred to as the secondary sources. The output of the operation 104 is used to enhance the matrix generated at the operation 102.
Subsequently, an operation 106 extracts keyword phrases from the text descriptions and counts keyword phrases that appear for each company, forming a vector of term frequencies to represent each company, where a term is an n-gram, a chain of n words. Specifically, the operation 106 generates a list of candidate descriptive phrases that may provide a description of a company. For example, for the company selling footwear, one such phrase may be “running shoe”. Another such phrase may be “low-impact shoe”, etc. The operation 106 extracts such keyword phrases for the company and documents the frequency of each of these keyword phrases. In one implementation, the candidate descriptive phrases are generated by aggregating the keywords from the company web pages and from the secondary sources, as well as meta keywords and meta descriptors from the company web pages and from the secondary sources. The operation 106 aggregates these descriptive phrases for the company and generates the count of each phrase.
The descriptive phrases are also referred to as n-grams. For example, for a company selling footwear, a unigram may be “shoes”, a bi-gram may be “running shoe”, a tri-gram may be “altitude running shoe”, etc. The operation 106 generates the count for each such n-gram related to the company. In one implementation, the operation 106 generates the n-grams across the websites of the companies globally to determine the n-grams that are used more often to describe a company or a product. Each of the n-grams in this collection of n-grams is associated with a count of how often the n-gram occurs. In one implementation, an n-gram having a higher count is ranked higher.
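A minimal sketch of n-gram generation and counting, assuming whitespace tokenization and a maximum n of 3. The function names `ngrams` and `term_frequencies` are hypothetical, and the two input strings are invented examples.

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    """Yield every n-gram, joined by spaces, for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def term_frequencies(documents, max_n=3):
    """Count n-gram occurrences over all text descriptions of one company."""
    tf = Counter()
    for doc in documents:
        tf.update(ngrams(doc.lower().split(), max_n))
    return tf

tf = term_frequencies(["altitude running shoe", "running shoe sale"])
```

The resulting counter is the term-frequency vector for the company, keyed by n-gram.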
An operation 108 computes document frequencies (DF) for each considered phrase or n-gram across the entire corpus, defined as the number of companies whose text descriptions contain that phrase. In one implementation, to compute the DF for each n-gram, a website is considered one document.
Alternatively, each web page may be considered a document. Thus, if the n-gram “altitude running shoe” shows up on 300 web pages, including company pages, news sources, Wikipedia pages, etc., the n-gram “altitude running shoe” is given a document frequency of 300. In one implementation, the count may be evaluated as a percentage of the total documents in the universe. For example, if the system is evaluating a million documents, the document frequency of 300 may indicate the phrase “altitude running shoe” to be important and descriptive, while a word like “the” is deemed unimportant because it appears in nearly all of the million documents. In yet another alternative implementation, each occurrence of an n-gram is given a weight based on the document that the n-gram is from. For example, an n-gram appearing on a Wikipedia document may be given a higher weight compared to an n-gram appearing on a social network document.
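The document-frequency count can be sketched as below, treating each company's text as one document. The corpus contents and the name `document_frequencies` are illustrative assumptions; the per-source weighting variant is not shown.

```python
from collections import Counter

def document_frequencies(docs_terms):
    """docs_terms maps a document id to the n-grams found in that document."""
    df = Counter()
    for terms in docs_terms.values():
        df.update(set(terms))  # a document counts at most once per n-gram
    return df

# Invented toy corpus: one entry per company.
corpus = {
    "company_a": ["running shoe", "running shoe", "sandal"],
    "company_b": ["running shoe", "boot"],
    "company_c": ["boot"],
}
df = document_frequencies(corpus)

# Document frequency expressed as a share of all documents.
share = {term: n / len(corpus) for term, n in df.items()}
```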
To reduce the contribution of phrases that are very common, an operation 110 applies a term-frequency (TF)-inverse-document-frequency (IDF) (TF-IDF) transformation. Here, the term-frequency (TF) emphasizes phrases that appear multiple times within the document, while the inverse document frequency (IDF) de-emphasizes phrases that are common across documents, and emphasizes phrases that are rarer, more descriptive, or salient. For example, if an n-gram “running shoe” appears 10 times in a document, it has the TF of 10 for the document. On the other hand, if the n-gram “running shoe” is common across all documents, it may be a very common n-gram and its inverse frequency across all documents (IDF) de-emphasizes the importance of that n-gram. Thus, the TF is a frequency per document and the IDF is inversely proportional to the frequency across the entire corpus of documents. The TF may be generated based on output from the operation 106, whereas the IDF may be calculated based on the output of the operation 108.
The term-frequency function is a function that increases with the number of occurrences of an n-gram phrase in a document. An example is simply TF=term_count, while in a sublinear scaling example, TF=1+log(term_count). The inverse-document-frequency function is a function that decreases with the number of documents that contain the n-gram phrase. An example formulation is IDF=log(num_total_documents/num_documents_with_term). The TF-IDF is the multiplicative product of TF and IDF.
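The example formulations above translate directly into code. This sketch assumes the natural logarithm, since the text does not state a base, and the function names are introduced here for illustration only.

```python
import math

def tf_raw(term_count):
    """TF = term_count."""
    return float(term_count)

def tf_sublinear(term_count):
    """TF = 1 + log(term_count), with 0 for an absent term."""
    return 1.0 + math.log(term_count) if term_count > 0 else 0.0

def idf(num_total_documents, num_documents_with_term):
    """IDF = log(num_total_documents / num_documents_with_term)."""
    return math.log(num_total_documents / num_documents_with_term)

def tf_idf(term_count, num_total_documents, num_documents_with_term, sublinear=False):
    tf = tf_sublinear(term_count) if sublinear else tf_raw(term_count)
    return tf * idf(num_total_documents, num_documents_with_term)

# A phrase found in 300 of a million documents versus one found in all of them,
# each appearing 10 times in the document being scored.
rare = tf_idf(10, 1_000_000, 300)
common = tf_idf(10, 1_000_000, 1_000_000)
```

With these inputs the ubiquitous phrase scores zero, because its IDF is log(1), while the rarer phrase keeps a positive score.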
While the TF-IDF transformation is good at scaling individual terms independently based on occurrences within a document and occurrences across the corpus, it does not always work well for keyword ranking; the terms with the highest TF-IDF values are often not the terms that a human would consider to be the most relevant descriptors of the company. The underlying problem is that TF-IDF does not take into consideration the co-occurrence of different keyword phrases within each document. The patterns of co-occurring words and phrases can be interpreted as “topics” within a document, and each company or document can be expected to focus on a few topics or themes. A human typically identifies relevant keywords by considering both the saliency of the keyword itself, and whether the keyword is “on topic” within the context of the document. For each document, the operation 110 outputs a list of key n-grams and a TF-IDF value for each key n-gram.
An operation 112 determines similarity of keywords. Specifically, the operation 112 determines how similar any two keywords are to each other. The method for determining similarity of keywords is further disclosed below with respect to
An operation 114 applies a relevance transform by boosting the TF-IDF value of phrases within each document based on how on-topic it is. One of the inputs for the operation 114 is the keyword similarity value generated at operation 112. A given document can be represented by n-grams and their corresponding strengths. Considering the co-occurrence of n-grams within the document, also allows extracting a set of topics, their strengths, and their associated influences to-and-from the individual n-grams. A generalized diagram of the n-gram and topic relationships is shown below in
where r_i is the relevance of n-gram i, w_i is the strength of n-gram i, e_ji is the influence or edge weight from topic j to n-gram i, t_j is the strength of topic j, and k is the number of topics.
In one implementation, the topics can be selected to be the individual stemmed words that make up the n-grams. Stemming refers to the reduction of words to their word stem, base, or root form. For example, a bigram “mobile gaming” can be viewed as exhibiting two topics “mobile” and “game”, the stemmed forms of “mobile” and “gaming”. If there exist many other unique phrases that are comprised of words that stem to “mobile” and “game”, such as “mobile applications” or “gaming equipment”, then it would increase the topical strength of “mobile” and “game” within this document, and every phrase linked to these topics would get boosted in terms of relevance. One example function for assigning the topic strength is 1+log(degree) where degree is the number of edges from that topic to its associated n-grams within the document, or in other words, the number of unique n-grams that contain a word that stems to that topic. In this case, the edge weights can simply be 1.0 when there is an association between an n-gram and a topic, and 0.0 (no edge) when the n-gram does not contain a word that stems to the topic. A hypothetical example of this implementation, for a document about mobile gaming and game development 400 is shown below in
Similar Keywords and Phrases:
Furthermore, similarities between n-grams may also be used by a computer to determine the relevance scores. For example, the n-gram “mobile games” and “mobile gaming” may be determined to be similar, in which case, the co-occurrence of these two n-grams being similar to each other within one document can be used to boost the TF-IDF value of each of these two n-grams.
In other implementations, the topics, topic strengths, and n-gram-topic edge weights for each document can be extracted using techniques such as Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Dirichlet Processes, Non-negative Matrix Factorization, and others, or a combination of methods. Similar to before, the topical strength can also be used to amplify the associated individual n-gram strengths to form a measure of relevance for each n-gram.
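The stemmed-word topic implementation described earlier can be sketched as follows. Two things here are assumptions rather than statements from the text: the tiny lookup table `STEMS` stands in for a real stemmer, and the combination rule r_i = w_i × Σ_j e_ji × t_j is one plausible reading of the symbol definitions, since the relevance expression itself is not reproduced in this text. Topic strength uses the stated example, 1 + log(degree), with edge weights of 1.0.

```python
import math

# Toy stand-in for a real stemmer; the mapping is illustrative only.
STEMS = {"gaming": "game", "games": "game", "applications": "application"}

def stem(word):
    return STEMS.get(word, word)

def relevance_scores(strengths):
    """strengths maps each n-gram to its strength w_i (e.g. TF-IDF) in one document."""
    # Edge e_ji = 1.0 when n-gram i contains a word that stems to topic j.
    topics_of = {g: {stem(w) for w in g.split()} for g in strengths}
    # Degree of a topic = number of unique n-grams linked to it.
    degree = {}
    for topics in topics_of.values():
        for t in topics:
            degree[t] = degree.get(t, 0) + 1
    topic_strength = {t: 1.0 + math.log(d) for t, d in degree.items()}
    # Assumed combination rule: r_i = w_i * sum_j e_ji * t_j.
    return {g: w * sum(topic_strength[t] for t in topics_of[g])
            for g, w in strengths.items()}

# Invented document with equal starting strengths, to isolate the topical boost.
doc = {"mobile gaming": 2.0, "mobile applications": 2.0,
       "gaming equipment": 2.0, "press release": 2.0}
scores = relevance_scores(doc)
```

In this toy document "mobile gaming" ends up with the highest score because both of its topics are shared with other phrases, while "press release" receives no boost beyond the baseline.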
The top-ranking keyword phrases by relevance score can be used as business tags that succinctly describe a company's business or products. The dataset supports lookups by company to find the company's descriptive tags (as shown below in
Keyword Relevance/Keyword to Company Search
In one implementation, an operation 116 generates relevance scores for various companies and keywords/phrases. For example, the operation 116 may produce, for each company, a ranked and scored list of keywords. Thus, for a particular footwear company, the keyword “boots” may be ranked higher than the keyword “sandals”, in which case that company may be more likely to sell, specialize in, or be known for boots than sandals. In one implementation, the operation 116 may determine such ranking based on the TF-IDF for the terms in the documents related to the company. For example, if the keyword “boots” appears in more documents for the particular footwear company than the keyword “sandals”, “boots” is ranked higher than “sandals” for that particular footwear company.
Similarly, the operation 116 may also produce, for each keyword, a ranked and scored list of companies. Thus, for example, for the keyword “boots”, a First Footwear Company may be ranked higher than a Second Footwear Company, which may signify that the First Footwear Company is more likely to sell, specialize in, or be known for boots than the Second Footwear Company. In one implementation, the operation 116 may determine such ranking based on the TF-IDF for the term in the documents related to the companies. For example, if the term “boots” appears more often in documents related to the First Footwear Company than in the documents related to the Second Footwear Company, the First Footwear Company is ranked higher than the Second Footwear Company for the keyword “boots.” While the illustrated implementations of the operations 100 disclose the operation 114 for boosting the TF-IDF value and the operation 116 for determining keyword relevance, in an alternative implementation, these operations may be combined.
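The two ranked lists, keywords per company and companies per keyword, can be held in a pair of sorted indexes. This is a sketch only: the scores are invented numbers, and `build_indexes` is a hypothetical helper name.

```python
from collections import defaultdict

def build_indexes(scores):
    """scores maps (company, keyword) to a relevance score."""
    by_company = defaultdict(list)
    by_keyword = defaultdict(list)
    for (company, keyword), s in scores.items():
        by_company[company].append((keyword, s))
        by_keyword[keyword].append((company, s))
    # Sort every list so the highest-scoring entry comes first.
    for index in (by_company, by_keyword):
        for entries in index.values():
            entries.sort(key=lambda pair: pair[1], reverse=True)
    return by_company, by_keyword

# Invented relevance scores for illustration.
scores = {
    ("First Footwear Company", "boots"): 9.1,
    ("First Footwear Company", "sandals"): 2.3,
    ("Second Footwear Company", "boots"): 4.0,
}
by_company, by_keyword = build_indexes(scores)
```

Looking up `by_keyword["boots"]` returns companies in ranked order, and `by_company["First Footwear Company"]` returns that company's ranked keywords.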
Clustering and Cluster Tagging
The TF-IDF and Relevance based semantic representations of companies can be used to directly drive product applications as well as implicitly support downstream machine learning applications. In one machine learning application, Representation Learning techniques are applied by an operation 118 on the TF-IDF or relevance vectors to generalize or project companies in the high dimensional n-gram space into a lower dimensional topic space. Such techniques include using Singular Value Decomposition, Latent Dirichlet Allocation, Hierarchical Dirichlet Processes, Non-negative Matrix Factorization, Neural Network Autoencoders, and others. Companies that are close together in the topic space, e.g. according to Euclidean or Cosine distance, are effectively similar to each other in terms of their business, product offerings, solutions or services.
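As an illustration of comparing companies in the topic space, the following sketch computes cosine distance between sparse vectors stored as dictionaries. The vectors and topic names are invented, and returning 1.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two sparse dict vectors."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumed convention for empty vectors
    return 1.0 - dot / (norm_u * norm_v)

# Invented topic-space vectors for three companies.
a = {"footwear": 0.9, "outdoor": 0.1}
b = {"footwear": 0.8, "outdoor": 0.2}
c = {"storage": 1.0}
```

Here `a` and `b` are close to each other, while `c` shares no topic with `a` and sits at the maximum distance of 1.0 for non-negative vectors.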
Given that similar companies are close together in the topic vector space, a clustering algorithm is also applied at an operation 120 to automatically segment a broad set of companies into subsets or groups of companies that are similar to each other. Such clustering techniques include, but are not limited to, K-Means, Spectral Clustering, DBSCAN, OPTICS, Hierarchical Clustering, and Affinity Propagation.
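A minimal sketch of one of the listed techniques, K-Means, written as plain Lloyd iterations. Seeding the centers with the first k points and running a fixed number of iterations are simplifying assumptions made for a deterministic example; the points are invented two-dimensional topic vectors.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; initial centers are the first k points (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Invented topic-space positions for four companies.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centers = kmeans(points, k=2)
```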
A technique disclosed herein also allows relevant n-gram keywords to be extracted automatically to describe each cluster of companies. For a cluster or any set of companies, the constituent companies' n-gram vector representations are merged into one n-gram vector via an aggregating function, a simple example of which is just the vector sum. From this merged n-gram vector, the relevance scoring algorithm described earlier is applied to boost the strengths of relevant n-grams, following the same principle that n-grams that are on-topic within the cluster should be considered more relevant. The top n-grams by relevance can be used to tag each cluster so that the clusters are readily human understandable.
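The cluster tagging step can be sketched with the vector sum as the aggregating function. The optional `relevance` argument is a hypothetical hook for whatever relevance scoring function is in use; when it is omitted, the merged strengths are ranked directly. The vectors are invented.

```python
from collections import Counter

def merge_vectors(vectors):
    """Aggregate n-gram vectors by summing them component-wise."""
    merged = Counter()
    for v in vectors:
        merged.update(v)  # Counter.update adds values for matching keys
    return merged

def top_tags(vectors, n=2, relevance=None):
    """Return the n highest-scoring n-grams for a cluster of company vectors."""
    merged = merge_vectors(vectors)
    scored = relevance(merged) if relevance else merged
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [gram for gram, _ in ranked[:n]]

# Invented n-gram vectors for two companies in one cluster.
cluster = [
    {"running shoe": 3.0, "sandal": 1.0},
    {"running shoe": 2.0, "boot": 4.0},
]
tags = top_tags(cluster, n=2)
```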
Visualization
Starting again from the notion that similar companies are close together in our semantic vector space, there is a lot of potential value in being able to visualize the clusters or segments of similar companies within a broad set of companies. The key requirement of the visualization technique is to be able to position entities that are close together in high dimensional space such that they are also close together in 2- or 3-dimensional space in order to preserve and visualize the similarity structure in an intuitive way. Some example techniques (sometimes referred to as manifold learning) that satisfy this requirement are t-Distributed Stochastic Neighbor Embedding (t-SNE) and Multi-Dimensional Scaling (MDS).
An operation 122 provides cluster visualization with business tags, such as the one illustrated below in
The disclosed technology provides a technique to address this issue, by post-processing the positions according to a set of desired node sizes for the entities. In one implementation, the non-overlap problem formulation for n points may be given by:
where x_i is the final layout position vector for point i to be optimized, p_i is the original position vector of point i, and r_i is the desired radius of point i in the final visualization. Conceptually, the constraints ensure that no two circular points overlap, while the system tries to minimize the total movement of points away from their original positions.
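The formulation itself is not reproduced in this text. A form consistent with the symbol definitions and the conceptual description would be the following; the squared Euclidean objective is an assumption, chosen because the convex version is later described as a Quadratic Program.

```latex
\begin{aligned}
\min_{x_1,\dots,x_n} \quad & \sum_{i=1}^{n} \lVert x_i - p_i \rVert_2^2 \\
\text{subject to} \quad & \lVert x_i - x_j \rVert_2 \ge r_i + r_j \quad \text{for all } i \ne j
\end{aligned}
```

The constraint keeps the centers of any two circles at least the sum of their radii apart, and the objective penalizes movement away from the original positions.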
The problem with the above formulation is that the constraints are not convex, thus it is not efficiently solvable. Therefore, a convex restriction is applied by modifying the constraints to ensure that any two points, in two dimensions for example, must be separated by a region defined by two parallel lines, both perpendicular to a directional constraint unit vector pointing in the direction from the original positions of point j to point i, whose width is at least r_i + r_j. This results in a smaller feasible set, leading to slightly suboptimal solutions to the above problem formulation, but the optimization problem becomes convex and can be efficiently solved as a Quadratic Program using, e.g. interior point methods or other standard convex solvers. To get closer to the optimal solution of the original problem, multiple iterations of this convex optimization are run by using the solutions x_i of the previous run to set the directional constraint unit vector for the next run.
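The convex restriction can be read as a linear constraint d · (x_i − x_j) ≥ r_i + r_j, where d is the directional constraint unit vector. The sketch below only evaluates that constraint and the original non-overlap condition for a pair of points; it does not solve the Quadratic Program. Function names are hypothetical, and two points with identical original positions would need separate handling because the direction is undefined.

```python
import math

def unit_direction(p_i, p_j):
    """Unit vector pointing from original position p_j to original position p_i."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    norm = math.hypot(dx, dy)  # zero if the points coincide (not handled here)
    return (dx / norm, dy / norm)

def satisfies_restricted(x_i, x_j, d, r_i, r_j):
    """Linear constraint: the separation measured along d is at least r_i + r_j."""
    along = d[0] * (x_i[0] - x_j[0]) + d[1] * (x_i[1] - x_j[1])
    return along >= r_i + r_j

def overlaps(x_i, x_j, r_i, r_j):
    """Original non-convex condition: the circles overlap if centers are too close."""
    return math.hypot(x_i[0] - x_j[0], x_i[1] - x_j[1]) < r_i + r_j

d = unit_direction((1.0, 0.0), (0.0, 0.0))  # points originally side by side

# Satisfies the restricted constraint, and therefore does not overlap.
ok = satisfies_restricted((2.0, 0.5), (0.0, 0.0), d, 1.0, 1.0)

# Does not overlap, yet is rejected: the restricted feasible set is smaller.
rejected = not satisfies_restricted((0.0, 3.0), (0.0, 0.0), d, 1.0, 1.0)
```

The second case shows why re-running the optimization with updated direction vectors can recover layouts that a single pass excludes.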
To further optimize the computational efficiency, a large number of the constraints can be removed without much consequence, because points that are originally far apart from each other most likely will not violate the non-overlap constraint even after applying node sizes. To this end, an implementation considers constraints only between each point and its k nearest neighbors where k is much smaller than the number of points.
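Selecting the constraint pairs from each point's k nearest neighbors might look like the sketch below. It uses a brute-force search for clarity, which is an assumption made for brevity; a large layout would likely rely on a spatial index instead. The function name and positions are illustrative.

```python
import math

def knn_constraint_pairs(positions, k):
    """Return index pairs (i, j), i < j, where one point is among the
    k nearest neighbors of the other."""
    pairs = set()
    for i, p in enumerate(positions):
        others = sorted(
            (j for j in range(len(positions)) if j != i),
            key=lambda j: math.dist(p, positions[j]),
        )
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

# Invented layout: two tight pairs that are far from each other.
positions = [(0, 0), (1, 0), (10, 0), (11, 0)]
pairs = knn_constraint_pairs(positions, k=1)
```

With k = 1 only two constraints remain instead of the six needed for all pairs of four points.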
The operations 220 and 222 together provide the ability to look up companies by keywords and keywords by company. Thus, a user may input a keyword, such as “shoes”, in a user interface and get a list of companies that are related to shoes. In one implementation, the list of companies is ranked by relevance to the keyword “shoes”. The operation 222 allows looking up business tags based on companies. For example, a user may input a company name, such as “Shoe Company A”, in a user interface and get a list of keywords that are related to Shoe Company A. In one implementation, the list of keywords is ranked by relevance to “Shoe Company A.”
An example illustration of what the non-overlap algorithm 500 accomplishes is shown in
In one product application, the customers or prospects of a client are analyzed by clustering them based on our semantic representations, either in the keyword space or the more general topic space. These clusters each consist of companies that are similar in business offerings to each other, and human understandable business tags can be extracted using our cluster tagging and relevance scoring technique. In effect, the clusters can be considered to be micro-segments on which marketers can craft specialized messages and content which resonate well with the personas of the companies in each of the micro-segments, leading to improved conversion rates.
In a related product application, a visualization of the clusters can be shown along with the business tags describing each cluster, to use as an intuitive user interface for clients to get an overview of their customers or prospects, see
Another product application provides a search engine for companies based on the most relevant business tags that are extracted. This allows marketers and sales teams to quickly search through tens of millions of businesses for specific target segments or companies that may have similar needs. For example, assume we have a client who is a hard drive manufacturer and they have several exemplary customers that specialize in video surveillance. Video surveillance companies typically have a need for large amounts of hard drive storage to archive large video files. With the search engine, the client can easily find new “video surveillance” businesses that were previously unknown to them, and reach out in a highly personalized way with relevant and successful case studies of their exemplary customers.
In a synergistic application on this platform, the search engine results are ranked by their fit scores according to a client's own trained customer model. In one implementation, the keyword search is tailored for searching businesses, because the keywords are extracted from product pages and business descriptions, using a specialized relevance algorithm, therefore yielding much more accurate search results. By coupling with fit scores, the search engine results are both accurate with respect to the keyword query and simultaneously relevant to the client's specific business.
The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, a switched fabric, point-to-point connections, and a local bus using any of a variety of bus architectures. The system memory may also be referred to as simply the memory, and includes read only memory (ROM) 24 and random-access memory (RAM) 25. A basic input/output system (BIOS) 26, containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24. The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM, DVD, or other optical media.
The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical disk drive interface 34, respectively. The drives and their associated tangible computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer 20. It should be appreciated by those skilled in the art that any type of tangible computer-readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROMs), and the like, may be used in the example operating environment.
A number of program modules may be stored on the hard disk, removable magnetic disk 29, removable optical disk 31, ROM 24, or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices, such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone (e.g., for voice input), a camera (e.g., for a natural user interface (NUI)), a joystick, a game pad, a satellite dish, a scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, computers typically include other peripheral output devices (not shown), such as speakers and printers.
The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 49. These logical connections are achieved by a communication device coupled to or a part of the computer 20; the implementations are not limited to a particular type of communications device. The remote computer 49 may be another computer, a server, a router, a network PC, a client, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in
When used in a LAN-networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53, which is one type of communications device. When used in a WAN-networking environment, the computer 20 typically includes a modem 54, a network adapter, a type of communications device, or any other type of communications device for establishing communications over the wide area network 52. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program engines depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It is appreciated that the network connections shown are examples, and that other means of, and communications devices for, establishing a communications link between the computers may be used.
In an example implementation, software or firmware instructions and data for providing a search management system, various applications, search context pipelines, search services, service, a local file index, a local or remote application content index, a provider API, a contextual application launcher, and other instructions and data may be stored in memory 22 and/or storage devices 29 or 31 and processed by the processing unit 21.
Some embodiments may comprise an article of manufacture. An article of manufacture may comprise a tangible storage medium to store logic. Examples of a storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. In one embodiment, for example, an article of manufacture may store executable computer program instructions that, when executed by a computer, cause the computer to perform methods and/or operations in accordance with the described embodiments. The executable computer program instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The executable computer program instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a computer to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
One implementation disclosed herein uses factorization, with dimensionality reduction, of the positive point-wise mutual information (PPMI) matrix of keyword phrase to “context” word co-occurrences. Context words are words that appear around the keyword phrases in natural language sentences, documents, or conversations. The reasoning is that two keyword phrases are similar if they have similar word contexts. To further capture and distinguish between longer-distance and shorter-distance contextual semantics, the context words can be segregated by zones or regions of distance away from the central keyword phrase. An example with three zones is illustrated at 902.
Subsequently, the system disclosed herein parameterizes the context zones by the window size, which defines how many word positions fall into the zone, and by the zone offset, which defines how many positions to shift the zone away from the central keyword phrase. In the example illustrated by 902, symmetric zones to the left and right of the keyword phrase are treated together but in other implementations, zones to the left versus to the right of the keyword phrase may be tracked separately, as well.
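The zone parameterization described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function name zone_contexts, the representation of each zone as an (offset, window_size) pair, and the whitespace tokenization are assumptions made for the example. Symmetric left and right positions are collected together, as in the example illustrated at 902.

```python
from typing import Dict, List, Sequence, Tuple


def zone_contexts(
    tokens: Sequence[str],
    start: int,
    end: int,
    zones: Sequence[Tuple[int, int]],
) -> Dict[int, List[str]]:
    """Collect context words per zone around the phrase tokens[start:end].

    Each zone is a hypothetical (offset, window_size) pair: the zone covers
    distances offset+1 .. offset+window_size from the edge of the phrase,
    on both the left and the right side.
    """
    contexts: Dict[int, List[str]] = {}
    for zone_id, (offset, window_size) in enumerate(zones):
        words: List[str] = []
        for distance in range(offset + 1, offset + window_size + 1):
            left = start - distance          # position to the left of the phrase
            right = end - 1 + distance       # position to the right of the phrase
            if left >= 0:
                words.append(tokens[left])
            if right < len(tokens):
                words.append(tokens[right])
        contexts[zone_id] = words
    return contexts


tokens = "we build machine learning tools for large retail companies".split()
# The phrase "machine learning" occupies tokens[2:4].
# Zone 0: window size 1, offset 0. Zone 1: window size 2, offset 1.
print(zone_contexts(tokens, 2, 4, [(0, 1), (1, 2)]))
```

Tracking left and right zones separately, as mentioned for other implementations, would only require keeping two lists per zone instead of one.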
Subsequently, the system disclosed herein forms the co-occurrence matrix of keyword phrase to context words by counting the occurrences of each pair of (w, c), where w is the keyword phrase and c is a context word within a specific zone. In some implementations, the contribution of a co-occurring pair may be weighted by the position within the zone or distance from the central keyword phrase. The co-occurrence values are aggregated over a large corpus of natural language text documents, such as news articles, crawled websites, and Wikipedia articles. The aggregated values are stored in a keyword-context matrix, as illustrated at 904 for the example of three context zones.
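One possible way to aggregate the (w, c) counts over a corpus is sketched below. The sketch assumes unweighted counts, lowercase whitespace tokenization, a known list of keyword phrases, and zones given as hypothetical (offset, window_size) pairs; the name count_cooccurrences is illustrative. Each context is keyed by its zone so that the same word in different zones occupies a different column of the keyword-context matrix.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Context = Tuple[int, str]  # (zone id, context word)


def count_cooccurrences(
    documents: Iterable[str],
    phrases: Sequence[str],
    zones: Sequence[Tuple[int, int]],
) -> Counter:
    """Count (keyword phrase, (zone, context word)) pairs over all documents."""
    counts: Counter = Counter()
    phrase_tokens = [tuple(p.lower().split()) for p in phrases]
    for doc in documents:
        tokens = doc.lower().split()
        for phrase in phrase_tokens:
            n = len(phrase)
            for start in range(len(tokens) - n + 1):
                if tuple(tokens[start:start + n]) != phrase:
                    continue  # no occurrence of this phrase at this position
                end = start + n
                for zone_id, (offset, window_size) in enumerate(zones):
                    for distance in range(offset + 1, offset + window_size + 1):
                        # Symmetric positions left and right of the phrase.
                        for pos in (start - distance, end - 1 + distance):
                            if 0 <= pos < len(tokens):
                                key = (" ".join(phrase), (zone_id, tokens[pos]))
                                counts[key] += 1
    return counts


docs = ["We build machine learning tools", "machine learning tools scale"]
counts = count_cooccurrences(docs, ["machine learning"], [(0, 1)])
print(counts[("machine learning", (0, "tools"))])
```

A position-dependent weight, as mentioned above, could replace the constant increment of 1.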
The raw co-occurrence values are not a good measure of key-phrase to context word association because certain words and phrases naturally occur more frequently than others. Instead, an implementation disclosed herein uses point-wise mutual information to measure how informative a context word is about a target key-phrase. For each cell in the matrix the system computes the point-wise mutual information as:
pmi(w,c)=log(p(w,c)/(p(w)p(c)))
where p(w, c) is the probability of the co-occurring keyword phrase and context word, p(w) is the probability of observing the keyword phrase, and p(c) is the probability of observing the context word. Larger positive PMI values mean that the words co-occur more than if they were independent. In practice, negative PMI values are unreliable because estimating them involves extremely small probabilities and requires large amounts of text as evidence. Therefore, in some implementations, only positive PMI values are considered, and negative values are replaced with 0 using:
ppmi(w,c)=max(0,pmi(w,c)).
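As an illustration, the PPMI computation can be sketched from a table of raw co-occurrence counts. The sketch assumes that probabilities are estimated as relative frequencies of the counts and that the natural logarithm is used; neither choice is specified above, and the function name ppmi_from_counts is hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Tuple

Pair = Tuple[Hashable, Hashable]  # (keyword phrase w, context c)


def ppmi_from_counts(counts: Dict[Pair, float]) -> Dict[Pair, float]:
    """Return ppmi(w, c) = max(0, pmi(w, c)) for every observed pair."""
    total = float(sum(counts.values()))
    w_totals: Dict[Hashable, float] = defaultdict(float)
    c_totals: Dict[Hashable, float] = defaultdict(float)
    for (w, c), n in counts.items():
        w_totals[w] += n
        c_totals[c] += n

    result: Dict[Pair, float] = {}
    for (w, c), n in counts.items():
        p_wc = n / total            # estimate of p(w, c)
        p_w = w_totals[w] / total   # estimate of p(w)
        p_c = c_totals[c] / total   # estimate of p(c)
        pmi = math.log(p_wc / (p_w * p_c))
        result[(w, c)] = max(0.0, pmi)  # negative values are replaced with 0
    return result


toy_counts = {
    ("ai", "learning"): 4, ("ai", "the"): 1,
    ("retail", "the"): 4, ("retail", "learning"): 1,
}
print(ppmi_from_counts(toy_counts))
```

Pairs that never co-occur are simply absent from the result, which corresponds to a zero cell in a sparse PPMI matrix.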
In some implementations, the p(c) term is also modified to give rare context words higher probabilities because very rare words can skew PMI to large values, resulting in worse performance in the downstream semantic similarity tasks. One example modification is:
p_a(c)=count(c)^a/Σ_c′ count(c′)^a
where the context counts are scaled to a power a that is between 0 and 1, which has the effect of increasing the probability of rare context words. Another possible modification is add-k smoothing, which modifies each count(c) by the addition of a positive value k, thus raising the minimum count of rare words.
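Both modifications of p(c) can be sketched in a few lines. The function names are illustrative, and the default values (a power of 0.75 and k of 1.0) are assumptions chosen for the example, not values taken from this disclosure.

```python
from typing import Dict


def power_scaled_probs(context_counts: Dict[str, float], a: float = 0.75) -> Dict[str, float]:
    """Scale each context count to the power a (0 < a < 1), then normalize."""
    powered = {c: n ** a for c, n in context_counts.items()}
    norm = sum(powered.values())
    return {c: v / norm for c, v in powered.items()}


def add_k_probs(context_counts: Dict[str, float], k: float = 1.0) -> Dict[str, float]:
    """Add a positive value k to every context count, then normalize."""
    norm = sum(n + k for n in context_counts.values())
    return {c: (n + k) / norm for c, n in context_counts.items()}


counts = {"the": 81.0, "lidar": 1.0}
print(power_scaled_probs(counts, a=0.5))  # "lidar" rises from 1/82 to 1/10
print(add_k_probs(counts, k=1.0))         # "lidar" rises from 1/82 to 2/84
```

In both cases the probability of the rare context word increases, which lowers its PMI values and reduces the skew described above.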
Once the matrix of PPMI values is formed, it is factorized, e.g., using Singular Value Decomposition, into a key-phrase-to-latent topic matrix multiplying a latent topic-to-context matrix. The rows of the key-phrase-to-latent topic matrix are the desired key-phrase vectors from which the system computes similarities between every pair of keyword phrases.
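A minimal sketch of the factorization step, assuming NumPy is available and the PPMI matrix fits in memory as a dense array, is shown below. Scaling the left singular vectors by the singular values is one common convention and is an assumption here, as are the function names; a large sparse matrix would typically call for a truncated sparse solver instead.

```python
import numpy as np


def keyphrase_vectors(ppmi_matrix: np.ndarray, k: int) -> np.ndarray:
    """Return k-dimensional key-phrase vectors from a (phrases x contexts) matrix."""
    # Thin SVD: ppmi_matrix = u @ diag(s) @ vt, singular values in descending order.
    u, s, _vt = np.linalg.svd(ppmi_matrix, full_matrices=False)
    # Keep the k strongest latent topics; each row is one key-phrase vector.
    return u[:, :k] * s[:k]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0


# Toy PPMI matrix: rows 0 and 1 share contexts, row 2 uses different contexts.
toy = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
])
vectors = keyphrase_vectors(toy, k=2)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Here k controls the dimensionality reduction: a smaller k merges more contexts into each latent topic.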
The above paragraphs disclose only one technique to produce key-phrase vectors from which similarities may be computed. Other word embedding techniques include CBOW Word2Vec, Skip-gram Word2Vec, or GloVe, though they may be used on single words rather than keyword phrases.
Using the similarity measure between all key-phrases, an implementation disclosed herein produces a list of the most similar key-phrases for every key-phrase. This additional dataset complements other features of the system disclosed herein, in particular the ability for users to search for companies using the key-phrases with which the companies have been algorithmically tagged (described in other sections of this patent). Using the outputs of the keyword similarity computation, the system disclosed herein suggests related and similar keywords for the user to add to their query. For example, when a user searches for “artificial intelligence” companies, the system can automatically suggest additional queries on “machine learning”, “deep learning”, “computer vision”, and “ai”. This reduces the burden on users to recall or think of all possible variants of similar key-phrase queries, and may even introduce new concepts or terms that the user was not aware of.
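The query suggestion step can be illustrated with a small sketch that ranks key-phrases by cosine similarity to the query phrase. The function names and the toy two-dimensional vectors are made up for the example; in practice the vectors would come from the factorization described above or from another embedding technique.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_related(
    vectors: Dict[str, Sequence[float]], query: str, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_n key-phrases most similar to the query key-phrase."""
    if query not in vectors:
        return []  # unknown key-phrase: nothing to suggest
    q = vectors[query]
    scored = [(p, cosine(q, v)) for p, v in vectors.items() if p != query]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical key-phrase vectors, for illustration only.
toy_vectors = {
    "artificial intelligence": [1.0, 0.1],
    "machine learning": [0.9, 0.2],
    "deep learning": [0.8, 0.3],
    "retail": [0.0, 1.0],
}
for phrase, score in suggest_related(toy_vectors, "artificial intelligence", top_n=2):
    print(phrase, round(score, 3))
```

Precomputing this list once for every key-phrase yields the additional dataset of most similar key-phrases referred to above.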
Another application of the similarity measure between key-phrases is the enhancement of the algorithm described above for automatically tagging companies with keywords that describe the company's business.
Subsequently, an operation 1012 forms a co-occurrence matrix of keyword phrase to context words by counting the occurrences of each pair of (w, c), where w is the keyword phrase and c is a context word within a specific zone. An operation 1014 aggregates co-occurrence values over a large corpus of natural language text documents, such as news articles, crawled websites, and Wikipedia articles. The aggregated values are stored in a keyword-context matrix at an operation 1016, as illustrated at 904.
An operation 1018 modifies the p(c) term to give rare context words higher probabilities because very rare words can skew PMI to large values, resulting in worse performance in the downstream semantic similarity tasks. Subsequently, an operation 1020 computes point-wise mutual information pmi(w, c). In some implementations, only the positive PMI values ppmi(w, c) are considered, and negative values are replaced with 0. Subsequently, an operation 1022 factorizes the matrix of PPMI values, e.g., using Singular Value Decomposition, into a key-phrase-to-latent topic matrix multiplying a latent topic-to-context matrix. An operation 1024 computes similarities between pairs of keyword phrases using the rows of the key-phrase-to-latent topic matrix.
The implementations described herein are implemented as logical steps in one or more computer systems. The logical operations may be implemented (1) as a sequence of processor-implemented steps executing in one or more computer systems and (2) as interconnected machine or circuit modules within one or more computer systems. The implementation is a matter of choice, dependent on the performance requirements of the computer system being utilized. Accordingly, the logical operations making up the implementations described herein are referred to variously as operations, steps, objects, or modules. Furthermore, it should be understood that logical operations may be performed in any order, unless explicitly claimed otherwise or a specific order is inherently necessitated by the claim language.
The above specification, examples, and data provide a complete description of the structure and use of exemplary implementations. Since many implementations can be made without departing from the spirit and scope of the claimed invention, the claims hereinafter appended define the invention. Furthermore, structural features of the different examples may be combined in yet another implementation without departing from the recited claims.
Embodiments of the present technology are disclosed herein in the context of an electronic market system. In the above description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. For example, while various features are ascribed to particular embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments, as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to the invention, as other embodiments of the invention may omit such features.
In the interest of clarity, not all of the routine functions of the implementations described herein are shown and described. It will, of course, be appreciated that in the development of any such actual implementation, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with application- and business-related constraints, and that those specific goals will vary from one implementation to another and from one developer to another.
According to one embodiment of the present invention, the components, process steps, and/or data structures disclosed herein may be implemented using various types of operating systems (OS), computing platforms, firmware, computer programs, computer languages, and/or general-purpose machines. The method can be run as a programmed process running on processing circuitry. The processing circuitry can take the form of numerous combinations of processors and operating systems, connections and networks, data stores, or a stand-alone device. The process can be implemented as instructions executed by such hardware, hardware alone, or any combination thereof. The software may be stored on a program storage device readable by a machine.
According to one embodiment of the present invention, the components, processes, and/or data structures may be implemented using machine language, assembler, C or C++, Java and/or other high level language programs running on a data processing computer such as a personal computer, workstation computer, mainframe computer, or high performance server running an OS such as Solaris® available from Sun Microsystems, Inc. of Santa Clara, Calif., Windows Vista™, Windows NT®, Windows XP PRO, and Windows® 2000, available from Microsoft Corporation of Redmond, Wash., Apple OS X-based systems, available from Apple Inc. of Cupertino, Calif., or various versions of the Unix operating system such as Linux available from a number of vendors. The method may also be implemented on a multiple-processor system, or in a computing environment including various peripherals such as input devices, output devices, displays, pointing devices, memories, storage devices, media interfaces for transferring data to and from the processor(s), and the like. In addition, such a computer system or computing environment may be networked locally, or over the Internet or other networks. Different implementations may be used and may include other types of operating systems, computing platforms, computer programs, firmware, computer languages and/or general-purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general-purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein.
In the context of the present invention, the term “processor” describes a physical computer (either stand-alone or distributed) or a virtual machine (either stand-alone or distributed) that processes or transforms data. The processor may be implemented in hardware, software, firmware, or a combination thereof.
In the context of the present technology, the term “data store” describes a hardware and/or software means or apparatus, either local or distributed, for storing digital or analog information or data. The term “data store” describes, by way of example, any such devices as random access memory (RAM), read-only memory (ROM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), Flash memory, hard drives, disk drives, floppy drives, tape drives, CD drives, DVD drives, magnetic tape devices (audio, visual, analog, digital, or a combination thereof), optical storage devices, electrically erasable programmable read-only memory (EEPROM), solid state memory devices and Universal Serial Bus (USB) storage devices, and the like. The term “data store” also describes, by way of example, databases, file systems, record systems, object oriented databases, relational databases, SQL databases, audit trails and logs, program memory, cache and buffers, and the like.
The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments of the invention. Although various embodiments of the invention have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit or scope of this invention. In particular, it should be understood that the described technology may be employed independent of a personal computer. Other embodiments are therefore contemplated. It is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative only of particular embodiments and not limiting. Changes in detail or structure may be made without departing from the basic elements of the invention as defined in the following claims.
Claims
1. A computer-implemented method, wherein one or more computing devices comprising storage and a processor are programmed to perform steps comprising:
- generating a count for keyword phrases and topics extracted from a corpus of documents, the topics being associated with the extracted keyword phrases or a portion of the extracted keyword phrases;
- determining document frequencies (DF) for each extracted keyword phrase across the corpus of documents;
- applying a term-frequency (TF)-inverse-document-frequency (IDF) (TF-IDF) transformation to each of the extracted keyword phrases to generate a respective plurality of TF-IDF vectors;
- determining a strength of each topic based on a number of extracted keyword phrases associated with that respective topic;
- determining an edge weight based on a linkage of the topic with an associated extracted keyword phrase;
- generating relevance scores relating each extracted keyword phrase to the respective company based on a strength of each of the extracted keyword phrases, the strength of each topic, and the edge weight for each topic, the strength of each of the extracted keyword phrases being equal to a TF-IDF vector associated with the extracted keyword phrase;
- applying a representation learning technique to the plurality of TF-IDF vectors and the relevance scores to generalize each respective company into at least one of a plurality of topic spaces;
- segmenting the plurality of companies into clusters by applying a clustering technique to the extracted keyword phrases for each respective company or to the plurality of topic spaces; and
- outputting the clusters of companies with respective business tags.
2. The method of claim 1, further comprising generating similarity between two of the extracted keyword phrases based on a distance metric between the two extracted keyword phrases and determining the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
3. The method of claim 2, wherein the distance metric includes a distance between the two extracted keyword phrases as either cosine distance or Euclidean distance.
4. The method of claim 1, further comprising generating similarity between two of the extracted keyword phrases based on a positive point-wise mutual information (PPMI) matrix of the two extracted keyword phrases to context words and determining the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
5. The method of claim 4, further comprising segregating the context words by regions of distances away from a central keyword phrase.
6. The method of claim 4, further comprising generating a co-occurrence matrix of the two extracted keyword phrases to context words by counting the occurrences of each pair of (w, c), wherein w is the extracted keyword phrase and c is a context word within a specific zone.
7. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second, overlapping cluster.
8. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second, non-overlapping cluster.
9. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second cluster that is larger than the first cluster.
10. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second cluster that is approximately the same size as the first cluster.
11. The method of claim 1, further comprising segmenting the plurality of companies into a first cluster and a second cluster, and extracting keywords for each of the first cluster and the second cluster.
12. The method of claim 11, further comprising generating the relevance scores relating to the first cluster and the second cluster based on a strength of each of the extracted keywords for the first cluster and the second cluster, respectively, and outputting the relevance scores relating to the first cluster and the second cluster.
13. A system, comprising:
- a processor configured to:
- generate a count for keyword phrases and topics extracted from a corpus of documents, the topics being associated with the extracted keyword phrases or a portion of the extracted keyword phrases;
- determine document frequencies (DF) for each extracted keyword phrase across the corpus of documents;
- apply a term-frequency (TF)-inverse-document-frequency (IDF) (TF-IDF) transformation to each of the extracted keyword phrases to generate a respective plurality of TF-IDF vectors;
- determine a strength of each topic based on a number of extracted keyword phrases associated with that respective topic;
- determine an edge weight based on a linkage of the topic with an associated extracted keyword phrase;
- generate relevance scores relating each extracted keyword phrase to the respective company based on a strength of each of the extracted keyword phrases, the strength of each topic, and the edge weight for each topic, the strength of each of the extracted keyword phrases being equal to a TF-IDF vector associated with the extracted keyword phrase;
- apply a representation learning technique to the plurality of TF-IDF vectors and the relevance scores to generalize each respective company into at least one of a plurality of topic spaces;
- create segments of the plurality of companies into clusters by applying a clustering technique to the extracted keyword phrases for each respective company or to the plurality of topic spaces; and
- an output configured to transmit the clusters of companies with respective business tags to another computing device, network, or system.
14. The system of claim 13, wherein the processor is further configured to generate similarity between two of the extracted keyword phrases based on a distance metric between the two extracted keyword phrases and determine the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
15. The system of claim 13, wherein the processor is further configured to generate similarity between two of the extracted keyword phrases based on a positive point-wise mutual information (PPMI) matrix of the two extracted keyword phrases to context words and determine the strength of each topic based on the number of extracted keyword phrases and the similarity between the two extracted keyword phrases.
16. The system of claim 15, wherein the processor is further configured to segregate the context words by regions of distances away from a central keyword phrase.
17. The system of claim 15, wherein the processor is further configured to generate a co-occurrence matrix of the two extracted keyword phrases to context words by counting the occurrences of each pair of (w, c), wherein w is the extracted keyword phrase and c is a context word within a specific zone.
18. The system of claim 13, wherein the processor is further configured to create segments of the plurality of companies into a first cluster and a second, overlapping cluster or a first cluster and a second, non-overlapping cluster.
19. The system of claim 13, wherein the processor is further configured to create segments of the plurality of companies into a first cluster and a second cluster that is larger than the first cluster or approximately the same size as the first cluster.
20. The system of claim 13, wherein:
- the processor is further configured to:
- create segments of the plurality of companies into a first cluster and a second cluster,
- extract keywords for each of the first cluster and the second cluster, and
- generate the relevance scores relating to the first cluster and the second cluster based on a strength of each of the extracted keywords for the first cluster and the second cluster, respectively, and
- the output is further configured to output the relevance scores relating to the first cluster and the second cluster.