MESSAGE RECOMMENDATION USING WORD ISOLATION AND CLUSTERING
A network system provides a real-time adaptive recommendation set of documents with a high statistical measure of relevancy to the requestor device. The recommendation set is optimized by analyzing the text of documents in the interest set, categorizing these documents into clusters, extracting keywords representing the themes or concepts of the documents in the clusters, and filtering a population of eligible documents accessible to the system utilizing site and/or Internet-wide search engines. The system is invoked either automatically or manually, and it develops and presents the recommendation set in real time. The recommendation set may be presented as a greeting, notification, alert, HTML fragment, fax, voicemail, or automatic classification or routing of customer e-mail, personal e-mail, job postings, and offers for sale or exchange.
The present application is a continuation and claims the priority benefit of co-pending U.S. patent application Ser. No. 11/927,450 filed Oct. 29, 2007, set to issue as U.S. Pat. No. 9,245,013 on Jan. 26, 2016, which is a continuation and claims the priority benefit of U.S. patent application Ser. No. 11/003,920 filed Dec. 3, 2004, now U.S. Pat. No. 8,645,389 issued Feb. 4, 2014, which is a continuation and claims the priority benefit of U.S. patent application Ser. No. 09/723,855 filed Nov. 27, 2000, now U.S. Pat. No. 6,845,374 issued Jan. 18, 2005, the disclosures of which are incorporated herein by reference. The present application is further related to U.S. patent application Ser. No. 14/172,731 filed Feb. 4, 2014, now U.S. Pat. No. 9,152,704 issued Oct. 6, 2015, the disclosure of which is incorporated herein by reference, which claims the priority benefit of U.S. patent application Ser. No. 09/723,855 filed Nov. 27, 2000, now U.S. Pat. No. 6,845,374 issued Jan. 18, 2005.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to a method and system for recommending relevant items to a user of an electronic network. More particularly, the present invention relates to a means of analyzing the text of documents of interest and recommending a set of documents with a high measure of statistical relevancy.
2. Description of the Related Art
Most personalization and web user analysis (also known as “clickstream”) technologies work with the system making a record of select web pages that a user has viewed, typically in a web log. A web log entry records which users looked at which web pages in the site. A typical web log entry consists of two major pieces of information: first, some form of user identifier such as an IP address, a cookie ID, or a session ID, and second, some form of page identifier such as a URL, file name, or product number. Additional information may be included, such as the page the user came from to get to the page and the time when the user requested the page. The web log entry records are collected in a file system of a web server and analyzed using software to produce charts of page requests per day, most visited pages, and so on. Such software typically relies on simple aggregations and summarizations of page requests rather than any analysis of the internal page structure and content.
Other personalization software also relies on the concept of web logs. The dominant technology is collaborative filtering, which works by observing the pages of the web site a user requests, searching for other users that have made similar requests, and suggesting pages that these other users requested. For example, if a user requests pages 1 and 2, a collaborative filtering system would find others who did the same. If the other users on the average also requested pages 3 and 4, a collaborative system would offer pages 3 and 4 as a best recommendation. Other collaborative filtering systems use statistical techniques to perform frequency analysis and more sophisticated prediction techniques using methods such as neural networks. Examples of collaborative filtering systems include NETPERCEPTIONS™, LIKE MINDS™, and WISEWIRE™. Such a system in action can be viewed at AMAZON.COM™.
Other types of collaborative filtering systems allow users to rank their interest in a group of documents. User answers are collected to develop a user profile that is compared to other user profiles. The documents viewed by others with the same profile are recommended to the user. This approach may use artificial intelligence techniques such as incremental learning methods to improve the recommendations based on user feedback. Systems using this approach include SITEHELPER™, SYSKILL & WEBERT™, FAB™, LIBRA™, and WEBWATCHER™. However, collaborative filtering is ineffective for personalizing documents with dynamic or unstructured content. For example, each auction in an auction web site or item offered in a swap web site is different and may have no logged history of previous users to which collaborative filtering can be applied. Collaborative filtering is also not effective for infrequently viewed documents or offerings of interest to only a few site visitors.
Clearly, there is a need for a system that considers not only the identifiers of the pages the user viewed but also the words in the pages viewed in order to make more focused recommendations to the user. Broadening the concept of pages to documents in general, there is a need for a recommendation system that analyzes the words in the documents in which a user has expressed interest. Such a recommendation system should support options of residing in the same computer as the web site, on a remote server, or on an end user's computer. Furthermore, the system should be able to access documents from external sources such as other web sites throughout the Internet or private networks. A flexible recommendation system should also support a scalable architecture that either uses a proprietary text search engine or leverages the search engines of other web sites or generalized Internet-wide search engines.
SUMMARY
The invention discloses methods and systems for adaptively selecting relevant documents to present to a requester. A requestor device, either a client working on a PC or a software program running on a server, automatically or manually invokes the adaptive text recommendation system (ATRS) and, based on keywords extracted from the text of related documents, a set of relevant documents is presented to the requester. The set of recommended documents is continually updated as more documents are added to the set of related documents, or interest set. ATRS adapts the choice of recommended documents based on new analysis of text contained in the interest set, categorizing the documents into clusters, extracting the keywords that capture the theme or concept of the documents in each cluster, and filtering the entire set of eligible documents in the application web site and/or other web sites to compile the set of recommended documents with a high measure of statistical relevancy.
One embodiment is an application of ATRS in an e-commerce site, such as a seller of goods or services or an auction web site. A client logging onto an e-commerce site is greeted with a recommended set of relevant goods, services, or auction items by analyzing the text of the documents representing items previously bought, ordered, or bid on. As the client selects an item from the recommended set or an item on the web page, ATRS updates the documents in the interest set, categorizes the documents in the interest set into clusters, extracts keywords from the clusters, and filters the eligible set of documents at the web site to construct a recommended set. This recommended set of documents is rebuilt possibly every time the client makes a new selection or moves to a different web page.
The recommended set of documents may be presented as a panel or HTML fragment in a web page being viewed. The recommendations may be ordered for example by the statistical measure of relevancy or by popularity of the item and filtered based on information about the client.
In an alternate embodiment, ATRS may be invoked automatically by a software program to develop a recommended set for existing clients not currently logged on. The recommendations may take the form of a notification of select clients for sales, special events, or promotions. In other alternate embodiments, the recommendations may take the form of a client alert or “push” technology data feed. Similarly, other applications of ATRS include notification of clients of upcoming television shows, entertainment, or job postings based on the analysis of the text of documents associated with these shows, entertainment or job openings in which the client has indicated previous interest.
Additional applications of ATRS include automatic classification of personal e-mail, and automatic routing of customer relations e-mail to representatives who previously successfully resolved similar types of e-mail. The recommended set may also consist of Internet bookmarks or subscriptions to publications for a “community of interest” group. Furthermore, the recommended set may be transmitted as a fax, converted to audio, video, or an alert on a pager or PDA and transmitted to the requester.
The present invention can be applied to data in general, wherein a requester device issues a request for recommended data comprising documents, audio files, video files or multimedia files and an adaptive data recommendation system would return a recommended set of such data.
In one claimed embodiment, a method for recommending email messages for further user action includes storing a plurality of email messages in memory of a computing device. The email messages have been previously responded to by a user. The computing device includes a processor and executable instructions stored in the memory of the computing device. The method includes executing the instructions stored in the memory. Upon execution of the instructions by the processor, the computing device computes similarity scores between the email messages previously responded to by the user. Each of the similarity scores indicates a level of similarity between one or more words in a first email message from among the email messages and one or more words in a second email message from among the email messages. The computing device groups the email messages into clusters based on the computed similarity scores and recommends an email message received subsequent to the clustering of the email messages for further user action. Recommending the subsequently received email message includes calculating a relevance score for the subsequently received email message based on the clusters and one or more words in the subsequently received email message.
The Assembly Module 10 assembles documents from multiple sources into an interest set. Documents in the interest set may include documents in a database considered of interest to the requester, web site pages previously viewed by the requestor in the application web site or other web sites, documents selected by the requestor from a list obtained by a search in the application web site or by an Internet-wide search, e-mail sent by the requester, documents transmitted from a remote source such as those maintained in remote servers or in other private network databases, and documents sent by fax, scanned or input into any type of computer and made available to the Assembly Module 10. For example, in an auction site, the client, presented with a list of live auction items, clicks on several auction items that are of interest, then invokes ATRS to show a set of recommended auction items.
The Pre-processing Module 30 isolates the words in the interest set and removes words that are not useful for distinguishing one document from another document. Words removed are common words in the language and non-significant words to a specific application of ATRS.
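By way of illustration only, the word isolation and removal performed by the Pre-processing Module 30 might be sketched in Python as follows. The function name `isolate_words`, the tokenizing regular expression, and both word lists are assumptions made for this example; the specification does not prescribe any of them.

```python
import re

# Hypothetical list of common words in the language; an application would supply its own.
COMMON_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}
# Hypothetical words that are non-significant for one specific application (an auction site).
APPLICATION_WORDS = {"auction", "bid"}


def isolate_words(text, application_words=APPLICATION_WORDS):
    """Return the lowercase words of text that help distinguish one document from another."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in COMMON_WORDS and w not in application_words]
```

In this sketch the application-specific list is a parameter, so that the same routine could serve applications whose non-significant words differ.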
The Clustering Module 40 groups the documents whose words have a high degree of similarity into clusters.
The Keyword Extraction Module 50 determines the keyword score for each word in a cluster and selects as keywords for the cluster those words that have the highest keyword scores and that also appear in at least a minimum number of documents specified for the application.
The Filtration Module 60 uses application parameters for assembling documents considered eligible for recommendation. Eligible documents may include documents from enterprise databases, documents from private network databases, documents from the application web site, and documents from public networks, such as the Internet. Furthermore, these documents may cover subjects in many fields including but not limited to finance, law, medicine, business, environment, education, science, and venture capital. Application parameters may include age of documents and/or client data that specify inclusion or exclusion of certain documents.
The Recommendation Module 80 calculates the relevance score for eligible documents to a cluster and ranks the eligible documents by relevance score and other application criteria. Top scoring documents are further filtered by criteria specific to the client.
The Presentation Module 90 personalizes the presentation format of the recommendations for the client. Examples of formats are e-mail, greetings to a site visitor, HTML fragment or a list of Internet sites. Any special sorting or additional filtration for the client is applied. The recommendations are converted to the desired medium, such as voicemail, fax hardcopy, file transfer transmission, or audio/video alert.
If two documents D1 and D2 have no words in common, then:
similarity(D1, D2)=0.
If the two documents have words in common, then:
where count(w, D) denotes the number of occurrences of the word w in the document D, and w ∈ D1 ∩ D2 denotes a word that appears in both D1 and D2. Many other definitions of similarity between two documents are possible.
The clustering criteria may vary depending on the application of ATRS 4. An advantageous implementation involves arranging the documents from the interest set so as to maximize the cluster score, wherein the cluster score of a cluster containing only one document is zero and the cluster score for a cluster containing more than one document is the average similarity score between the documents in the cluster.
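By way of illustration only, the cluster score described above might be computed as in the following Python sketch. The `similarity` function shown is a stand-in chosen for this example (the sum, over shared words, of the smaller of the two occurrence counts); it is an assumption and not the similarity formula of the specification, although it likewise returns zero when two documents share no words. Documents are assumed to be lists of pre-processed words.

```python
from collections import Counter
from itertools import combinations


def similarity(d1, d2):
    """Stand-in similarity (an assumption): sum of the smaller count of each shared word."""
    c1, c2 = Counter(d1), Counter(d2)
    return float(sum(min(c1[w], c2[w]) for w in c1.keys() & c2.keys()))


def cluster_score(cluster, sim=similarity):
    """Zero for a single-document cluster, else the average pairwise similarity."""
    if len(cluster) < 2:
        return 0.0
    pairs = list(combinations(cluster, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)
```

Because the similarity function is passed as a parameter, any other definition of similarity between two documents could be substituted without changing the scoring routine.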
The clustering algorithm can be any one of the well-known clustering algorithms that can be applied to maximize the clustering criterion, such as K-Means, Single-Pass, or Buckshot, which are incorporated by reference.
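By way of illustration only, a single-pass style of clustering might be sketched in Python as follows. The threshold value, the use of average similarity to a cluster's members as the assignment rule, and the stand-in `similarity` function are all assumptions made for this example. A single-pass procedure is a greedy heuristic, so this sketch should not be read as guaranteeing a maximal cluster score.

```python
from collections import Counter


def similarity(d1, d2):
    """Stand-in similarity (an assumption): sum of the smaller count of each shared word."""
    c1, c2 = Counter(d1), Counter(d2)
    return float(sum(min(c1[w], c2[w]) for w in c1.keys() & c2.keys()))


def single_pass_cluster(documents, threshold=1.0, sim=similarity):
    """Assign each document to the most similar existing cluster, or start a new one."""
    clusters = []
    for doc in documents:
        best, best_score = None, 0.0
        for cluster in clusters:
            # Average similarity between the document and the cluster's members.
            score = sum(sim(doc, member) for member in cluster) / len(cluster)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best.append(doc)
        else:
            clusters.append([doc])
    return clusters
```

Each document is examined once, in the order received, which suits an interest set that grows as the client makes new selections.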
Keyword score(w,C)=log Frequency(w,C)−log Frequency(w)
Select keywords for cluster C based on application criteria 184; for example, select keywords that have high scores and appear in several documents. Upon processing all clusters 186, the system proceeds to the balance of processing. In an alternative embodiment of the present invention, the keywords describing the theme or concept in a cluster do not necessarily appear in the text of any document, but instead summarize the theme or concept determined, for example, by a method for natural language understanding.
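By way of illustration only, the keyword score and keyword selection might be sketched in Python as follows. The sketch assumes that Frequency(w, C) is the relative frequency of word w among the words of cluster C and that Frequency(w) is its relative frequency in a background corpus that includes the cluster's documents; these definitions, the function names, and the `top_n` and `min_docs` parameters are assumptions made for this example.

```python
import math
from collections import Counter


def keyword_scores(cluster, corpus):
    """Score each cluster word as log Frequency(w, C) - log Frequency(w).

    Assumption: both frequencies are relative word frequencies, and the corpus
    contains every document of the cluster, so no corpus count is zero.
    """
    cluster_counts = Counter(w for doc in cluster for w in doc)
    corpus_counts = Counter(w for doc in corpus for w in doc)
    cluster_total = sum(cluster_counts.values())
    corpus_total = sum(corpus_counts.values())
    return {
        w: math.log(n / cluster_total) - math.log(corpus_counts[w] / corpus_total)
        for w, n in cluster_counts.items()
    }


def select_keywords(cluster, corpus, top_n=3, min_docs=2):
    """Keep the top-scoring words that appear in at least min_docs cluster documents."""
    scores = keyword_scores(cluster, corpus)
    doc_freq = Counter(w for doc in cluster for w in set(doc))
    eligible = [w for w in scores if doc_freq[w] >= min_docs]
    return sorted(eligible, key=lambda w: (-scores[w], w))[:top_n]
```

Under these assumptions, a word that is no more frequent in the cluster than in the corpus as a whole scores zero, while a word concentrated in the cluster scores above zero.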
where w ∈ keywords(C) denotes one of the keywords of cluster C.
Rank eligible documents by relevance score and other application criteria 194. Retain top scoring documents and apply other filtration criteria specific to this client 196. For example, the client may only want documents created within the last seven days. At the completion of all clusters 198, the system proceeds to the balance of processing.
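By way of illustration only, the ranking and client-specific filtration might be sketched in Python as follows. The `relevance` function is a stand-in (a weighted count of cluster keywords found in the document) and is an assumption, not the relevance formula of the specification. The tuple layout of the eligible documents and the parameter names are likewise hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def relevance(doc_words, keyword_weights):
    """Stand-in relevance (an assumption): weighted count of cluster keywords in the document."""
    counts = Counter(doc_words)
    return sum(weight * counts[w] for w, weight in keyword_weights.items())


def recommend(eligible, keyword_weights, now, top_n=10, max_age_days=7):
    """Rank eligible documents, retain the top scorers, then apply the client's age filter.

    eligible is a list of (doc_id, words, created) tuples; this layout is hypothetical.
    """
    scored = [(relevance(words, keyword_weights), doc_id, created)
              for doc_id, words, created in eligible]
    scored = [s for s in scored if s[0] > 0]      # drop documents matching no keyword
    scored.sort(key=lambda s: (-s[0], s[1]))      # highest relevance first, ties by id
    top = scored[:top_n]
    cutoff = now - timedelta(days=max_age_days)   # e.g. documents from the last seven days
    return [doc_id for _, doc_id, created in top if created >= cutoff]
```

The age filter is applied after the top-scoring documents are retained, mirroring the order of the steps described above.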
The recommendations may be presented as a set ordered by relevance score; a set ordered by popularity of document; a greeting to a site visitor; a notification of a sale, event, or promotion; a client alert, for example, a sound indicating the presence of a new document or of a new article obtained from a newswire, as in “push” data feed delivery methods; or a notification of TV shows and entertainment based on processing the descriptions of previously viewed TV programs or purchased tickets for entertainment shows. Hard copy formats in the form of postcards, letters, or fliers may also be the medium of presentation.
Another embodiment of the present invention is conversion of the recommendation set of documents into files for faxing to the client, conversion to voice and presenting it as a voicemail, a pager or audio or video alert for the client. Advantageously, such recommendations can be sent through a network and stored for later retrieval. In another embodiment, the system may serve a “community of interest” like a wine connoisseur's Internet list or chat room where the recommendation may consist of the popular magazines or web pages viewed by experts of the community of interest. Alternatively, the recommendation may be presented to the client or requester as a set of Internet bookmarks.
There are several alternative embodiments of the present invention. In a document classification application, customer e-mails sent to a company's customer service representative (CSR) department can be routed to the CSR that had successfully resolved similar e-mails containing the same issues. A similar application is the automatic classification of personal e-mail wherein ATRS processes e-mails read and/or responded to by the client, applying the clustering/keyword extraction/filtering/recommending steps to present the recommended e-mails to the client, treating the rest as miscellaneous. The client may further specify presentation of the top ten e-mails only, a very useful feature for e-mail access on wireless devices. Other classification applications are automatic routing of job postings to a job category, and automatic classification of classified advertisements or offers for sale or offers to swap items or services.
Other applications of ATRS involve research either in the Internet or in enterprise databases. For example, a client may be interested in “banking”. Instead of sifting through multitudes of documents that contain “banking”, the client may “mark” several documents and invoke ATRS to present a set of recommended documents with a high measure of statistical relevance. This research may be invoked on a periodic basis wherein ATRS presents the recommended set of documents to the client in the form of a notification or to clients in the “community of interest” application.
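By way of illustration only, the routing of customer e-mail to a CSR might be sketched in Python as follows. The sketch assumes that each CSR's previously resolved e-mails are available as lists of pre-processed words, and it routes a new e-mail to the CSR whose resolved e-mails are most similar to it on average. The `similarity` function is a stand-in, and the function and parameter names are hypothetical.

```python
from collections import Counter


def similarity(d1, d2):
    """Stand-in similarity (an assumption): sum of the smaller count of each shared word."""
    c1, c2 = Counter(d1), Counter(d2)
    return float(sum(min(c1[w], c2[w]) for w in c1.keys() & c2.keys()))


def route_email(email_words, resolved_by_csr, sim=similarity):
    """Return the CSR whose resolved e-mails best match the new e-mail, or None if none match."""
    best_csr, best_score = None, 0.0
    for csr, emails in sorted(resolved_by_csr.items()):
        if not emails:
            continue
        # Average similarity between the new e-mail and this CSR's resolved e-mails.
        score = sum(sim(email_words, e) for e in emails) / len(emails)
        if score > best_score:
            best_csr, best_score = csr, score
    return best_csr
```

An e-mail that shares no words with any resolved e-mail yields `None`, which an application might treat as the miscellaneous category.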
In another application of ATRS, online auction participants who have lost an auction are sent e-mail or other notification containing a list of auctions that are similar to the one they lost. This list is generated based on textual analysis of the description of the lost auction.
Another application of ATRS involves analyzing the text of news stories or other content being viewed by a site visitor and displaying a list of products whose descriptions contain similar themes or concepts. For example, a visitor to a web site featuring stories about pop stars might read an article about Madonna and be presented a list of Madonna-related products such as musical recordings, clothing, etc. The presentation of the recommended products might be done immediately as the site visitor is browsing, or upon returning to the web site, or in an e-mail, or other delayed form of notification.
Similarly, ATRS can work in conjunction with a regular search engine to narrow the results to a more precise recommended set of documents. In one embodiment, ATRS 4 is a front-end system of a network search engine. ATRS 4 analyzes the text of an interest set of documents, groups the interest set of documents into clusters, extracts keywords from the text of the documents grouped into the clusters, and communicates the selected keywords of the clusters to the search engine. The search engine uses these keywords to search the network for documents that match the keywords and other filtering criteria that may be set up for the application.
One implementation of the present invention is on a Linux OS running an Apache web server with a MySQL database. However, a person knowledgeable in the art will readily recognize that the present invention can be implemented on different operating systems and different web servers, with other types of databases including but not limited to Oracle and Informix.
A person knowledgeable in the art will readily recognize that the present invention can be implemented in a portable device comprising a controller; memory; storage; input accessories such as a keyboard, pressure-sensitive pad, or voice recognition equipment; a display for presenting the recommended set; and communications equipment to wirelessly connect the portable device to an information network. In one embodiment, the ATRS computer readable code can be loaded into the portable device by disk, tape, or a hardware plug-in, or downloaded from a site. In another embodiment, the logic and principles of the present invention can be designed and implemented in the circuitry of the portable device.
The foregoing described embodiments of the invention are provided as illustrations and descriptions. They are not intended to limit the invention to the precise form described. In particular, it is contemplated that the functional implementation of the invention described herein may be implemented equivalently in hardware, software, firmware, and/or other available functional components or building blocks.
Other variations and embodiments are possible in light of the above teachings, and it is thus intended that the scope of the invention not be limited by this Detailed Description, but rather by the claims that follow.
Claims
1. A method for recommending email messages for further user action, the method comprising:
- storing a plurality of email messages in memory of a computing device, the email messages having been previously responded to by a user and the computing device including a processor and executable instructions stored in the memory; and
- executing the instructions stored in the memory, wherein execution of the instructions by the processor: computes a plurality of similarity scores between the email messages previously responded to by the user, each of the similarity scores indicating a level of similarity between one or more words in a first email message from among the plurality of email messages and one or more words in a second email message from among the plurality of email messages, groups the plurality of email messages into a plurality of clusters based on the computed similarity scores, and recommends an email message received subsequent to the clustering of the email messages for further user action, wherein recommending the subsequently received email message includes calculating a relevance score for the subsequently received email message based on the plurality of clusters and one or more words in the subsequently received email message.
2. The method of claim 1, wherein execution of the instructions by the processor further includes performing pre-processing on the email messages.
3. The method of claim 1, wherein the preprocessing includes converting the email messages into a common format.
4. The method of claim 1, wherein the preprocessing includes removing non keywords that do not facilitate grouping an email message into a particular group, and wherein the keywords that are not removed from the email message are used for computing the similarity score.
5. The method of claim 4, wherein keywords for a particular group are based on frequency of a particular word appearing in email messages of the group.
6. The method of claim 4, wherein keywords for a particular group are based on a summary or concept of the group.
7. The method of claim 1, wherein the groups of emails having a similar computed score correspond to an interest set from which a corresponding set of information, stored in a database, can be used for the recommendation.
8. A system for recommending email messages for further user action, the system comprising:
- an assembly module that stores a plurality of email messages in memory of a computing device, the email messages having been previously responded to by a user and the computing device including a processor and executable instructions stored in the memory;
- a pre-processing module that computes a plurality of similarity scores between the email messages previously responded to by the user, each of the similarity scores indicating a level of similarity between one or more words in a first email message from among the plurality of email messages and one or more words in a second email message from among the plurality of email messages,
- a clustering module that groups the plurality of email messages into a plurality of clusters based on the computed similarity scores, and
- a recommendation module that recommends an email message received subsequent to the clustering of the email messages for further user action, wherein recommending the subsequently received email message includes calculating a relevance score for the subsequently received email message based on the plurality of clusters and one or more words in the subsequently received email message.
9. The system of claim 8, wherein the pre-processing module further performs one or more pre-processing processes on the email messages.
10. The system of claim 8, wherein the pre-processing processes include converting the email messages into a common format.
11. The system of claim 8, wherein the pre-processing processes include removing non keywords that do not facilitate grouping an email message into a particular group, and wherein the keywords that are not removed from the email message are used for computing the similarity score.
12. The system of claim 11, wherein keywords for a particular group are based on frequency of a particular word appearing in email messages of the group.
13. The system of claim 11, wherein keywords for a particular group are based on a summary or concept of the group.
14. The system of claim 8, wherein the groups of emails having a similar computed score correspond to an interest set from which a corresponding set of information, stored in a database, can be used for the recommendation.
15. A non-transitory computer-readable storage medium, having embodied thereon a program executable by a processor to perform a method for recommending email messages for further user action, the method comprising:
- storing a plurality of email messages in memory of a computing device, the email messages having been previously responded to by a user and the computing device including a processor and executable instructions stored in the memory;
- computing a plurality of similarity scores between the email messages previously responded to by the user, each of the similarity scores indicating a level of similarity between one or more words in a first email message from among the plurality of email messages and one or more words in a second email message from among the plurality of email messages;
- grouping the plurality of email messages into a plurality of clusters based on the computed similarity scores; and
- recommending an email message received subsequent to the clustering of the email messages for further user action, wherein recommending the subsequently received email message includes calculating a relevance score for the subsequently received email message based on the plurality of clusters and one or more words in the subsequently received email message.
16. The non-transitory computer-readable storage medium of claim 15, wherein the method further includes performing pre-processing on the email messages.
17. The non-transitory computer-readable storage medium of claim 15, wherein the preprocessing includes converting the email messages into a common format.
18. The non-transitory computer-readable storage medium of claim 15, wherein the preprocessing includes removing non keywords that do not facilitate grouping an email message into a particular group, and wherein the keywords that are not removed from the email message are used for computing the similarity score.
19. The non-transitory computer-readable storage medium of claim 18, wherein keywords for a particular group are based on frequency of a particular word appearing in email messages of the group.
20. The non-transitory computer-readable storage medium of claim 18, wherein keywords for a particular group are based on a summary or concept of the group.
21. The non-transitory computer-readable storage medium of claim 15, wherein the groups of emails having a similar computed score correspond to an interest set from which a corresponding set of information, stored in a database, can be used for the recommendation.
Type: Application
Filed: Jan 26, 2016
Publication Date: Aug 4, 2016
Inventors: Jonathan James Oliver (San Jose, CA), Wray Lindsay Buntine (Berkeley, CA), George Roumeliotis (Menlo Park, CA)
Application Number: 15/006,933