Patents by Inventor Wen-tau Yih

Wen-tau Yih has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Document summarization by maximizing informative content words

Patent number: 7702680

Abstract: Document summarization is performed by scoring individual words in sentences in a document or document cluster. Sentences from the document or document cluster are selected to form a summary based on the scores of the words contained in those sentences.

Type: Grant

Filed: November 2, 2006

Date of Patent: April 20, 2010

Assignee: Microsoft Corporation

Inventors: Wen-tau Yih, Joshua T. Goodman, Lucretia H. Vanderwende, Hisami Suzuki
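As a rough illustration of the word-scoring idea in the abstract above, the sketch below scores each word by its relative frequency in the input and then selects the sentences whose words have the highest average score. The frequency-based score, the averaging, and the names `tokenize`, `score_words`, and `summarize` are assumptions made for illustration; this listing does not describe the patent's actual scoring or selection procedure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_words(sentences):
    """Assumed scoring: a word's score is its relative frequency in the input."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def summarize(sentences, max_sentences=1):
    """Pick the sentences whose words have the highest average score."""
    scores = score_words(sentences)

    def sentence_score(sentence):
        words = tokenize(sentence)
        return sum(scores[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(sentences, key=sentence_score, reverse=True)
    chosen = set(ranked[:max_sentences])
    # Keep the selected sentences in their original document order.
    return [s for s in sentences if s in chosen]


doc = [
    "Spam filters score email messages.",
    "Good filters learn from labeled email messages.",
    "The weather was pleasant.",
]
summary = summarize(doc, max_sentences=1)
```

Here the first sentence is selected because its words ("filters", "email", "messages") recur across the input, which gives it the highest average word score.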
Classification using a cascade approach

Patent number: 7693806

Abstract: A system and method that facilitates and effectuates optimizing a classifier for greater performance in a specific region of classification that is of interest, such as a low false positive rate or a low false negative rate. A two-stage classification model can be trained and employed, where the first stage classifier is optimized over the entire classification region and the second stage classifier is optimized for the specific region of interest. During training, the entire set of training data is employed by a first stage classifier. Only data that is classified by the first stage classifier or by cross validation to fall within a region of interest is used to train the second stage classifier. During classification, data that is classified within the region of interest by the first stage classifier is given the first stage classifier's classification value; otherwise, the classification value for the instance of data from the second stage classifier is used.

Type: Grant

Filed: June 21, 2007

Date of Patent: April 6, 2010

Assignee: Microsoft Corporation

Inventors: Wen-tau Yih, Joshua T. Goodman, Geoffrey J. Hulten

Using IP address and domain for email spam filtering

Patent number: 7689652

Abstract: Email spam filtering is performed based on a combination of IP address and domain. When an email message is received, an IP address and a domain associated with the email message are determined. A cross product of the IP address (or portions of the IP address) and the domain (or portions of the domain) is calculated. If the email message is known to be either spam or non-spam, then a spam score based on the known spam status is stored in association with each (IP address, domain) pair element of the cross product. If the spam status of the email message is not known, then the (IP address, domain) pair elements of the cross product are used to look up previously determined spam scores. A combination of the previously determined spam scores is used to determine whether or not to treat the received email message as spam.

Type: Grant

Filed: January 7, 2005

Date of Patent: March 30, 2010

Assignee: Microsoft Corporation

Inventors: Manav Mishra, Elissa E. S. Murphy, Geoffrey J. Hulten, Joshua T. Goodman, Wen-tau Yih
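The cross-product bookkeeping described in the abstract above can be sketched roughly as follows. Several details are assumptions made for illustration and are not taken from the patent: the "portions" of an IP address are taken to be octet prefixes, the "portions" of a domain are its parent domains, the stored score is a simple spam rate, and the combination is a plain average. The names `ip_portions`, `domain_portions`, `cross_product`, and `PairScores` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def ip_portions(ip):
    """Assumed portions: the full address plus its 3- and 2-octet prefixes."""
    octets = ip.split(".")
    return [".".join(octets[:n]) for n in (4, 3, 2)]


def domain_portions(domain):
    """Assumed portions: the full domain plus each parent down to two labels."""
    labels = domain.lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]


def cross_product(ip, domain):
    """All (IP portion, domain portion) pairs for one message."""
    return list(product(ip_portions(ip), domain_portions(domain)))


class PairScores:
    """Stores per-pair spam counts; the count-based score is an assumption."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # pair -> [spam, total]

    def record(self, ip, domain, is_spam):
        """Update every pair of the cross product from a labeled message."""
        for pair in cross_product(ip, domain):
            self.counts[pair][0] += int(is_spam)
            self.counts[pair][1] += 1

    def score(self, ip, domain):
        """Average the stored spam rates of previously seen pairs, else None."""
        rates = [
            self.counts[p][0] / self.counts[p][1]
            for p in cross_product(ip, domain)
            if p in self.counts
        ]
        return sum(rates) / len(rates) if rates else None


store = PairScores()
store.record("203.0.113.7", "mail.example.com", is_spam=True)
store.record("203.0.113.9", "example.com", is_spam=False)
unknown_score = store.score("203.0.113.50", "news.example.com")
```

In this example the unlabeled message shares only the pairs `("203.0.113", "example.com")` and `("203.0", "example.com")` with earlier mail, so its score is the average of those two stored rates. A caller would then compare the score against a threshold of its own choosing.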
CONSISTENT PHRASE RELEVANCE MEASURES

Publication number: 20090319508

Abstract: Two methods for measuring keyword-document relevance are described. The methods receive a keyword and a document as input and output a probability value for the keyword. The first method is a similarity-based approach which uses techniques for measuring similarity between two short-text segments to measure relevance between the keyword and the document. The second method is a regression-based approach based on an assumption that if an out-of-document phrase (the keyword) is semantically similar to an in-document phrase, then the relevance scores of the in-document and out-of-document phrases should be close to each other.

Type: Application

Filed: June 24, 2008

Publication date: December 24, 2009

Applicant: Microsoft Corporation

Inventors: Wen-tau Yih, Christopher A. Meek

SIMILIARITY MEASURES FOR SHORT SEGMENTS OF TEXT

Publication number: 20090240498

Abstract: Systems and methods to perform short text segment similarity measures. Illustratively, a short text segment similarity environment comprises a short text engine operative to process data representative of short segments of text and an instruction set comprising at least one instruction to instruct the short text engine to process data representative of short text segment inputs according to a selected short text similarity identification paradigm. Illustratively, two or more short text segments can be received as input by the short text engine, along with a request to identify similarities among the two or more short text segments. Responsive to the request and data input, the short text engine executes a selected similarity identification technique in accordance with the short text similarity identification paradigm to process the received data and to identify similarities between the short text segment inputs.

Type: Application

Filed: March 19, 2008

Publication date: September 24, 2009

Applicant: Microsoft Corporation

Inventors: Wen-tau Yih, Alexei V. Bocharov, Christopher A. Meek

RAISING THE BASELINE FOR HIGH-PRECISION TEXT CLASSIFIERS

Publication number: 20090157720

Abstract: The claimed subject matter provides systems and/or methods for normalizing document representations for use with Naïve Bayes. The system can include devices and components that determine norms associated with documents by aggregating absolute term weight values associated with the documents, further ascertain term weights for features associated with the documents, and thereafter divide the term weights for the features by the norms associated with the documents to produce a normalized document representation that can be utilized by arbitrary linear classifiers.

Type: Application

Filed: December 12, 2007

Publication date: June 18, 2009

Applicant: Microsoft Corporation

Inventors: Aleksander Kolcz, Wen-tau Yih
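The normalization step in the abstract above (sum the absolute term weights to get a norm, then divide each term weight by that norm) can be sketched as below. Using raw term counts as the term weights and the function name `l1_normalize` are assumptions for illustration; the listing does not say how the weights themselves are obtained.

```python
def l1_normalize(term_weights):
    """Divide each term weight by the sum of absolute weights (the L1 norm)."""
    norm = sum(abs(w) for w in term_weights.values())
    if norm == 0:
        return dict(term_weights)  # nothing to scale in an all-zero document
    return {term: w / norm for term, w in term_weights.items()}


# Hypothetical documents, with raw term counts standing in for term weights.
short_doc = {"offer": 2.0, "free": 2.0}
long_doc = {"offer": 20.0, "free": 20.0, "meeting": 10.0}

normalized_short = l1_normalize(short_doc)
normalized_long = l1_normalize(long_doc)
```

After normalization the absolute weights of each document sum to 1, so a long document no longer dominates a linear classifier's score simply because it contains more terms.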
CLASSIFICATION USING A CASCADE APPROACH

Publication number: 20080319932

Abstract: A system and method that facilitates and effectuates optimizing a classifier for greater performance in a specific region of classification that is of interest, such as a low false positive rate or a low false negative rate. A two-stage classification model can be trained and employed, where the first stage classifier is optimized over the entire classification region and the second stage classifier is optimized for the specific region of interest. During training, the entire set of training data is employed by a first stage classifier. Only data that is classified by the first stage classifier or by cross validation to fall within a region of interest is used to train the second stage classifier. During classification, data that is classified within the region of interest by the first stage classifier is given the first stage classifier's classification value; otherwise, the classification value for the instance of data from the second stage classifier is used.

Type: Application

Filed: June 21, 2007

Publication date: December 25, 2008

Applicant: Microsoft Corporation

Inventors: Wen-tau Yih, Joshua T. Goodman, Geoffrey J. Hulten

Training filters for detecting spasm based on IP addresses and text-related features

Patent number: 7464264

Abstract: The subject invention provides for an intelligent quarantining system and method that facilitates detecting and preventing spam. In particular, the invention employs a machine learning filter specifically trained using origination features such as an IP address as well as destination features such as a URL. Moreover, the system and method involve training a plurality of filters using specific feature data for each filter. The filters are trained independently of each other, so that one feature may not unduly influence another feature in determining whether a message is spam. Because multiple filters are trained and available to scan messages either individually or in combination (at least two filters), the filtering or spam detection process can be generalized to new messages having slightly modified features (e.g., IP address). The invention also involves locating the appropriate IP addresses or URLs in a message as well as guiding filters to weigh origination or destination features more than text-based features.

Type: Grant

Filed: March 25, 2004

Date of Patent: December 9, 2008

Assignee: Microsoft Corporation

Inventors: Joshua T. Goodman, Robert L. Rounthwaite, Geoffrey J. Hulten, Wen-tau Yih

Document summarization by maximizing informative content words

Publication number: 20080109425

Abstract: Document summarization is performed by scoring individual words in sentences in a document or document cluster. Sentences from the document or document cluster are selected to form a summary based on the scores of the words contained in those sentences.

Type: Application

Filed: November 2, 2006

Publication date: May 8, 2008

Applicant: Microsoft Corporation

Inventors: Wen-tau Yih, Joshua T. Goodman, Lucretia H. Vanderwende, Hisami Suzuki

WEB DOCUMENT KEYWORD AND PHRASE EXTRACTION

Publication number: 20070112764

Abstract: Extraction analysis techniques biased, in part, by query frequency information from a query log file and/or search engine cache are employed along with machine learning processes to determine candidate keywords and/or phrases of web documents. Web oriented features associated with the candidate keywords and/or phrases are also utilized to analyze the web documents. A keyword and/or phrase extraction mechanism can be utilized to score keywords and/or phrases in a web document and estimate a likelihood that the keywords and/or phrases are relevant, for example, in an advertising system and the like.

Type: Application

Filed: January 3, 2007

Publication date: May 17, 2007

Applicant: Microsoft Corporation

Inventors: Wen-tau Yih, Joshua Goodman, Vitor de Carvalho

Weighted linear model

Publication number: 20070083357

Abstract: A weighted linear word alignment model linearly combines weighted features to score a word alignment for a bilingual, aligned pair of text fragments. The features are each weighted by a feature weight. One of the features is a word association metric, which may be generated from surface statistics.

Type: Application

Filed: July 12, 2006

Publication date: April 12, 2007

Inventors: Robert Moore, Wen-tau Yih, Galen Andrew, Kristina Toutanova
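The linear combination described in the abstract above amounts to a weighted sum of feature values for a candidate word alignment. The sketch below shows that sum under stated assumptions: only the word association feature is named in the abstract, while the `unaligned_words` and `distortion` features, all numeric values, and the function name `alignment_score` are hypothetical.

```python
def alignment_score(feature_values, feature_weights):
    """Linear model: the sum of each feature value times its feature weight."""
    return sum(
        feature_weights[name] * value for name, value in feature_values.items()
    )


# Hypothetical weights; in practice these would be learned from data.
weights = {"word_association": 1.0, "unaligned_words": -0.5, "distortion": -0.25}

# Hypothetical feature values for two candidate alignments of one sentence pair.
candidate_a = {"word_association": 4.0, "unaligned_words": 2.0, "distortion": 4.0}
candidate_b = {"word_association": 3.0, "unaligned_words": 0.0, "distortion": 0.0}

best = max([candidate_a, candidate_b], key=lambda f: alignment_score(f, weights))
```

With these made-up numbers, candidate B wins despite its lower word association value, because candidate A is penalized for unaligned words and distortion.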
Using IP address and domain for email spam filtering

Publication number: 20060168041

Abstract: Email spam filtering is performed based on a combination of IP address and domain. When an email message is received, an IP address and a domain associated with the email message are determined. A cross product of the IP address (or portions of the IP address) and the domain (or portions of the domain) is calculated. If the email message is known to be either spam or non-spam, then a spam score based on the known spam status is stored in association with each (IP address, domain) pair element of the cross product. If the spam status of the email message is not known, then the (IP address, domain) pair elements of the cross product are used to look up previously determined spam scores. A combination of the previously determined spam scores is used to determine whether or not to treat the received email message as spam.

Type: Application

Filed: January 7, 2005

Publication date: July 27, 2006

Applicant: Microsoft Corporation

Inventors: Manav Mishra, Elissa Murphy, Geoffrey Hulten, Joshua Goodman, Wen-tau Yih

Training filters for IP address and URL learning

Publication number: 20040260922

Abstract: The subject invention provides for an intelligent quarantining system and method that facilitates detecting and preventing spam. In particular, the invention employs a machine learning filter specifically trained using origination features such as an IP address as well as destination features such as a URL. Moreover, the system and method involve training a plurality of filters using specific feature data for each filter. The filters are trained independently of each other, so that one feature may not unduly influence another feature in determining whether a message is spam. Because multiple filters are trained and available to scan messages either individually or in combination (at least two filters), the filtering or spam detection process can be generalized to new messages having slightly modified features (e.g., IP address). The invention also involves locating the appropriate IP addresses or URLs in a message as well as guiding filters to weigh origination or destination features more than text-based features.

Type: Application

Filed: March 25, 2004

Publication date: December 23, 2004

Inventors: Joshua T. Goodman, Robert L. Rounthwaite, Geoffrey J. Hulten, Wen-tau Yih
