Patents by Inventor Wen-tau Yih

Wen-tau Yih has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 7702680
    Abstract: Document summarization is performed by scoring individual words in sentences in a document or document cluster. Sentences from the document or document cluster are selected to form a summary based on the scores of the words contained in those sentences.
    Type: Grant
    Filed: November 2, 2006
    Date of Patent: April 20, 2010
    Assignee: Microsoft Corporation
    Inventors: Wen-tau Yih, Joshua T. Goodman, Lucretia H. Vanderwende, Hisami Suzuki
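
The abstract above describes scoring individual words and then selecting sentences based on those scores. A minimal sketch of that idea follows. The abstract does not say how words are scored or how word scores are combined, so the relative-frequency word score, the averaging step, and the `summarize` function name are all assumptions made for illustration, not the patented method.

```python
import re
from collections import Counter


def summarize(sentences, max_sentences=2):
    """Pick the sentences whose words have the highest average score.

    Assumption: a word's score is its relative frequency in the input,
    and a sentence's score is the mean score of its words.
    """
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    counts = Counter(word for tokens in tokenized for word in tokens)
    total = sum(counts.values())
    word_score = {word: count / total for word, count in counts.items()}

    def sentence_score(tokens):
        if not tokens:
            return 0.0
        return sum(word_score[w] for w in tokens) / len(tokens)

    # Rank sentence indices by score, keep the best ones,
    # then restore the original document order.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sentence_score(tokenized[i]),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


docs = [
    "Spam filters score email.",
    "Spam filters use spam scores.",
    "Unrelated weather today.",
]
summary = summarize(docs, max_sentences=2)
```

Averaging keeps long sentences from winning on length alone; a different combination rule could be swapped in without changing the selection step.
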
  • Patent number: 7693806
    Abstract: A system and method that facilitates and effectuates optimizing a classifier for greater performance in a specific region of classification that is of interest, such as a low false positive rate or a low false negative rate. A two-stage classification model can be trained and employed, where the first stage classification is optimized over the entire classification region and the second stage classifier is optimized for the specific region of interest. During training the entire set of training data is employed by a first stage classifier. Only data that is classified by the first stage classifier or by cross validation to fall within a region of interest is used to train the second stage classifier. During classification, data that is classified within the region of interest by the first classification is given the first stage classifier's classification value, otherwise the classification value for the instance of data from the second stage classifier is used.
    Type: Grant
    Filed: June 21, 2007
    Date of Patent: April 6, 2010
    Assignee: Microsoft Corporation
    Inventors: Wen-tau Yih, Joshua T. Goodman, Geoffrey J. Hulten
  • Patent number: 7689652
    Abstract: Email spam filtering is performed based on a combination of IP address and domain. When an email message is received, an IP address and a domain associated with the email message are determined. A cross product of the IP address (or portions of the IP address) and the domain (or portions of the domain) is calculated. If the email message is known to be either spam or non-spam, then a spam score based on the known spam status is stored in association with each (IP address, domain) pair element of the cross product. If the spam status of the email message is not known, then the (IP address, domain) pair elements of the cross product are used to look up previously determined spam scores. A combination of the previously determined spam scores is used to determine whether or not to treat the received email message as spam.
    Type: Grant
    Filed: January 7, 2005
    Date of Patent: March 30, 2010
    Assignee: Microsoft Corporation
    Inventors: Manav Mishra, Elissa E. S. Murphy, Geoffrey J. Hulten, Joshua T. Goodman, Wen-tau Yih
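
The cross-product scheme in the abstract above can be sketched as follows. The abstract leaves the details open, so several choices here are assumptions: the "portions" are taken to be IPv4 prefixes and parent domains, the stored score is the observed spam fraction per pair, and the combination is a plain average. The class and function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def ip_portions(ip):
    """Full IPv4 address plus its 3-octet and 2-octet prefixes (assumed portions)."""
    octets = ip.split(".")
    return [".".join(octets[:n]) for n in (4, 3, 2)]


def domain_portions(domain):
    """Full domain plus parent domains, stopping before the bare top-level label."""
    labels = domain.lower().split(".")
    return [".".join(labels[i:]) for i in range(max(1, len(labels) - 1))]


class PairReputation:
    """Stores spam statistics for every (IP portion, domain portion) pair."""

    def __init__(self):
        # pair -> [spam message count, total message count]
        self.counts = defaultdict(lambda: [0, 0])

    def pairs(self, ip, domain):
        return list(product(ip_portions(ip), domain_portions(domain)))

    def record(self, ip, domain, is_spam):
        """Message with known status: update every pair in the cross product."""
        for pair in self.pairs(ip, domain):
            entry = self.counts[pair]
            entry[0] += int(is_spam)
            entry[1] += 1

    def score(self, ip, domain):
        """Message with unknown status: average the stored spam fractions.

        Returns None when no pair has been seen before.
        """
        known = [self.counts[p] for p in self.pairs(ip, domain) if p in self.counts]
        if not known:
            return None
        return sum(spam / total for spam, total in known) / len(known)


rep = PairReputation()
rep.record("192.0.2.10", "mail.example.com", True)
rep.record("192.0.2.11", "mail.example.com", True)
rep.record("198.51.100.7", "news.example.org", False)
```

Because prefixes and parent domains are stored alongside the exact values, a message from a previously unseen address such as `192.0.2.99` can still be scored through the shared `192.0.2` prefix.
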
  • Publication number: 20090319508
    Abstract: Two methods for measuring keyword-document relevance are described. The methods receive a keyword and a document as input and output a probability value for the keyword. The first method is a similarity-based approach which uses techniques for measuring similarity between two short-text segments to measure relevance between the keyword and the document. The second method is a regression-based approach based on an assumption that if an out-of-document phrase (the keyword) is semantically similar to an in-document phrase, then relevance scores of the in and out-of document phrases should be close to each other.
    Type: Application
    Filed: June 24, 2008
    Publication date: December 24, 2009
    Applicant: Microsoft Corporation
    Inventors: Wen-tau Yih, Christopher A. Meek
  • Publication number: 20090240498
    Abstract: Systems and methods to perform short text segment similarity measures. Illustratively, a short text segment similarity environment comprises a short text engine operative to process data representative of short segments of text and an instruction set comprising at least one instruction to instruct the short text engine to process data representative of short text segment inputs according to a selected short text similarity identification paradigm. Illustratively, two or more short text segments can be received as input by the short text engine, along with a request to identify similarities among the two or more short text segments. Responsive to the request and data input, the short text engine executes a selected similarity identification technique in accordance with the short text similarity identification paradigm to process the received data and to identify similarities between the short text segment inputs.
    Type: Application
    Filed: March 19, 2008
    Publication date: September 24, 2009
    Applicant: Microsoft Corporation
    Inventors: Wen-tau Yih, Alexei V. Bocharov, Christopher A. Meek
  • Publication number: 20090157720
    Abstract: The claimed subject matter provides systems and/or methods for normalizing document representations for use with Naïve Bayes. The system can include devices and components that determine norms associated with documents by aggregating absolute term weight values associated with the documents, further ascertain term weights for features associated with the documents, and thereafter divide the term weights for the features associated with the documents by the norms associated with the documents to produce a normalized document representation that can be utilized by arbitrary linear classifiers.
    Type: Application
    Filed: December 12, 2007
    Publication date: June 18, 2009
    Applicant: Microsoft Corporation
    Inventors: Aleksander Kolcz, Wen-tau Yih
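
The normalization step described in the abstract above (sum the absolute term weights to get a norm, then divide each term weight by that norm) can be sketched in a few lines. The function name, the dictionary representation of a document, and the handling of an all-zero document are assumptions for illustration.

```python
def l1_normalize(term_weights):
    """Divide each term weight by the sum of absolute term weights.

    Assumption: a document is a mapping from term to weight. A document
    whose weights are all zero is returned unchanged to avoid dividing by zero.
    """
    norm = sum(abs(weight) for weight in term_weights.values())
    if norm == 0:
        return dict(term_weights)
    return {term: weight / norm for term, weight in term_weights.items()}


normalized = l1_normalize({"spam": 3.0, "offer": 1.0})
```

After this step the absolute weights of every document sum to 1, so long and short documents contribute on the same scale to a linear classifier.
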
  • Publication number: 20080319932
    Abstract: A system and method that facilitates and effectuates optimizing a classifier for greater performance in a specific region of classification that is of interest, such as a low false positive rate or a low false negative rate. A two-stage classification model can be trained and employed, where the first stage classification is optimized over the entire classification region and the second stage classifier is optimized for the specific region of interest. During training the entire set of training data is employed by a first stage classifier. Only data that is classified by the first stage classifier or by cross validation to fall within a region of interest is used to train the second stage classifier. During classification, data that is classified within the region of interest by the first classification is given the first stage classifier's classification value, otherwise the classification value for the instance of data from the second stage classifier is used.
    Type: Application
    Filed: June 21, 2007
    Publication date: December 25, 2008
    Applicant: Microsoft Corporation
    Inventors: Wen-tau Yih, Joshua T. Goodman, Geoffrey J. Hulten
  • Patent number: 7464264
    Abstract: The subject invention provides for an intelligent quarantining system and method that facilitates detecting and preventing spam. In particular, the invention employs a machine learning filter specifically trained using origination features such as an IP address as well as destination features such as a URL. Moreover, the system and method involve training a plurality of filters using specific feature data for each filter. The filters are trained independently of each other, thus one feature may not unduly influence another feature in determining whether a message is spam. Because multiple filters are trained and available to scan messages either individually or in combination (at least two filters), the filtering or spam detection process can be generalized to new messages having slightly modified features (e.g., IP address). The invention also involves locating the appropriate IP addresses or URLs in a message as well as guiding filters to weigh origination or destination features more than text-based features.
    Type: Grant
    Filed: March 25, 2004
    Date of Patent: December 9, 2008
    Assignee: Microsoft Corporation
    Inventors: Joshua T. Goodman, Robert L. Rounthwaite, Geoffrey J. Hulten, Wen-tau Yih
  • Publication number: 20080109425
    Abstract: Document summarization is performed by scoring individual words in sentences in a document or document cluster. Sentences from the document or document cluster are selected to form a summary based on the scores of the words contained in those sentences.
    Type: Application
    Filed: November 2, 2006
    Publication date: May 8, 2008
    Applicant: Microsoft Corporation
    Inventors: Wen-tau Yih, Joshua T. Goodman, Lucretia H. Vanderwende, Hisami Suzuki
  • Publication number: 20070112764
    Abstract: Extraction analysis techniques biased, in part, by query frequency information from a query log file and/or search engine cache are employed along with machine learning processes to determine candidate keywords and/or phrases of web documents. Web oriented features associated with the candidate keywords and/or phrases are also utilized to analyze the web documents. A keyword and/or phrase extraction mechanism can be utilized to score keywords and/or phrases in a web document and estimate a likelihood that the keywords and/or phrases are relevant, for example, in an advertising system and the like.
    Type: Application
    Filed: January 3, 2007
    Publication date: May 17, 2007
    Applicant: Microsoft Corporation
    Inventors: Wen-tau Yih, Joshua Goodman, Vitor de Carvalho
  • Publication number: 20070083357
    Abstract: A weighted linear word alignment model linearly combines weighted features to score a word alignment for a bilingual, aligned pair of text fragments. The features are each weighted by a feature weight. One of the features is a word association metric, which may be generated from surface statistics.
    Type: Application
    Filed: July 12, 2006
    Publication date: April 12, 2007
    Inventors: Robert Moore, Wen-tau Yih, Galen Andrew, Kristina Toutanova
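
The scoring rule in the abstract above is a weighted linear combination of feature values. A minimal sketch follows; the feature names (`association`, `distortion`, `unaligned`), the weight values, and the function name are hypothetical and chosen only to show the arithmetic, not taken from the application.

```python
def alignment_score(features, weights):
    """Score a candidate word alignment as the sum of weight * feature value."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical feature weights: reward word association,
# penalize reordering and unaligned words.
weights = {"association": 2.0, "distortion": -0.5, "unaligned": -1.0}

# Two hypothetical candidate alignments for the same sentence pair.
candidates = {
    "A": {"association": 1.5, "distortion": 1.0, "unaligned": 0.0},
    "B": {"association": 1.0, "distortion": 0.0, "unaligned": 2.0},
}

scores = {name: alignment_score(f, weights) for name, f in candidates.items()}
best = max(scores, key=scores.get)
```

With these illustrative numbers candidate A scores 2.5 and candidate B scores 0.0, so A would be preferred.
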
  • Publication number: 20060168041
    Abstract: Email spam filtering is performed based on a combination of IP address and domain. When an email message is received, an IP address and a domain associated with the email message are determined. A cross product of the IP address (or portions of the IP address) and the domain (or portions of the domain) is calculated. If the email message is known to be either spam or non-spam, then a spam score based on the known spam status is stored in association with each (IP address, domain) pair element of the cross product. If the spam status of the email message is not known, then the (IP address, domain) pair elements of the cross product are used to look up previously determined spam scores. A combination of the previously determined spam scores is used to determine whether or not to treat the received email message as spam.
    Type: Application
    Filed: January 7, 2005
    Publication date: July 27, 2006
    Applicant: Microsoft Corporation
    Inventors: Manav Mishra, Elissa Murphy, Geoffrey Hulten, Joshua Goodman, Wen-tau Yih
  • Publication number: 20040260922
    Abstract: The subject invention provides for an intelligent quarantining system and method that facilitates detecting and preventing spam. In particular, the invention employs a machine learning filter specifically trained using origination features such as an IP address as well as destination features such as a URL. Moreover, the system and method involve training a plurality of filters using specific feature data for each filter. The filters are trained independently of each other, thus one feature may not unduly influence another feature in determining whether a message is spam. Because multiple filters are trained and available to scan messages either individually or in combination (at least two filters), the filtering or spam detection process can be generalized to new messages having slightly modified features (e.g., IP address). The invention also involves locating the appropriate IP addresses or URLs in a message as well as guiding filters to weigh origination or destination features more than text-based features.
    Type: Application
    Filed: March 25, 2004
    Publication date: December 23, 2004
    Inventors: Joshua T. Goodman, Robert L. Rounthwaite, Geoffrey J. Hulten, Wen-tau Yih