Syntactic Pre-processing Steps, e.g., Stopword Elimination, Stemming, etc. (EPO) Patents (Class 707/E17.072)
  • Patent number: 8161041
    Abstract: One embodiment of the present invention provides a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase. Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.
    Type: Grant
    Filed: February 10, 2011
    Date of Patent: April 17, 2012
    Assignee: Google Inc.
    Inventors: Oleksandr Grushetskyy, Steven D. Baker
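
The abstract of patent 8161041 names two signals, co-occurrence frequency and closeness, but does not say how they are combined. The following Python sketch is one possible reading, not the patented method: it assumes that words which often share a document yet rarely sit near each other are plausible synonym candidates. The function names, the token `window`, and both thresholds are hypothetical, and the title/anchor correlations and word-form scores mentioned in the abstract are left out.

```python
from collections import Counter
from itertools import combinations


def pair_statistics(documents, window=3):
    """Return (co-occurrence counts, closeness counts) per sorted word pair."""
    cooccur = Counter()
    close = Counter()
    for text in documents:
        tokens = text.lower().split()
        # Co-occurrence: both words appear somewhere in the same document.
        for pair in combinations(sorted(set(tokens)), 2):
            cooccur[pair] += 1
        # Closeness: the words appear within `window` tokens of each other,
        # used here as a rough stand-in for "same sentence or phrase".
        near = set()
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    near.add(tuple(sorted((left, right))))
        for pair in near:
            close[pair] += 1
    return cooccur, close


def synonym_candidates(documents, window=3, min_cooccur=2, max_close_ratio=0.5):
    """Assumed rule: keep pairs that co-occur often but are seldom adjacent."""
    cooccur, close = pair_statistics(documents, window)
    candidates = []
    for pair, count in cooccur.items():
        close_ratio = close[pair] / count
        if count >= min_cooccur and close_ratio <= max_close_ratio:
            candidates.append((pair, count, close_ratio))
    # Most frequent pairs first, then the least "close" ones.
    return sorted(candidates, key=lambda c: (-c[1], c[2], c[0]))


docs = [
    "the car dealer sold one red sedan and later bought an automobile",
    "an automobile magazine reviewed several trucks before testing the new car",
]
pairs = [pair for pair, _, _ in synonym_candidates(docs)]
print(("automobile", "car") in pairs)
```

On this toy input the pair ("automobile", "car") passes the filter, but so do pairs built from function words such as "an" and "the", which is one reason stopword elimination usually precedes this kind of analysis.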
  • Publication number: 20100228762
    Abstract: In various example embodiments, a system and method to provide a query linguistic service are provided. An initial query term set is received. Phrase recognition is performed on the initial query term set to determine recognized phrases. Using the determined recognized phrases, one or more synonyms for each of the recognized phrases are determined. Results matching the initial query term set and any selected synonyms from the determined one or more synonyms are determined.
    Type: Application
    Filed: March 5, 2010
    Publication date: September 9, 2010
    Inventors: Karin Mauge', Radoslav Valentinov Petranov, Jean-David Ruvini, Antoniya T. Statelova, Neelakantan Sundaresan
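
The steps in the abstract of publication 20100228762 (phrase recognition, synonym lookup, matching) can be sketched roughly as follows. This is an illustration under stated assumptions, not the application's implementation: the phrase list, the synonym table, the greedy longest-match recognizer, and the whole-word matching are all hypothetical choices made for this example.

```python
# Hypothetical lookup tables; a real service would derive these from data.
KNOWN_PHRASES = {("digital", "camera"), ("memory", "card")}
PHRASE_SYNONYMS = {
    ("digital", "camera"): ["digicam"],
    ("memory", "card"): ["sd card"],
}


def recognize_phrases(terms, known_phrases, max_len=3):
    """Greedy longest-match phrase recognition over the query terms."""
    phrases = []
    i = 0
    while i < len(terms):
        for n in range(min(max_len, len(terms) - i), 0, -1):
            candidate = tuple(terms[i : i + n])
            # A single term always counts as a (trivial) phrase.
            if n == 1 or candidate in known_phrases:
                phrases.append(candidate)
                i += n
                break
    return phrases


def contains(text, phrase):
    """Whole-word containment check on space-normalized text."""
    return f" {phrase} " in f" {text} "


def search(query, documents, known_phrases, synonyms):
    """Return documents matching every phrase or one of its synonyms."""
    terms = query.lower().split()
    phrases = recognize_phrases(terms, known_phrases)
    # For each phrase: the phrase itself plus any synonyms selected for it.
    alternatives = [[" ".join(p)] + synonyms.get(p, []) for p in phrases]
    results = []
    for doc in documents:
        text = " ".join(doc.lower().split())
        if all(any(contains(text, alt) for alt in alts) for alts in alternatives):
            results.append(doc)
    return results


listings = [
    "Compact digicam with memory card",
    "Digital camera bag",
    "Memory card reader",
]
print(search("digital camera memory card", listings, KNOWN_PHRASES, PHRASE_SYNONYMS))
```

Only the first listing matches here: it satisfies "digital camera" through the assumed synonym "digicam" and contains "memory card" directly.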
  • Publication number: 20080228748
    Abstract: A stemming framework for combining stemming algorithms together in a multilingual environment to obtain improved stemming behavior over any individual stemming algorithm, together with a new language independent stemming algorithm based on shortest path techniques. The stemmer essentially treats the stemming problem as a simple instance of the shortest path problem where the cost for each path can be computed from its word component and its number of characters. The goal of the stemmer is to find the shortest path to construct the entire word. The stemmer uses dynamic dictionaries constructed as lexical analyzer state transition tables to recognize the various allowable word parts for any given language in order to obtain maximum speed. The stemming framework provides the necessary logic to combine multiple stemmers in parallel and to merge their results to obtain the best behavior. Mapping dictionaries handle irregular plurals, tense, phrase mapping and proper name recognition.
    Type: Application
    Filed: March 16, 2007
    Publication date: September 18, 2008
    Inventor: John Fairweather
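
The shortest-path view of stemming described in publication 20080228748 can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the application's algorithm: the dictionaries are plain sets rather than lexical analyzer state transition tables, the prefix/root/suffix ordering is a simplification, and the cost model (a fixed cost per component kind plus `1 / length`) is invented here only to show a cost that depends on the component and its number of characters.

```python
import heapq

# Hypothetical toy dictionaries; a real stemmer would load these per language.
PREFIXES = {"re", "un"}
ROOTS = {"play", "replay", "read", "do"}
SUFFIXES = {"ing", "s", "ed", "able"}

# Assumed cost model: roots are cheap, affixes are expensive, and longer
# components cost slightly less than shorter ones.
KIND_COST = {"prefix": 2.0, "root": 1.0, "suffix": 2.0}


def part_cost(kind, part):
    return KIND_COST[kind] + 1.0 / len(part)


def edges(word, pos, stage):
    """Yield (kind, part, next_stage) for components matching at `pos`.

    Stage 0: no root consumed yet (prefixes or a root may follow).
    Stage 1: root consumed (only suffixes may follow).
    """
    if stage == 0:
        kinds = (("prefix", PREFIXES, 0), ("root", ROOTS, 1))
    else:
        kinds = (("suffix", SUFFIXES, 1),)
    for kind, parts, next_stage in kinds:
        for part in parts:
            if word.startswith(part, pos):
                yield kind, part, next_stage


def stem(word):
    """Dijkstra search over (position, stage) nodes; returns the root found."""
    start, goal = (0, 0), (len(word), 1)
    best = {start: 0.0}
    root_of = {start: None}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            return root_of[node]
        if cost > best[node]:
            continue  # stale heap entry
        pos, stage = node
        for kind, part, next_stage in edges(word, pos, stage):
            nxt = (pos + len(part), next_stage)
            new_cost = cost + part_cost(kind, part)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                root_of[nxt] = part if kind == "root" else root_of[node]
                heapq.heappush(heap, (new_cost, nxt))
    return word  # no segmentation covers the word; leave it unchanged


for w in ("unreadable", "plays", "replaying", "xyz"):
    print(w, "->", stem(w))
```

With these assumed costs, "replaying" can be built as re + play + ing or as replay + ing, and the second path is cheaper, so the sketch returns "replay". Changing the cost model changes which segmentation wins, which is the design lever the shortest-path formulation exposes.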