Patents Assigned to The Board of Regents of the University of Washington, Office of Technology Transfer
  • Patent number: 6304870
    Abstract: A procedure is disclosed for automatically constructing wrappers that perform information extraction from sites, such as Internet resources, that display relevant information interspersed with extraneous text fragments such as HTML formatting commands or advertisements. The procedure has three basic steps. First, a set of example pages is collected by a subroutine named GatherExamples, which is provided with information describing how to pose example queries to the site whose wrapper is to be learned. Second, these example pages are labeled by a subroutine named LabelExamples; that is, the information to be extracted from each example is identified for use in the third step. The LabelExamples subroutine uses a general framework for labeling pages with site-specific heuristics called recognizers, and allows users to correct and modify the recognized instances. Finally, the labeled example pages are passed to a BuildWrapper subroutine, which constructs a wrapper.
    Type: Grant
    Filed: December 2, 1997
    Date of Patent: October 16, 2001
    Assignee: The Board of Regents of the University of Washington, Office of Technology Transfer
    Inventors: Nicholas Kushmerick, Daniel S. Weld, Robert B. Doorenbos
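
The three steps named in the abstract can be pictured with a small Python sketch. It is illustrative only and is not the patented method: the abstract does not say how BuildWrapper constructs a wrapper, so the delimiter-learning rule below (the longest text shared before and after every labeled instance) is an assumption, as are the in-memory `SITE` dictionary standing in for a real site, the name-list recognizer, and all function signatures. The user-correction part of LabelExamples is omitted.

```python
import os
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]                 # (start, end) offsets of one labeled instance
Labeled = List[Tuple[str, List[Span]]]


def gather_examples(queries: List[str], fetch: Callable[[str], str]) -> List[str]:
    """Step 1: pose each example query to the site and collect the result pages."""
    return [fetch(q) for q in queries]


def label_examples(pages: List[str],
                   recognizer: Callable[[str], List[Span]]) -> Labeled:
    """Step 2: mark the text to extract on each page using a recognizer."""
    return [(page, recognizer(page)) for page in pages]


def build_wrapper(labeled: Labeled) -> Callable[[str], List[str]]:
    """Step 3 (assumed rule): learn delimiters shared by all labeled instances."""
    # Reverse the text before each instance so a common *suffix* becomes a prefix.
    lefts = [page[:s][::-1] for page, spans in labeled for s, _ in spans]
    rights = [page[e:] for page, spans in labeled for _, e in spans]
    left = os.path.commonprefix(lefts)[::-1]
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found in the labeled examples")
    # Lookarounds keep delimiters out of the match, so adjacent items may share them.
    pattern = re.compile(
        "(?<=" + re.escape(left) + ")(.*?)(?=" + re.escape(right) + ")", re.S)
    return lambda page: pattern.findall(page)


# Hypothetical site: query -> page containing an ad plus the relevant names.
SITE = {
    "q1": "<p>Ad: buy now</p><b>Ann</b><br><b>Bob</b><br>",
    "q2": "<p>Ad: sale</p><b>Cy</b><br>",
}


def name_recognizer(page: str) -> List[Span]:
    """Hypothetical site-specific heuristic: find names from a known list."""
    return [m.span() for m in re.finditer(r"\b(?:Ann|Bob|Cy)\b", page)]


pages = gather_examples(["q1", "q2"], SITE.__getitem__)
wrapper = build_wrapper(label_examples(pages, name_recognizer))

# The learned wrapper extracts names the recognizer has never seen.
new_page = "<p>Ad: new!</p><b>Dee</b><br><b>Eve</b><br>"
extracted = wrapper(new_page)
```

In this sketch the recognizer only needs to be good enough to label the example pages; the wrapper then generalizes from the surrounding markup rather than from the names themselves.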