Abstract: A method for extracting tabular information from a web source by determining a plurality of coordinates for a plurality of visualized element nodes on the web source; determining a subset of the plurality of visualized element nodes based on the plurality of coordinates to obtain a candidate web table, wherein each of the subset of the plurality of visualized element nodes constitutes a logical cell of the candidate web table; determining textual content corresponding to the subset of the plurality of visualized element nodes as the textual content would appear after rendering the web source in a browser; transforming the candidate web table into an explicit representation of relative spatial relation between at least one of the logical cells; and saving the explicit representation in a structured document format.
Type:
Grant
Filed:
April 24, 2008
Date of Patent:
May 6, 2014
Assignee:
Lixto Software GmbH
Inventors:
Wolfgang Gatterbauer, Bernhard Kruepl, Paul Bohunsky, Marcus Herzog
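The coordinate-based steps in the abstract above can be illustrated with a minimal sketch. It assumes the rendered boxes have already been obtained from a browser and are passed in as dictionaries with hypothetical keys `x`, `y`, and `text`; the tolerance-based clustering, the function names, and the `<table>`/`<cell>` output layout are illustrative assumptions, not the patented method.

```python
import bisect
import xml.etree.ElementTree as ET

def cluster(values, tol):
    """Return one anchor per group of coordinates lying within `tol` of the anchor."""
    anchors = []
    for v in sorted(set(values)):
        if not anchors or v - anchors[-1] > tol:
            anchors.append(v)
    return anchors

def boxes_to_table_xml(boxes, tol=5):
    """Assign each rendered box to a (row, col) cell and serialize the grid as XML.

    `boxes` is a list of dicts with assumed keys 'x', 'y' (top-left corner
    in rendered pixels) and 'text' (the text as displayed).
    """
    col_anchors = cluster([b["x"] for b in boxes], tol)
    row_anchors = cluster([b["y"] for b in boxes], tol)
    table = ET.Element("table")
    # Emit cells in reading order: top to bottom, then left to right.
    for b in sorted(boxes, key=lambda b: (b["y"], b["x"])):
        cell = ET.SubElement(table, "cell")
        # The cell index is the position of the last anchor not exceeding the coordinate.
        cell.set("row", str(bisect.bisect_right(row_anchors, b["y"]) - 1))
        cell.set("col", str(bisect.bisect_right(col_anchors, b["x"]) - 1))
        cell.text = b["text"]
    return ET.tostring(table, encoding="unicode")

sample = [
    {"x": 10, "y": 20, "text": "Name"},
    {"x": 100, "y": 21, "text": "Price"},
    {"x": 12, "y": 50, "text": "Apple"},
    {"x": 101, "y": 52, "text": "1.20"},
]
print(boxes_to_table_xml(sample))
```

The row and column attributes make the relative spatial relation between cells explicit, so the markup structure of the original page is no longer needed once the XML is saved.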
Abstract: A method and a system for information extraction from Web pages formatted with markup languages such as HTML [8]. A method and system for interactively and visually describing information patterns of interest based on visualized sample Web pages [5,6,16-29]. A method and data structure for representing and storing these patterns [1]. A method and system for extracting information corresponding to a set of previously defined patterns from Web pages [2], and a method for transforming the extracted data into XML are described. Each pattern is defined via the (interactive) specification of one or more filters. Two or more filters for the same pattern contribute disjunctively to the pattern definition [3]; that is, an actual pattern describes the set of all targets specified by any of its filters.
Type:
Grant
Filed:
May 28, 2002
Date of Patent:
August 25, 2009
Assignee:
Lixto Software GmbH
Inventors:
Robert Baumgartner, Sergio Flesca, Georg Gottlob, Marcus Herzog
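The disjunctive role of filters described in the abstract above can be sketched briefly. Here a filter is modeled as a plain predicate over strings, which is a simplifying assumption (the abstract describes filters specified interactively on sample Web pages); the names `Filter` and `match_pattern` are hypothetical.

```python
from typing import Callable, Iterable, List

# Assumption: a filter is any predicate that accepts or rejects a candidate target.
Filter = Callable[[str], bool]

def match_pattern(filters: List[Filter], candidates: Iterable[str]) -> List[str]:
    """Return every candidate accepted by at least one filter (logical OR)."""
    return [c for c in candidates if any(f(c) for f in filters)]

# Two filters for one hypothetical "price" pattern.
price_filters: List[Filter] = [
    lambda s: s.startswith("$"),
    lambda s: s.endswith("EUR"),
]

matches = match_pattern(price_filters, ["$10", "12 EUR", "n/a"])
print(matches)
```

Adding a further filter to the list can only widen the set of matched targets, which is what "contribute disjunctively" implies.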
Abstract: A method for extracting tabular information from a web source by determining a plurality of coordinates for a plurality of visualized element nodes on the web source; determining a subset of the plurality of visualized element nodes based on the plurality of coordinates to obtain a candidate web table, wherein each of the subset of the plurality of visualized element nodes constitutes a logical cell of the candidate web table; determining textual content corresponding to the subset of the plurality of visualized element nodes as the textual content would appear after rendering the web source in a browser; transforming the candidate web table into an explicit representation of relative spatial relation between at least one of the logical cells; and saving the explicit representation in a structured document format.
Type:
Application
Filed:
April 24, 2008
Publication date:
November 27, 2008
Applicant:
LIXTO SOFTWARE GMBH
Inventors:
Wolfgang Gatterbauer, Bernhard Kruepl, Paul Bohunsky, Marcus Herzog