Abstract: A method and system for mapping labels of documents are described. A training set including a plurality of documents and at least one map can be retrieved. Each document can include a plurality of labels, and the at least one map can represent associations between the labels of one document and those of another document in the set. Each document (or group of documents) in the set can include certain features. These features can relate to the labels in the documents. Each label can correspond to one or more data points (or datasets) in each document. In one example embodiment, the map can be generated based on the features extracted from each document.
Type:
Grant
Filed:
May 13, 2020
Date of Patent:
February 13, 2024
Assignee:
FACTSET RESEARCH SYSTEM INC.
Inventors:
Yan Chen, Agrima Srivastava, Dakshina Murthy Malladi
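As a rough illustration of the label-mapping idea in the abstract of this record, the sketch below pairs each label of one document with the most similar label of another, based on per-label feature vectors. The abstract does not say which features or similarity measure are used; the cosine similarity, the function names (`cosine`, `map_labels`), and the sample feature values are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def map_labels(doc_a, doc_b):
    """Map every label in doc_a to its most similar label in doc_b.

    Each document is a dict of label -> feature vector (hypothetical layout).
    """
    mapping = {}
    for label_a, feat_a in doc_a.items():
        # Pick the doc_b label whose features are closest to this label's.
        mapping[label_a] = max(doc_b, key=lambda lb: cosine(feat_a, doc_b[lb]))
    return mapping

# Hypothetical feature vectors, invented for illustration only.
doc_a = {"Revenue": [1.0, 0.0, 0.2], "Net income": [0.0, 1.0, 0.1]}
doc_b = {"Total sales": [0.9, 0.1, 0.2], "Profit": [0.1, 0.8, 0.0]}

label_map = map_labels(doc_a, doc_b)
print(label_map)
```

A nearest-neighbour match like this is only one way such a map could be derived; the patent's training set with pre-existing maps suggests a learned model, which this sketch does not attempt.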
Abstract: There is provided a method to identify the structure of a native PDF document. The method comprises: obtaining a native PDF document having a first line that starts a table and a second line that ends the table; detecting a value of a physical feature of the native PDF document, wherein the physical feature has a corresponding weighting factor; assigning an initial value to the weighting factor; assigning a first status for the first line and a second status for the second line based on (a) the physical feature and (b) the weighting factor; and identifying a location of the table on the native PDF document from the first status and the second status, thus yielding an identified location of the table.
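The steps in this abstract can be sketched roughly as follows: each line carries measured physical features, each feature has a weighting factor with an initial value, and the weighted score decides which lines are marked as the start and end of a table. The feature names (`column_gaps`, `numeric_ratio`), the initial weights, the threshold, and the scoring rule are assumptions chosen for illustration; the abstract does not specify them, and no PDF parsing is performed here.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    features: dict  # physical feature name -> measured value (hypothetical)

# Initial values for the weighting factors (assumed, not from the patent).
INITIAL_WEIGHTS = {"column_gaps": 0.5, "numeric_ratio": 0.5}

def score(line, weights):
    """Weighted sum of the line's physical feature values."""
    return sum(w * line.features.get(name, 0.0) for name, w in weights.items())

def locate_table(lines, weights, threshold=0.5):
    """Return (start_index, end_index) of the table, or None if no line qualifies.

    The first line scoring at or above the threshold gets the "start" status,
    the last such line gets the "end" status.
    """
    table_like = [i for i, ln in enumerate(lines) if score(ln, weights) >= threshold]
    if not table_like:
        return None
    return table_like[0], table_like[-1]

# Hypothetical lines with invented feature values.
lines = [
    Line("Quarterly results", {"column_gaps": 0.0, "numeric_ratio": 0.0}),
    Line("Item 2019 2020", {"column_gaps": 1.0, "numeric_ratio": 0.6}),
    Line("Revenue 10 12", {"column_gaps": 1.0, "numeric_ratio": 0.7}),
    Line("Notes follow.", {"column_gaps": 0.0, "numeric_ratio": 0.0}),
]

location = locate_table(lines, INITIAL_WEIGHTS)
print(location)
```

In a fuller treatment the weights would presumably be adjusted after initialization; this sketch keeps them fixed at their initial values.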