Patents by Inventor Michele Dolfi

Michele Dolfi has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Ground truth generation from scanned documents

Patent number: 11017498

Abstract: A plurality of electronic documents comprising one or more document pages are received. First position markers, second position markers and page identifiers are inserted into the pages. The plurality of electronic documents are printed, thereby generating a printed corpus comprising a plurality of printed documents. The plurality of printed documents are scanned, thereby generating a scanned corpus comprising a plurality of scanned images. Scanning frame positions of the first and the second position markers are detected and the detected scanning frame positions and the page positions are used to define affine transformations between the plurality of scanned images and the corresponding document pages. The affine transformations are applied to the plurality of scanned images to align the plurality of scanned images with the corresponding document pages of the plurality of electronic documents.

Type: Grant

Filed: March 14, 2019

Date of Patent: May 25, 2021

Assignee: International Business Machines Corporation

Inventors: Peter Willem Jan Staar, Michele Dolfi, Christoph Auer, Leonidas Georgopoulos, Konstantinos Bekas
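
The alignment step in the abstract of patent 11017498 can be illustrated with a minimal sketch. It assumes exactly two detected markers per page and no shear, in which case the affine transformation reduces to a similarity transform (scale, rotation, translation) that two point pairs determine uniquely. The function names and coordinates are hypothetical and are not taken from the patent.

```python
# Sketch only: assumes two markers per page and no shear, so the affine map
# reduces to a similarity transform. Names here are illustrative, not from
# the patent.

def similarity_from_two_markers(scan_a, scan_b, page_a, page_b):
    """Return (a, b) with page = a * scan + b, points encoded as complex numbers."""
    sa, sb = complex(*scan_a), complex(*scan_b)
    pa, pb = complex(*page_a), complex(*page_b)
    if sa == sb:
        raise ValueError("markers must be detected at distinct scan positions")
    a = (pb - pa) / (sb - sa)   # combined scale and rotation
    b = pa - a * sa             # translation
    return a, b


def to_page(point, transform):
    """Map a point from scan coordinates to page coordinates."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Hypothetical marker positions: the scan is offset and twice the page scale.
transform = similarity_from_two_markers((10, 20), (110, 20), (0, 0), (50, 0))
print(to_page((60, 20), transform))
```

With three or more markers, or when shear has to be modelled, a full six-parameter affine fit (for example by least squares) would be needed instead.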
Digital image-based document digitization using a graph model

Patent number: 10885323

Abstract: A computer-implemented method for digitizing a document, wherein the document has an assigned classification scheme, may be provided. A digital image and an identifier of the classification scheme may be received, the image representing a portion of the document. A segmentation of the image may be determined into one or more image segments; for each of the image segments, content information may be captured from the image segment and a category may be assigned to the image segment, the category being selected from the classification scheme. One or more digitization segments may be selected from the segmentation. A graph model of the document may be populated, wherein each of the digitization segments is represented by a segment node of the graph model.

Type: Grant

Filed: February 28, 2019

Date of Patent: January 5, 2021

Assignee: International Business Machines Corporation

Inventors: Peter Willem Jan Staar, Michele Dolfi, Christoph Auer, Leonidas Georgopoulos, Konstantinos Bekas
TRANSLATING A NATURAL LANGUAGE QUERY INTO A FORMAL DATA QUERY

Publication number: 20200401590

Abstract: A computer-implemented method for generating ground-truth for natural language querying may include providing a knowledge graph as a data model, receiving a natural language query from a user and translating the natural language query into a formal data query. The method can also include visualizing the formal data query to the user and receiving a feedback response from the user. The feedback response can include a verified and/or edited formal data query. The method can also include storing the natural language query and the corresponding feedback response as a ground-truth pair. A corresponding system and a related computer program product may be provided.

Type: Application

Filed: June 20, 2019

Publication date: December 24, 2020

Inventors: Peter Willem Jan Staar, Michele Dolfi, Christoph Auer, Leonidas Georgopoulos, Aleksandros Sobczyk, Tim Jan Baccaert, Konstantinos Bekas
Collecting training data from TeX files

Patent number: 10824788

Abstract: A method of collecting training data of a document component may be provided. The documents have a structure and are coded in the typesetting language TeX. The method comprises receiving a TeX source file, compiling it into a PDF file and a related sync file, analyzing the PDF file, thereby determining a non-text-only document component. The method also comprises determining first coordinates of the non-text-only document component and a corresponding page number, determining a typesetting command relating to a non-text-only document component and determining second coordinates of a bounding box and a corresponding page number from the sync file, determining text elements in the non-text-only document component of the PDF file for which the first coordinates and the second coordinates overlap, and combining the determined text elements and linking them to a type of a non-text document component determined in the non-text-only document component in the TeX source file.

Type: Grant

Filed: February 8, 2019

Date of Patent: November 3, 2020

Assignee: International Business Machines Corporation

Inventors: Peter Willem Jan Staar, Michele Dolfi, Christoph Auer, Aleksandros Sobczyk, Konstantinos Bekas
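
The coordinate-overlap test described in the abstract of patent 10824788 can be sketched as an axis-aligned bounding-box check restricted to a single page. This is an illustration under assumptions: the `Box` layout (page, x0, y0, x1, y1), the function names and the sample data are hypothetical, and no real PDF or sync-file parsing is performed.

```python
# Sketch only: boxes are assumed to be axis-aligned and expressed in one shared
# coordinate system. Box, overlaps and text_in_component are hypothetical names.
from typing import NamedTuple


class Box(NamedTuple):
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True if both boxes are on the same page and their areas intersect."""
    return (
        a.page == b.page
        and a.x0 < b.x1 and b.x0 < a.x1
        and a.y0 < b.y1 and b.y0 < a.y1
    )


def text_in_component(component: Box, text_elements):
    """Collect the text of every element whose box overlaps the component box."""
    return [text for text, box in text_elements if overlaps(component, box)]


# Hypothetical data: a table region on page 2 and three text elements.
table = Box(page=2, x0=100, y0=200, x1=400, y1=350)
elements = [
    ("Header", Box(2, 110, 210, 180, 225)),   # inside the table region
    ("Caption", Box(2, 100, 360, 400, 375)),  # below the table region
    ("Header", Box(3, 110, 210, 180, 225)),   # same position, different page
]
print(text_in_component(table, elements))
```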
GRAPH BASED HYPOTHESIS COMPUTING

Publication number: 20200302307

Abstract: Embodiments of the invention disclose a computer-implemented method for the automatic generation of a hypothesis from a graph. The method includes receiving an initial graph, wherein the initial graph includes a plurality of nodes and a plurality of edges between the plurality of nodes. A predefined property of the initial graph is computed, and one or more of the plurality of edges of the initial graph are amended, thereby creating an amended graph that includes a plurality of original edges and one or more amended edges. The predefined property of the amended graph is computed, and the predefined property of the initial graph is compared with the predefined property of the amended graph. The one or more amended edges are marked as hypothesis if a predefined measure of difference between the predefined property of the initial graph and the predefined property of the amended graph exceeds a predefined threshold.

Type: Application

Filed: March 21, 2019

Publication date: September 24, 2020

Inventors: Konstantinos Bekas, Peter Staar, Christoph Auer, Michele Dolfi, Alessandro Curioni
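
The compare-and-threshold loop in the abstract of publication 20200302307 can be sketched as follows. The abstract leaves the graph property, the kind of amendment and the difference measure open, so this sketch assumes the property is the number of connected components, the amendment is the addition of one candidate edge, and the measure is the absolute difference. All names are hypothetical.

```python
# Sketch only: the property (connected components), the amendment (adding one
# edge) and the difference measure (absolute difference) are assumptions made
# for illustration; the abstract does not fix any of them.

def count_components(nodes, edges):
    """Count connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


def hypothesis_edges(nodes, edges, candidates, threshold):
    """Return candidate edges whose addition changes the property by more than threshold."""
    base = count_components(nodes, edges)
    marked = []
    for edge in candidates:
        amended = list(edges) + [edge]
        if abs(base - count_components(nodes, amended)) > threshold:
            marked.append(edge)
    return marked


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "d")]
candidates = [("a", "b"), ("b", "c")]
print(hypothesis_edges(nodes, edges, candidates, threshold=0))
```

Only the edge that joins the two previously separate components changes the property, so it is the one marked as a hypothesis.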
GROUND TRUTH GENERATION FROM SCANNED DOCUMENTS

Publication number: 20200294187

Abstract: A plurality of electronic documents comprising one or more document pages are received. First position markers, second position markers and page identifiers are inserted into the pages. The plurality of electronic documents are printed, thereby generating a printed corpus comprising a plurality of printed documents. The plurality of printed documents are scanned, thereby generating a scanned corpus comprising a plurality of scanned images. Scanning frame positions of the first and the second position markers are detected and the detected scanning frame positions and the page positions are used to define affine transformations between the plurality of scanned images and the corresponding document pages. The affine transformations are applied to the plurality of scanned images to align the plurality of scanned images with the corresponding document pages of the plurality of electronic documents.

Type: Application

Filed: March 14, 2019

Publication date: September 17, 2020

Inventors: Peter Willem Jan Staar, Michele Dolfi, Christoph Auer, Leonidas Georgopoulos, Konstantinos Bekas
DIGITAL IMAGE-BASED DOCUMENT DIGITIZATION USING A GRAPH MODEL

Publication number: 20200279107

Abstract: A computer-implemented method for digitizing a document, wherein the document has an assigned classification scheme, may be provided. A digital image and an identifier of the classification scheme may be received, the image representing a portion of the document. A segmentation of the image may be determined into one or more image segments; for each of the image segments, content information may be captured from the image segment and a category may be assigned to the image segment, the category being selected from the classification scheme. One or more digitization segments may be selected from the segmentation. A graph model of the document may be populated, wherein each of the digitization segments is represented by a segment node of the graph model.

Type: Application

Filed: February 28, 2019

Publication date: September 3, 2020

Inventors: Peter Willem Jan Staar, Michele Dolfi, Christoph Auer, Leonidas Georgopoulos, Konstantinos Bekas
COLLECTING TRAINING DATA FROM TeX FILES

Publication number: 20200257755

Abstract: A method of collecting training data of a document component may be provided. The documents have a structure and are coded in the typesetting language TeX. The method comprises receiving a TeX source file, compiling it into a PDF file and a related sync file, analyzing the PDF file, thereby determining a non-text-only document component. The method also comprises determining first coordinates of the non-text-only document component and a corresponding page number, determining a typesetting command relating to a non-text-only document component and determining second coordinates of a bounding box and a corresponding page number from the sync file, determining text elements in the non-text-only document component of the PDF file for which the first coordinates and the second coordinates overlap, and combining the determined text elements and linking them to a type of a non-text document component determined in the non-text-only document component in the TeX source file.

Type: Application

Filed: February 8, 2019

Publication date: August 13, 2020

Inventors: Peter Willem Jan Staar, Michele Dolfi, Christoph Auer, Aleksandros Sobczyk, Konstantinos Bekas