Patents by Inventor Herve Dejean

Herve Dejean has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 10803233
    Abstract: This disclosure provides an exemplary method and system for extracting structured data from an unstructured textual document. According to an exemplary method, initially a layout analysis is performed resulting in one or more alternatives for grouping and ordering the page elements of interest. Next, the content of these page elements is tagged based on application-specific heuristics. Finally, a sequence-based method is applied to the tags for identifying repetitive contiguous patterns.
    Type: Grant
    Filed: December 16, 2013
    Date of Patent: October 13, 2020
    Assignee: Conduent Business Services LLC
    Inventors: Hervé Déjean, Darren S. Schroeder
  • Publication number: 20180129944
    Abstract: A multi-page document is represented as a graph in which extracted page objects of the document, such as text blocks, are represented by nodes that are connected by intra-page edges and/or cross-page edges. The nodes and edges of the graph are associated with respective sets of features, the edge features distinguishing between intra-page and cross-page edges. A trained first model jointly predicts class labels for page objects, based on node and edge features. Page labels for the pages may be predicted, based on the page object predictions, optionally enforcing a constraint, such as a maximum of one class label for a given class, per page. The pages can be assigned a respective category, based on the predicted classes of the page objects and respective features. Information based on the predictions is output, such as one or more of the page object class labels, the page labels, and information based thereon.
    Type: Application
    Filed: November 7, 2016
    Publication date: May 10, 2018
    Applicant: Xerox Corporation
    Inventors: Jean-Luc Meunier, Hervé Déjean
  • Patent number: 9965809
    Abstract: Disclosed is a method and system for extracting a mathematical structure associated with a financial table. According to an exemplary embodiment, the method uses an LR (Left-to-Right) parser reducing stack and an LR-parser nonreducing stack to generate a final reducing stack representative of the mathematical structure.
    Type: Grant
    Filed: July 25, 2016
    Date of Patent: May 8, 2018
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20180025436
    Abstract: Disclosed is a method and system for extracting a mathematical structure associated with a financial table. According to an exemplary embodiment, the method uses an LR (Left-to-Right) parser reducing stack and an LR-parser nonreducing stack to generate a final reducing stack representative of the mathematical structure.
    Type: Application
    Filed: July 25, 2016
    Publication date: January 25, 2018
    Applicant: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 9798711
    Abstract: This disclosure provides a method and system of generating a graphical organization of a document page. According to an exemplary embodiment, the method includes identifying grid-based structures represented by graphical lines of a document page. The exemplary method includes a sequence of steps where a rectangular zone associated with the page is analyzed by looking for lines that entirely cross the zone, either horizontally or vertically. A hierarchy of grid-based structures is then identified, which can be used for analysis of the document and/or data extraction.
    Type: Grant
    Filed: December 1, 2015
    Date of Patent: October 24, 2017
    Assignee: XEROX CORPORATION
    Inventor: Hervé Déjean
  • Patent number: 9672195
    Abstract: Disclosed is a method and system that generates a page construct structure associated with a sequentially-ordered set of pages, each being characterized by a set of page construct features. N-grams, i.e., sequences of n features, are computed from a set of page construct features for n contiguous pages, and n-grams which are repetitive are selected. Pages matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is iteratively applied to this new sequence. The output is an ordered set of trees.
    Type: Grant
    Filed: December 24, 2013
    Date of Patent: June 6, 2017
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20170154025
    Abstract: This disclosure provides a method and system of generating a graphical organization of a document page. According to an exemplary embodiment, the method includes identifying grid-based structures represented by graphical lines of a document page. The exemplary method includes a sequence of steps where a rectangular zone associated with the page is analyzed by looking for lines that entirely cross the zone, either horizontally or vertically. A hierarchy of grid-based structures is then identified, which can be used for analysis of the document and/or data extraction.
    Type: Application
    Filed: December 1, 2015
    Publication date: June 1, 2017
    Applicant: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 9613267
    Abstract: This disclosure provides an exemplary method and system for extracting structured label and value pairwise textual data from a textual document. According to an exemplary method, initially a layout analysis is performed resulting in one or more alternatives for grouping and ordering the textual elements of interest. Next, textual elements are tagged as including a label term, a value term or a label and value term. Finally, a sequence-based method is applied to the tagged elements to generate one or more sequence listings representative of the label and value pairwise data structure(s) and label:value pairwise data is extracted.
    Type: Grant
    Filed: September 3, 2014
    Date of Patent: April 4, 2017
    Assignee: Xerox Corporation
    Inventors: Hervé Déjean, Thierry Lehoux, Eric H. Cheminot
  • Patent number: 9524274
    Abstract: Disclosed is a method that structures a sequentially-ordered set of elements, each being characterized by a set of features. N-grams (sequence of n features) are computed from a set for n contiguous elements, and n-grams which are repetitive (Kleene cross) are selected. Elements matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is iteratively applied to this new sequence. The output is an ordered set of trees.
    Type: Grant
    Filed: June 6, 2013
    Date of Patent: December 20, 2016
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20160063322
    Abstract: This disclosure provides an exemplary method and system for extracting structured label and value pairwise textual data from a textual document. According to an exemplary method, initially a layout analysis is performed resulting in one or more alternatives for grouping and ordering the textual elements of interest. Next, textual elements are tagged as including a label term, a value term or a label and value term. Finally, a sequence-based method is applied to the tagged elements to generate one or more sequence listings representative of the label and value pairwise data structure(s) and label:value pairwise data is extracted.
    Type: Application
    Filed: September 3, 2014
    Publication date: March 3, 2016
    Inventors: Hervé Déjean, Thierry Lehoux, Eric H. Cheminot
  • Patent number: 9224041
    Abstract: An initial organizational table for a document is determined based on textual similarity between entries of the organizational table and target text fragments and not taking into account text formatting. A classifier is trained to identify text fragment pairs consisting of entries of the organizational table and corresponding target text fragments based at least in part on text formatting features. The training employs a training set of examples annotated based on the initial organizational table. The initial organizational table is updated using the trained classifier.
    Type: Grant
    Filed: October 25, 2007
    Date of Patent: December 29, 2015
    Assignee: XEROX CORPORATION
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Patent number: 9218326
    Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line, is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
    Type: Grant
    Filed: February 23, 2011
    Date of Patent: December 22, 2015
    Assignee: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Patent number: 9189461
    Abstract: Disclosed is a method that generates a page frame structure associated with a sequentially-ordered set of pages, each being characterized by a set of page frame features. N-grams (sequence of n features) are computed from a set for n contiguous pages, and n-grams which are repetitive (Kleene cross) are selected. Pages matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is iteratively applied to this new sequence. The output is an ordered set of trees.
    Type: Grant
    Filed: July 16, 2013
    Date of Patent: November 17, 2015
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 9135249
    Abstract: Numbered sequences detection includes (i) extracting one or more numbered item token patterns from a document comprising an ordered sequence of text units, each numbered item token pattern including an incremental portion and a fixed portion that matches at least one text unit of the document and (ii) identifying at least one numbered sequence in the document conforming with a matching numbered item token pattern of the extracted one or more numbered item token patterns. The identified at least one numbered sequence comprises an ordered sub-sequence of text units of the document that match the matching numbered item token pattern. The detection may further comprise determining that a second type of numbered sequence nests in the document between consecutive text units belonging to a numbered sequence of a first type, and optimizing one or more numbered sequences of the second type based on information provided by the determining.
    Type: Grant
    Filed: May 29, 2009
    Date of Patent: September 15, 2015
    Assignee: XEROX Corporation
    Inventor: Herve Dejean
  • Patent number: 9110868
    Abstract: A system, method, and computer program product for determining the structure of a document are provided. The method includes receiving a set of document pages for a document and linking one page frame to each of a plurality of document pages in the set. For each document page linked to a page frame, a content bounding box surrounding the content on the document page is identified, and the document page categorized, based at least in part on the geometrical relationship between the page frame and the content bounding box of the document page. The document page can then be identified as a logical cut based at least in part on the categorization of the document page. Information, such as a table of contents or updated table of contents, can then be output, based on the determined logical unit(s) of the document.
    Type: Grant
    Filed: December 21, 2010
    Date of Patent: August 18, 2015
    Assignee: XEROX CORPORATION
    Inventor: Hervé Déjean
  • Publication number: 20150169510
    Abstract: This disclosure provides an exemplary method and system for extracting structured data from an unstructured textual document. According to an exemplary method, initially a layout analysis is performed resulting in one or more alternatives for grouping and ordering the page elements of interest. Next, the content of these page elements is tagged based on application-specific heuristics. Finally, a sequence-based method is applied to the tags for identifying repetitive contiguous patterns.
    Type: Application
    Filed: December 16, 2013
    Publication date: June 18, 2015
    Applicant: Xerox Corporation
    Inventors: Hervé Déjean, Darren S. Schroeder
  • Patent number: 9008443
    Abstract: A system and method for identifying regular geometric structures in a document page are disclosed. In the method, for a document page for which a set of page elements have been identified, the method includes identifying, where present, geometric relations among a subset of the page elements, from a predefined set of geometric relations, and a geometric structure comprising regular rows and regular columns, based on the identified geometric relations. Constraints of a definition of a regular geometric structure are applied to the identified geometric structure and, where the subset of page elements includes regular rows and regular columns forming a geometric structure which meets the constraints of the definition of a regular geometric structure, the subset of the page elements is identified as forming a regular geometric structure and may be labeled or tested to determine if it can be expanded by adding one or more rows or columns.
    Type: Grant
    Filed: June 22, 2012
    Date of Patent: April 14, 2015
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 9008425
    Abstract: A method of detection of numbered captions in a document includes receiving a document including a sequence of document pages and identifying illustrations on pages of the document. For each identified illustration, associated text is identified. An imitation page is generated for each of the identified illustrations, each imitation page comprising a single illustration and its associated text. For a sequence of the imitation pages, a sequence of terms is identified. Each term is derived from a text fragment of the associated text of a respective imitation page. The terms of a sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. The terms of the identified sequence of terms are construed as being at least a part of a numbered caption for a respective illustration in the document.
    Type: Grant
    Filed: January 29, 2013
    Date of Patent: April 14, 2015
    Assignee: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20150026558
    Abstract: Disclosed is a method that generates a page frame structure associated with a sequentially-ordered set of pages, each being characterized by a set of page frame features. N-grams (sequence of n features) are computed from a set for n contiguous pages, and n-grams which are repetitive (Kleene cross) are selected. Pages matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is iteratively applied to this new sequence. The output is an ordered set of trees.
    Type: Application
    Filed: July 16, 2013
    Publication date: January 22, 2015
    Inventor: Hervé Déjean
  • Publication number: 20140365872
    Abstract: Disclosed is a method that structures a sequentially-ordered set of elements, each being characterized by a set of features. N-grams (sequence of n features) are computed from a set for n contiguous elements, and n-grams which are repetitive (Kleene cross) are selected. Elements matching the most frequent repetitive n-gram are grouped together under a new node, and a new sequence is created. The method is iteratively applied to this new sequence. The output is an ordered set of trees.
    Type: Application
    Filed: June 6, 2013
    Publication date: December 11, 2014
    Inventor: Hervé Déjean
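
Illustrative note: several of the abstracts above (for example 9524274 and 20140365872) describe grouping elements that match the most frequent repetitive n-gram under a new node and then iterating on the new sequence. The Python sketch below is one possible reading of that description, not the patented implementation. It assumes each element carries a single string feature rather than a set of features, treats an n-gram as repetitive when it is immediately followed by a copy of itself, and breaks frequency ties in favor of the shorter n-gram. The names Node, repetitive_ngrams, group_once, structure and the max_n parameter are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # assumption: one string feature stands in for a feature set
    children: list = field(default_factory=list)


def repetitive_ngrams(labels, max_n):
    """Count n-grams that are immediately followed by a copy of themselves."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(labels) - 2 * n + 1):
            gram = tuple(labels[i:i + n])
            if tuple(labels[i + n:i + 2 * n]) == gram:
                counts[gram] += 1
    return counts


def group_once(seq, max_n):
    """Group runs of the most frequent repetitive n-gram; None if none exists."""
    labels = [node.label for node in seq]
    counts = repetitive_ngrams(labels, max_n)
    if not counts:
        return None
    # Most frequent n-gram wins; ties go to the shorter n-gram (an assumption).
    gram = max(counts, key=lambda g: (counts[g], -len(g)))
    n = len(gram)
    new_label = "(" + " ".join(gram) + ")+"
    out, i = [], 0
    while i < len(seq):
        j = i
        # Extend over every back-to-back occurrence of the n-gram.
        while tuple(labels[j:j + n]) == gram:
            j += n
        if j > i:
            out.append(Node(new_label, seq[i:j]))  # matched run becomes one node
            i = j
        else:
            out.append(seq[i])
            i += 1
    return out


def structure(labels, max_n=3):
    """Iterate grouping until no repetitive n-gram remains; returns ordered trees."""
    seq = [Node(label) for label in labels]
    while True:
        grouped = group_once(seq, max_n)
        if grouped is None:
            return seq
        seq = grouped


trees = structure(["title", "item", "price", "item", "price",
                   "item", "price", "total"])
print([t.label for t in trees])  # ['title', '(item price)+', 'total']
```

Each pass shortens the sequence by at least one element, so the loop terminates. The result is an ordered list of trees whose inner nodes record the repeated pattern they cover.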