Patents by Inventor Hervé Dejean

Hervé Dejean has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 8812870
    Abstract: A method and system for document processing allow a service provider to process a document without having access to the textual content of the document. The system includes memory which receives an encoded source document from an associated client system. The encoded source document includes structural information and encoded content information. The encoded content information includes a plurality of encoded tokens generated by individually encoding each of a plurality of text tokens of the source document. The structural information includes location information for each of the plurality of text tokens. A processing module processes the encoded document to generate a modified document, without decoding the encoded tokens. A transmission module transmits the modified document to an associated client system whereby the client system is able to generate a transformed document based on the modified document and the plurality of text tokens.
    Type: Grant
    Filed: October 10, 2012
    Date of Patent: August 19, 2014
    Assignee: Xerox Corporation
    Inventors: Jean-Luc Meunier, Herve Dejean
  • Publication number: 20140212038
    Abstract: A method of detection of numbered captions in a document includes receiving a document including a sequence of document pages and identifying illustrations on pages of the document. For each identified illustration, associated text is identified. An imitation page is generated for each of the identified illustrations, each imitation page comprising a single illustration and its associated text. For a sequence of the imitation pages, a sequence of terms is identified. Each term is derived from a text fragment of the associated text of a respective imitation page. The terms of a sequence comply with at least one predefined numbering scheme, which defines a form and an incremental state of the terms in a sequence. The terms of the identified sequence of terms are construed as being at least a part of a numbered caption for a respective illustration in the document.
    Type: Application
    Filed: January 29, 2013
    Publication date: July 31, 2014
    Applicant: XEROX CORPORATION
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Patent number: 8719700
    Abstract: A computer-implemented method and system for generation of page templates are provided. The method includes providing a document in computer memory. Using a computer processor, page elements within the document are identified and labeled. For each page of the document, a set of geometric relations between pairs of page elements co-occurring on the page is computed, and the set of geometric relations is associated with the page. The method also includes generating a set of page template candidates based at least in part on the computed geometric relations, selecting page templates from the set of page template candidates, and outputting the selected page templates.
    Type: Grant
    Filed: May 4, 2010
    Date of Patent: May 6, 2014
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 8706475
    Abstract: In a method for identifying a table of contents in a document, an ordered sequence of text fragments is derived from the document. A table of contents is selected as a contiguous sub-sequence of the ordered sequence of text fragments satisfying the criteria: (i) entries defined by text fragments of the table of contents each have a link to a target text fragment having textual similarity with the entry; (ii) no target text fragment lies within the table of contents; and (iii) the target text fragments have an ascending ordering corresponding to an ascending ordering of the entries defining the target text fragments.
    Type: Grant
    Filed: January 10, 2005
    Date of Patent: April 22, 2014
    Assignee: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier, Olivier Fambon
  • Publication number: 20140101456
    Abstract: A method and system for document processing allow a service provider to process a document without having access to the textual content of the document. The system includes memory which receives an encoded source document from an associated client system. The encoded source document includes structural information and encoded content information. The encoded content information includes a plurality of encoded tokens generated by individually encoding each of a plurality of text tokens of the source document. The structural information includes location information for each of the plurality of text tokens. A processing module processes the encoded document to generate a modified document, without decoding the encoded tokens. A transmission module transmits the modified document to an associated client system whereby the client system is able to generate a transformed document based on the modified document and the plurality of text tokens.
    Type: Application
    Filed: October 10, 2012
    Publication date: April 10, 2014
    Applicant: XEROX CORPORATION
    Inventors: Jean-Luc Meunier, Herve Dejean
  • Patent number: 8645819
    Abstract: A method and a system for detecting and extracting images in an electronic document are disclosed. The method includes receiving an electronic document and identifying elements of a page. The identified elements include a set of graphical elements and a set of text elements. The method may include identifying and excluding elements which serve as graphical page constructs and/or text formatting elements. The page can then be segmented, based on (remaining) graphical elements and identified white spaces, to generate a set of image blocks. Text elements that are associated with a respective image block are identified as captions. Overlapping candidate images are then grouped to form a new image. The new image can thus include candidate images which would, without the identification of their caption(s), each be treated as a respective image.
    Type: Grant
    Filed: June 17, 2011
    Date of Patent: February 4, 2014
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 8645821
    Abstract: A system and method for page frame detection for pages of a document are disclosed. The method includes receiving a set of document pages for a document, each page having at least one detected object. For each page in the set, the method includes determining dimensions of a bounding box which encompasses the detected objects of the page and determining margin dimensions, based on a position of the bounding box on the page. A page frame is computed as a combination of bounding box dimensions and margin dimensions, based on frequencies of the bounding box dimensions and margin dimensions computed for the set of pages. The computed page frame is matched to pages of the document. Information based on the matching, such as content of text objects within the matched page frame, can be output.
    Type: Grant
    Filed: September 28, 2010
    Date of Patent: February 4, 2014
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20130343658
    Abstract: A system and method for identifying regular geometric structures in a document page are disclosed. In the method, for a document page for which a set of page elements have been identified, the method includes identifying, where present, geometric relations among a subset of the page elements, from a predefined set of geometric relations, and a geometric structure comprising regular rows and regular columns, based on the identified geometric relations. Constraints of a definition of a regular geometric structure are applied to the identified geometric structure and, where the subset of page elements includes regular rows and regular columns forming a geometric structure which meets the constraints of the definition of a regular geometric structure, the subset of the page elements is identified as forming a regular geometric structure and may be labeled or tested to determine if it can be expanded by adding one or more rows or columns.
    Type: Application
    Filed: June 22, 2012
    Publication date: December 26, 2013
    Applicant: XEROX CORPORATION
    Inventor: Hervé Déjean
  • Publication number: 20130321867
    Abstract: Embodiments of a computer-implemented method for grouping one or more token elements comprising one or more characters in an input file are disclosed. The method comprises computing a first leading distance between a first baseline of a first token element and a second baseline of a second token element. The method further comprises defining a block with the first token element and the second token element, and characterizing the first leading distance as a leading distance of the block. The method further comprises computing a second leading distance between the second baseline and a third baseline of a third token element. The method furthermore comprises grouping the third token element into the block based on a first difference between the second leading distance and the leading distance of the block lying within a first predefined threshold value.
    Type: Application
    Filed: May 31, 2012
    Publication date: December 5, 2013
    Applicant: XEROX CORPORATION
    Inventor: Herve Dejean
  • Patent number: 8560937
    Abstract: A system, method, and computer program product for segmenting a document are disclosed. The method considers a zone of a document, such as a page frame or other zone which is a predetermined ratio thereof, and while there are remaining elements in the zone, iteratively tests different segmentations of the zone into n candidate columns, and computes a width of a gutter for each n-candidate. Assuming that the gutter width computed meets a threshold test, which may be based on the arrangement of the elements in the columns, and the candidate columns for the n-candidate each contain at least a threshold number of elements, elements are assigned to respective ones of n segmented columns within which they are located. For example, line elements are arranged in blocks of text within the columns, enabling a reading order for sequences of text, such as complete sentences and paragraphs, to be computed.
    Type: Grant
    Filed: June 7, 2011
    Date of Patent: October 15, 2013
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 8478046
    Abstract: A system and method for detection of signature marks in documents are provided. The method includes selecting candidate text objects in document pages and identifying a sequence of elements therein. The sequence has a numbering pattern including an incremental part and optionally a fixed part. Missing elements between two detected elements of the sequence are permitted. For an identified sequence, a model of the sequence is generated, which includes the numbering pattern of the sequence, an increment, which is computed based on the distance between pages on which consecutive elements of the sequence are identified, a valid sequence having an increment of greater than 1, and a first page, which corresponds to a page of the document on which the sequence starts. The sequence is then validated with the model, allowing elements of the sequence in the pages of the document to be identified as signature marks.
    Type: Grant
    Filed: November 3, 2011
    Date of Patent: July 2, 2013
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20130114914
    Abstract: A system and method for detection of signature marks in documents are provided. The method includes selecting candidate text objects in document pages and identifying a sequence of elements therein. The sequence has a numbering pattern including an incremental part and optionally a fixed part. Missing elements between two detected elements of the sequence are permitted. For an identified sequence, a model of the sequence is generated, which includes the numbering pattern of the sequence, an increment, which is computed based on the distance between pages on which consecutive elements of the sequence are identified, a valid sequence having an increment of greater than 1, and a first page, which corresponds to a page of the document on which the sequence starts. The sequence is then validated with the model, allowing elements of the sequence in the pages of the document to be identified as signature marks.
    Type: Application
    Filed: November 3, 2011
    Publication date: May 9, 2013
    Applicant: XEROX CORPORATION
    Inventor: Hervé Déjean
  • Patent number: 8352857
    Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.
    Type: Grant
    Filed: October 27, 2008
    Date of Patent: January 8, 2013
    Assignee: Xerox Corporation
    Inventors: Katja Filippova, Herve Dejean
  • Patent number: 8340425
    Abstract: An image of a paginated document is zoned to identify text zones. First-pass character recognition is performed on the text zones to generate textual content corresponding to the paginated document. The image of the paginated document is re-zoned based on the textual content to identify one or more new text zones. Second-pass character recognition is performed on at least the new text zones to generate updated textual content corresponding to the paginated document.
    Type: Grant
    Filed: August 10, 2010
    Date of Patent: December 25, 2012
    Assignee: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20120324341
    Abstract: A method and a system for detecting and extracting images in an electronic document are disclosed. The method includes receiving an electronic document comprising a plurality of pages and, for each of at least one of the pages of the document, identifying elements of the page. The identified elements include a set of graphical elements and a set of text elements. The method may include identifying and excluding, from the set of graphical elements, those which serve as graphical page constructs and/or text formatting elements. The page can then be segmented, based on (remaining) graphical elements and identified white spaces, to generate a set of image blocks, each including a respective one or more of the graphical elements. Text elements that are associated with a respective image block are identified as captions. Overlapping candidate images, each including an image block and its caption(s), if any, are then grouped to form a new image.
    Type: Application
    Filed: June 17, 2011
    Publication date: December 20, 2012
    Applicant: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20120317470
    Abstract: A system, method, and computer program product for segmenting a document are disclosed. The method considers a zone of a document, such as a page frame or other zone which is a predetermined ratio thereof, and while there are remaining elements in the zone, iteratively tests different segmentations of the zone into n candidate columns, and computes a width of a gutter for each n-candidate. Assuming that the gutter width computed meets a threshold test, which may be based on the arrangement of the elements in the columns, and the candidate columns for the n-candidate each contain at least a threshold number of elements, elements are assigned to respective ones of n segmented columns within which they are located. For example, line elements are arranged in blocks of text within the columns, enabling a reading order for sequences of text, such as complete sentences and paragraphs, to be computed.
    Type: Application
    Filed: June 7, 2011
    Publication date: December 13, 2012
    Applicant: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 8302002
    Abstract: A document is organized as a plurality of nodes associated with a table of contents. The nodes are clustered into a plurality of clusters based on a similarity criterion. One of the clusters is identified as corresponding to a highest or lowest level of the table of contents based on a selection criterion. The highest or lowest level is assigned to the nodes belonging to the identified cluster. The identifying and assigning are repeated to assign levels to the nodes belonging to each next highest or lowest level of the table of contents. The repeated identifying is based on the selection criteria applied disregarding nodes that have already been assigned a level. The document is structured based at least in part on the levels assigned to the table of contents nodes.
    Type: Grant
    Filed: April 27, 2005
    Date of Patent: October 30, 2012
    Assignee: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20120159313
    Abstract: A system, method, and computer program product for determining the structure of a document are provided. The method includes receiving a set of document pages for a document and linking one page frame to each of a plurality of document pages in the set. For each document page linked to a page frame, a content bounding box surrounding the content on the document page is identified, and the document page categorized, based at least in part on the geometrical relationship between the page frame and the content bounding box of the document page. The document page can then be identified as a logical cut based at least in part on the categorization of the document page. Information, such as a table of contents or updated table of contents, can then be output, based on the determined logical unit(s) of the document.
    Type: Application
    Filed: December 21, 2010
    Publication date: June 21, 2012
    Applicant: XEROX CORPORATION
    Inventor: Hervé Déjean
  • Publication number: 20120079370
    Abstract: A system and method for page frame detection for pages of a document are disclosed. The method includes receiving a set of document pages for a document, each page having at least one detected object. For each page in the set, the method includes determining dimensions of a bounding box which encompasses the detected objects of the page and determining margin dimensions, based on a position of the bounding box on the page. A page frame is computed as a combination of bounding box dimensions and margin dimensions, based on frequencies of the bounding box dimensions and margin dimensions computed for the set of pages. The computed page frame is matched to pages of the document. Information based on the matching, such as content of text objects within the matched page frame, can be output.
    Type: Application
    Filed: September 28, 2010
    Publication date: March 29, 2012
    Applicant: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20120039536
    Abstract: An image of a paginated document is zoned to identify text zones. First-pass character recognition is performed on the text zones to generate textual content corresponding to the paginated document. The image of the paginated document is re-zoned based on the textual content to identify one or more new text zones. Second-pass character recognition is performed on at least the new text zones to generate updated textual content corresponding to the paginated document.
    Type: Application
    Filed: August 10, 2010
    Publication date: February 16, 2012
    Applicant: XEROX CORPORATION
    Inventors: Hervé Déjean, Jean-Luc Meunier
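
The leading-distance grouping summarized in the abstract of publication 20130321867 above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `group_by_leading`, the representation of token elements as a list of baseline y-coordinates in reading order, and the default threshold value are all assumptions made for illustration. As in the abstract, the first two tokens of a block fix the block's leading distance, and each following token joins the block only if its own leading distance differs from the block's by no more than the threshold.

```python
from typing import List


def group_by_leading(baselines: List[float], threshold: float = 1.0) -> List[List[int]]:
    """Group token indices into blocks with a near-constant leading distance.

    `baselines` holds one baseline y-coordinate per token element, in
    reading order (an assumed, simplified input format).
    """
    blocks: List[List[int]] = []
    i = 0
    n = len(baselines)
    while i < n:
        if i + 1 >= n:
            # A trailing token with no successor forms a block on its own.
            blocks.append([i])
            break
        # The first two tokens define the block and its leading distance.
        block = [i, i + 1]
        block_leading = baselines[i + 1] - baselines[i]
        j = i + 2
        # Add each next token while its leading distance stays within the
        # threshold of the block's leading distance.
        while j < n and abs((baselines[j] - baselines[j - 1]) - block_leading) <= threshold:
            block.append(j)
            j += 1
        blocks.append(block)
        i = j
    return blocks


if __name__ == "__main__":
    # Two paragraphs with 12-unit leading, separated by a 24-unit gap.
    print(group_by_leading([100.0, 112.0, 124.0, 136.0, 160.0, 172.0, 184.0]))
    # prints [[0, 1, 2, 3], [4, 5, 6]]
```

The sketch pairs the first two tokens of each block unconditionally, which mirrors the wording of the abstract. A fuller treatment would likely also check horizontal alignment or font size before grouping, but those criteria are not described in the abstract and are left out here.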