Patents by Inventor Hervé Déjean

Hervé Déjean has filed for patents to protect the following inventions. This listing includes pending patent applications as well as patents already granted by the United States Patent and Trademark Office (USPTO).

SYSTEM AND METHOD FOR UNSUPERVISED GENERATION OF PAGE TEMPLATES

Publication number: 20110276874

Abstract: A computer-implemented method and system for generation of page templates are provided. The method includes providing a document in computer memory. Using a computer processor, page elements within the document are identified and labeled. For each page of the document, a set of geometric relations between pairs of page elements co-occurring on the page is computed, and the set of geometric relations is associated with the page. The method also includes generating a set of page template candidates based at least in part on the computed geometric relations, selecting page templates from the set of page template candidates, and outputting the selected page templates.

Type: Application

Filed: May 4, 2010

Publication date: November 10, 2011

Applicant: Xerox Corporation

Inventor: Hervé Déjean

Systems and methods for notes detection

Patent number: 8023740

Abstract: To perform notes detection, candidate reference marks are identified in a document. A starting note zone is identified in the document. A pair of similar reference marks is identified from the candidate reference marks including a first reference mark in the note zone and a second reference mark outside the note zone. The document is marked up to indicate a note associated with the first and second reference marks.

Type: Grant

Filed: August 13, 2007

Date of Patent: September 20, 2011

Assignee: Xerox Corporation

Inventor: Hervé Déjean

Method and apparatus for structuring documents utilizing recognition of an ordered sequence of identifiers

Patent number: 7991709

Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.

Type: Grant

Filed: January 28, 2008

Date of Patent: August 2, 2011

Assignee: Xerox Corporation

Inventors: Herve Dejean, Jean-Luc Meunier
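
The abstract of patent 7991709 above names a longest increasing sub-sequence as the optimal sub-sequence of enumerated terms. As a minimal sketch of that one step only, assuming the terms have already been reduced to integers, the standard patience-sorting technique can be written as follows. The function name `longest_increasing_subsequence` and the sample data are illustrative assumptions, not taken from the patent.

```python
from bisect import bisect_left


def longest_increasing_subsequence(terms):
    """Return the indices of one longest strictly increasing subsequence.

    Assumption: `terms` is a list of integers obtained by enumerating
    text fragments (for example, the leading number of each fragment).
    """
    tails = []        # tails[k]: index of the smallest tail of a run of length k + 1
    tail_values = []  # values at those indices, kept sorted for bisect
    prev = [None] * len(terms)  # back-pointers used to rebuild the subsequence

    for i, value in enumerate(terms):
        k = bisect_left(tail_values, value)
        if k == len(tails):
            tails.append(i)
            tail_values.append(value)
        else:
            tails[k] = i
            tail_values[k] = value
        prev[i] = tails[k - 1] if k > 0 else None

    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tails[-1] if tails else None
    while i is not None:
        result.append(i)
        i = prev[i]
    return result[::-1]


# Hypothetical fragment numbers: 7 and 1995 are noise between section numbers.
terms = [1, 2, 7, 3, 1995, 4, 5]
indices = longest_increasing_subsequence(terms)
print([terms[i] for i in indices])
```

In this sketch the noisy values (a stray 7 and a year) are skipped because they do not fit the longest increasing run, which is the intuition behind using the sub-sequence to recover an ordered sequence of identifiers.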

METHOD AND APPARATUS FOR DETECTING PAGINATION CONSTRUCTS INCLUDING A HEADER AND A FOOTER IN LEGACY DOCUMENTS

Publication number: 20110145701

Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line, is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.

Type: Application

Filed: February 23, 2011

Publication date: June 16, 2011

Applicant: XEROX CORPORATION

Inventors: Hervé Déjean, Jean-Luc Meunier
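
The abstract of application 20110145701 above defines header/footer zones as content with low textual variability. The following is a rough sketch of that idea under simplifying assumptions: each page is a list of text lines from top to bottom, digits are normalized so that changing page numbers do not count as variation, and variability is the share of distinct normalized lines at a given position. The names `textual_variability` and `find_pagination_positions`, the `max_depth` and `threshold` parameters, and the sample pages are all hypothetical.

```python
import re


def textual_variability(lines):
    """Share of distinct lines after normalizing case and digit runs."""
    normalized = {re.sub(r"\d+", "#", line.strip().lower()) for line in lines}
    return len(normalized) / len(lines)


def find_pagination_positions(pages, max_depth=2, threshold=0.5):
    """Return line positions (0 = first line, -1 = last line) whose content
    varies little across pages and is therefore a header/footer candidate."""
    shortest = min(len(page) for page in pages)
    depth = min(max_depth, shortest // 2)  # never let top and bottom overlap
    positions = []
    for k in range(depth):
        for pos in (k, -1 - k):
            lines = [page[pos] for page in pages]
            if textual_variability(lines) <= threshold:
                positions.append(pos)
    return positions


pages = [
    ["Annual Report 2004", "Revenue grew in the first quarter.", "Costs fell.", "Page 1"],
    ["Annual Report 2004", "The second quarter was flat.", "Margins held.", "Page 2"],
    ["Annual Report 2004", "Outlook remains positive.", "Risks are listed below.", "Page 3"],
]
print(find_pagination_positions(pages))
```

Here the first and last lines are flagged because they are identical across pages once digits are normalized, while body lines differ on every page.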

Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents

Patent number: 7937653

Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line, is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.

Type: Grant

Filed: January 10, 2005

Date of Patent: May 3, 2011

Assignee: Xerox Corporation

Inventors: Hervé Déjean, Jean-Luc Meunier

Captions detector

Patent number: 7852499

Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.

Type: Grant

Filed: September 27, 2006

Date of Patent: December 14, 2010

Assignee: Xerox Corporation

Inventor: Hervé Déjean

NUMBER SEQUENCES DETECTION SYSTEMS AND METHODS

Publication number: 20100306260

Abstract: Numbered sequences detection includes (i) extracting one or more numbered item token patterns from a document comprising an ordered sequence of text units, each numbered item token pattern including an incremental portion and a fixed portion that matches at least one text unit of the document and (ii) identifying at least one numbered sequence in the document conforming with a matching numbered item token pattern of the extracted one or more numbered item token patterns. The identified at least one numbered sequence comprises an ordered sub-sequence of text units of the document that match the matching numbered item token pattern. The detection may further comprise determining that a second type of numbered sequence nests in the document between consecutive text units belonging to a numbered sequence of a first type, and optimizing one or more numbered sequences of the second type based on information provided by the determining.

Type: Application

Filed: May 29, 2009

Publication date: December 2, 2010

Applicant: Xerox Corporation

Inventor: Herve Dejean
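
The abstract of application 20100306260 above describes numbered item token patterns with a fixed portion and an incremental portion. A simplified sketch of that idea, assuming Arabic numerals only and an increment of exactly one, might look like this. The regular expression `ITEM`, the `#` placeholder, the function `find_numbered_sequences`, and the sample text units are illustrative assumptions; nesting of sequences, which the abstract also mentions, is not handled.

```python
import re
from collections import defaultdict

# Fixed prefix, incremental number, optional fixed punctuation suffix.
ITEM = re.compile(r"(\D*?)(\d+)([.):]?)")


def find_numbered_sequences(text_units, min_length=2):
    """Group text units by token pattern, then keep runs that count up by one.

    Returns a dict mapping each pattern (number replaced by '#') to a list
    of runs, where each run is a list of positions in `text_units`.
    """
    items_by_pattern = defaultdict(list)
    for position, unit in enumerate(text_units):
        match = ITEM.match(unit)
        if match:
            prefix, number, suffix = match.groups()
            items_by_pattern[f"{prefix}#{suffix}"].append((position, int(number)))

    sequences = {}
    for pattern, items in items_by_pattern.items():
        runs, run, last = [], [], None
        for position, number in items:
            if run and number == last + 1:
                run.append(position)        # continues the current run
            else:
                if len(run) >= min_length:
                    runs.append(run)
                run = [position]            # start a new run
            last = number
        if len(run) >= min_length:
            runs.append(run)
        if runs:
            sequences[pattern] = runs
    return sequences


units = [
    "1. Introduction",
    "Some text from 2009.",
    "2. Method",
    "Figure 1: Layout",
    "3. Results",
    "Figure 2: Scores",
]
print(find_numbered_sequences(units))
```

In this example the section headings and the figure captions fall under different patterns, so each forms its own sequence, and the sentence containing a year is discarded because its pattern never repeats.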

Text correction for PDF converters

Patent number: 7827484

Abstract: To correct at least one extraneous or missing space in a document, weights are assigned to tokens contained in a dictionary. Each token is defined by an ordered sequence of non-space symbols. The weights are assigned based on at least one of a token length and frequency of occurrence of the token in the document. Corrected text is generated from text of the document by applying an ordered sequence of symbol-level transformations selected from a group of symbol-level transformations including at least (i) deleting a space, (ii) inserting a space, and (iii) copying a symbol. The ordered sequence of symbol-level transformations is optimized respective to an objective function dependent upon the weights of tokens of the corrected text.

Type: Grant

Filed: September 2, 2005

Date of Patent: November 2, 2010

Assignee: Xerox Corporation

Inventors: Hervé Déjean, André Kempe
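
The abstract of patent 7827484 above describes choosing a sequence of delete-space, insert-space, and copy-symbol operations that optimizes an objective built from token weights. If every space may be deleted and a space may be inserted anywhere, the reachable outputs are exactly the segmentations of the non-space symbols, so the search can be sketched as a dynamic program over split points. This is only an illustration under stated assumptions: the weight is the squared token length, unknown single symbols carry a fixed penalty, and the names `correct_spaces`, `UNKNOWN_PENALTY`, and the sample dictionary are hypothetical.

```python
UNKNOWN_PENALTY = -1.0  # assumed cost of a symbol not covered by any dictionary token


def correct_spaces(text, weights):
    """Re-segment `text` so that the summed token weights are maximal.

    `weights` maps dictionary tokens (no spaces inside) to a numeric weight.
    """
    symbols = text.replace(" ", "")          # copy symbols, drop every space
    n = len(symbols)
    max_len = max(map(len, weights), default=1)

    best = [0.0] + [float("-inf")] * n       # best[i]: best score for symbols[:i]
    back = [0] * (n + 1)                     # back[i]: start of the last token
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            token = symbols[start:end]
            if token in weights:
                score = best[start] + weights[token]
            elif end - start == 1:
                score = best[start] + UNKNOWN_PENALTY
            else:
                continue
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Rebuild the tokens by following the back-pointers.
    tokens = []
    end = n
    while end > 0:
        tokens.append(symbols[back[end]:end])
        end = back[end]
    return " ".join(reversed(tokens))


# Assumed weighting: longer dictionary tokens are preferred (length squared).
dictionary = ["the", "document", "is", "correct", "doc", "cor"]
weights = {token: len(token) ** 2 for token in dictionary}
print(correct_spaces("thedocu ment is cor rect", weights))
```

Squaring the length is one simple way to make a single long token outweigh several short ones; a frequency-based weight, which the abstract also allows, could be substituted without changing the dynamic program.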

Personal information retrieval using knowledge bases for optical character recognition correction

Patent number: 7826665

Abstract: In a system for updating a contacts database (42, 46), a portable imager (12) acquires a digital business card image (10). An image segmenter (16) extracts text image segments from the digital business card image. An optical character recognizer (OCR) (26) generates one or more textual content candidates for each text image segment. A scoring processor (36) scores each textual content candidate based on results of database queries respective to the textual content candidates. A content selector (38) selects a textual content candidate for each text image segment based at least on the assigned scores. An interface (50) is configured to update the contacts list based on the selected textual content candidates.

Type: Grant

Filed: December 12, 2005

Date of Patent: November 2, 2010

Assignee: Xerox Corporation

Inventors: Marco Bressan, Hervé Dejean, Christopher R. Dance

Versatile page number detector

Patent number: 7797622

Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.

Type: Grant

Filed: November 15, 2006

Date of Patent: September 14, 2010

Assignee: Xerox Corporation

Inventors: Hervé Déjean, Jean-Luc Meunier
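
The abstract of patent 7797622 above describes finding sequences of terms that follow a numbering scheme and cover the pages of a document. A much reduced sketch, assuming a single Arabic-numeral scheme that increases by one per page, is to look for the constant offset between a candidate number and its page index that covers the most pages. The function name `detect_page_numbers` and the sample pages are illustrative assumptions; the patent's handling of several numbering schemes and several sequences is not reproduced here.

```python
import re
from collections import defaultdict


def detect_page_numbers(pages):
    """Map page index -> detected page number.

    `pages` is a list of pages, each a list of text fragments. A fragment is
    a candidate if it consists only of ASCII digits.
    """
    pages_by_offset = defaultdict(set)
    for index, fragments in enumerate(pages):
        for fragment in fragments:
            text = fragment.strip()
            if re.fullmatch(r"[0-9]+", text):
                # Numbers that rise by one per page share the same offset.
                pages_by_offset[int(text) - index].add(index)

    if not pages_by_offset:
        return {}
    offset, covered = max(pages_by_offset.items(), key=lambda item: len(item[1]))
    return {index: index + offset for index in sorted(covered)}


pages = [
    ["Chapter 1", "12"],
    ["2005", "13", "some text"],
    ["no number here"],
    ["15", "3"],
]
print(detect_page_numbers(pages))
```

In this sketch the third page has no number of its own; the abstract's final sentence suggests that such gaps can be filled by looking at features of the terms already accepted, which is left out here.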

Smart string replacement

Patent number: 7788085

Abstract: String replacement is performed in text using linguistic processing. The linguistic processing identifies the existence of direct or indirect links between the string to be replaced and other strings in the text. Morphological, syntactic, anaphoric, or semantic inconsistencies, which are introduced in strings with the identified direct or indirect links to the string that is to be replaced, are detected and corrected.

Type: Grant

Filed: December 17, 2004

Date of Patent: August 31, 2010

Assignee: Xerox Corporation

Inventors: Caroline Brun, Herve Dejean, Caroline Hagege

Table of contents extraction with improved robustness

Patent number: 7743327

Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.

Type: Grant

Filed: February 23, 2006

Date of Patent: June 22, 2010

Assignee: Xerox Corporation

Inventors: Jean-Luc Meunier, Hervé Déjean

Systems and methods for converting legacy and proprietary documents into extended mark-up language format

Patent number: 7730396

Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.

Type: Grant

Filed: November 13, 2006

Date of Patent: June 1, 2010

Assignee: Xerox Corporation

Inventors: Boris Chidlovskii, Hervé Dejean

METHODS AND APPARATUSES FOR INTRA-DOCUMENT REFERENCE IDENTIFICATION AND RESOLUTION

Publication number: 20100107045

Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.

Type: Application

Filed: October 27, 2008

Publication date: April 29, 2010

Applicant: XEROX CORPORATION

Inventors: Katja Filippova, Herve Dejean

Method and apparatus for structuring documents based on layout, content and collection

Patent number: 7693848

Abstract: A method and apparatus are provided for converting a document in a first format essentially comprising a flat layout structure into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to common textual properties and organizational content common in documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format, and the document content elements are organized by hierarchical dependency from the predetermined categories, wherein the organized document elements comprise the desired structured document format.

Type: Grant

Filed: January 10, 2005

Date of Patent: April 6, 2010

Assignee: Xerox Corporation

Inventors: Hervé Déjean, Veronika Lux, Sandrine Ribeau

Methods and apparatuses for identifying bilingual lexicons in comparable corpora using geometric processing

Patent number: 7620539

Abstract: Various methods formulated using a geometric interpretation for identifying bilingual pairs in comparable corpora using a bilingual dictionary are disclosed. The methods may be used separately or in combination to compute the similarity between bilingual pairs.

Type: Grant

Filed: November 1, 2004

Date of Patent: November 17, 2009

Assignee: Xerox Corporation

Inventors: Eric Gaussier, Jean-Michel Renders, Herve Dejean, Cyril Goutte, Irina Matveeva

METHOD AND APPARATUS FOR STRUCTURING DOCUMENTS UTILIZING RECOGNITION OF AN ORDERED SEQUENCE OF IDENTIFIERS

Publication number: 20090192956

Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.

Type: Application

Filed: January 28, 2008

Publication date: July 30, 2009

Applicant: XEROX CORPORATION

Inventors: Herve Dejean, Jean-Luc Meunier

TABLE OF CONTENTS EXTRACTION BASED ON TEXTUAL SIMILARITY AND FORMAL ASPECTS

Publication number: 20090110268

Abstract: An initial organizational table for a document is determined based on textual similarity between entries of the organizational table and target text fragments and not taking into account text formatting. A classifier is trained to identify text fragment pairs consisting of entries of the organizational table and corresponding target text fragments based at least in part on text formatting features. The training employs a training set of examples annotated based on the initial organizational table. The initial organizational table is updated using the trained classifier.

Type: Application

Filed: October 25, 2007

Publication date: April 30, 2009

Applicant: XEROX CORPORATION

Inventors: Herve Dejean, Jean-Luc Meunier

Systems and methods for notes detection

Publication number: 20090046918

Abstract: To perform notes detection, candidate reference marks are identified in a document. A starting note zone is identified in the document. A pair of similar reference marks is identified from the candidate reference marks including a first reference mark in the note zone and a second reference mark outside the note zone. The document is marked up to indicate a note associated with the first and second reference marks.

Type: Application

Filed: August 13, 2007

Publication date: February 19, 2009

Inventor: Herve Dejean

Versatile page number detector

Publication number: 20080114757

Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.

Type: Application

Filed: November 15, 2006

Publication date: May 15, 2008

Inventors: Herve Dejean, Jean-Luc Meunier
