Patents by Inventor Hervé Déjean

Hervé Déjean has filed for patents to protect the following inventions. This listing includes pending patent applications as well as patents already granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20110276874
    Abstract: A computer-implemented method and system for generation of page templates are provided. The method includes providing a document in computer memory. Using a computer processor, page elements within the document are identified and labeled. For each page of the document, a set of geometric relations between pairs of page elements co-occurring on the page is computed, and the set of geometric relations is associated with the page. The method also includes generating a set of page template candidates based at least in part on the computed geometric relations, selecting page templates from the set of page template candidates, and outputting the selected page templates.
    Type: Application
    Filed: May 4, 2010
    Publication date: November 10, 2011
    Applicant: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 8023740
    Abstract: To perform notes detection, candidate reference marks are identified in a document. A starting note zone is identified in the document. A pair of similar reference marks is identified from the candidate reference marks including a first reference mark in the note zone and a second reference mark outside the note zone. The document is marked up to indicate a note associated with the first and second reference marks.
    Type: Grant
    Filed: August 13, 2007
    Date of Patent: September 20, 2011
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Patent number: 7991709
    Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
    Type: Grant
    Filed: January 28, 2008
    Date of Patent: August 2, 2011
    Assignee: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20110145701
    Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
    Type: Application
    Filed: February 23, 2011
    Publication date: June 16, 2011
    Applicant: Xerox Corporation
    Inventors: Hervé Déjean, Jean-Luc Meunier
  • Patent number: 7937653
    Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
    Type: Grant
    Filed: January 10, 2005
    Date of Patent: May 3, 2011
    Assignee: Xerox Corporation
    Inventors: Hervé Déjean, Jean-Luc Meunier
  • Patent number: 7852499
    Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.
    Type: Grant
    Filed: September 27, 2006
    Date of Patent: December 14, 2010
    Assignee: Xerox Corporation
    Inventor: Hervé Déjean
  • Publication number: 20100306260
    Abstract: Numbered sequences detection includes (i) extracting one or more numbered item token patterns from a document comprising an ordered sequence of text units, each numbered item token pattern including an incremental portion and a fixed portion that matches at least one text unit of the document and (ii) identifying at least one numbered sequence in the document conforming with a matching numbered item token pattern of the extracted one or more numbered item token patterns. The identified at least one numbered sequence comprises an ordered sub-sequence of text units of the document that match the matching numbered item token pattern. The detection may further comprise determining that a second type of numbered sequence nests in the document between consecutive text units belonging to a numbered sequence of a first type, and optimizing one or more numbered sequences of the second type based on information provided by the determining.
    Type: Application
    Filed: May 29, 2009
    Publication date: December 2, 2010
    Applicant: Xerox Corporation
    Inventor: Herve Dejean
  • Patent number: 7827484
    Abstract: To correct at least one extraneous or missing space in a document, weights are assigned to tokens contained in a dictionary. Each token is defined by an ordered sequence of non-space symbols. The weights are assigned based on at least one of a token length and frequency of occurrence of the token in the document. Corrected text is generated from text of the document by applying an ordered sequence of symbol-level transformations selected from a group of symbol-level transformations including at least (i) deleting a space, (ii) inserting a space, and (iii) copying a symbol. The ordered sequence of symbol-level transformations is optimized respective to an objective function dependent upon the weights of tokens of the corrected text.
    Type: Grant
    Filed: September 2, 2005
    Date of Patent: November 2, 2010
    Assignee: Xerox Corporation
    Inventors: Hervé Déjean, André Kempe
  • Patent number: 7826665
    Abstract: In a system for updating a contacts database (42, 46), a portable imager (12) acquires a digital business card image (10). An image segmenter (16) extracts text image segments from the digital business card image. An optical character recognizer (OCR) (26) generates one or more textual content candidates for each text image segment. A scoring processor (36) scores each textual content candidate based on results of database queries respective to the textual content candidates. A content selector (38) selects a textual content candidate for each text image segment based at least on the assigned scores. An interface (50) is configured to update the contacts list based on the selected textual content candidates.
    Type: Grant
    Filed: December 12, 2005
    Date of Patent: November 2, 2010
    Assignee: Xerox Corporation
    Inventors: Marco Bressan, Hervé Dejean, Christopher R. Dance
  • Patent number: 7797622
    Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.
    Type: Grant
    Filed: November 15, 2006
    Date of Patent: September 14, 2010
    Assignee: Xerox Corporation
    Inventors: Hervé Déjean, Jean-Luc Meunier
  • Patent number: 7788085
    Abstract: String replacement is performed in text using linguistic processing. The linguistic processing identifies the existence of direct or indirect links between the string to be replaced and other strings in the text. Morphological, syntactic, anaphoric, or semantic inconsistencies, which are introduced in strings with the identified direct or indirect links to the string that is to be replaced are detected and corrected.
    Type: Grant
    Filed: December 17, 2004
    Date of Patent: August 31, 2010
    Assignee: Xerox Corporation
    Inventors: Caroline Brun, Herve Dejean, Caroline Hagege
  • Patent number: 7743327
    Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of contents entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of contents entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.
    Type: Grant
    Filed: February 23, 2006
    Date of Patent: June 22, 2010
    Assignee: Xerox Corporation
    Inventors: Jean-Luc Meunier, Hervé Déjean
  • Patent number: 7730396
    Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
    Type: Grant
    Filed: November 13, 2006
    Date of Patent: June 1, 2010
    Assignee: Xerox Corporation
    Inventors: Boris Chidlovskii, Hervé Dejean
  • Publication number: 20100107045
    Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.
    Type: Application
    Filed: October 27, 2008
    Publication date: April 29, 2010
    Applicant: Xerox Corporation
    Inventors: Katja Filippova, Herve Dejean
  • Patent number: 7693848
    Abstract: A method and apparatus is provided for converting a document in a first format essentially comprising a flat layout structure into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to common textual properties organizational content common in documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format and the document content elements are organized by hierarchical dependency from the predetermined categories wherein the organized document elements comprise the desired structured document format.
    Type: Grant
    Filed: January 10, 2005
    Date of Patent: April 6, 2010
    Assignee: Xerox Corporation
    Inventors: Hervé Déjean, Veronika Lux, Sandrine Ribeau
  • Patent number: 7620539
    Abstract: Various methods formulated using a geometric interpretation for identifying bilingual pairs in comparable corpora using a bilingual dictionary are disclosed. The methods may be used separately or in combination to compute the similarity between bilingual pairs.
    Type: Grant
    Filed: November 1, 2004
    Date of Patent: November 17, 2009
    Assignee: Xerox Corporation
    Inventors: Eric Gaussier, Jean-Michel Renders, Herve Dejean, Cyril Goutte, Irina Matveeva
  • Publication number: 20090192956
    Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
    Type: Application
    Filed: January 28, 2008
    Publication date: July 30, 2009
    Applicant: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20090110268
    Abstract: An initial organizational table for a document is determined based on textual similarity between entries of the organizational table and target text fragments and not taking into account text formatting. A classifier is trained to identify text fragment pairs consisting of entries of the organizational table and corresponding target text fragments based at least in part on text formatting features. The training employs a training set of examples annotated based on the initial organizational table. The initial organizational table is updated using the trained classifier.
    Type: Application
    Filed: October 25, 2007
    Publication date: April 30, 2009
    Applicant: Xerox Corporation
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20090046918
    Abstract: To perform notes detection, candidate reference marks are identified in a document. A starting note zone is identified in the document. A pair of similar reference marks is identified from the candidate reference marks including a first reference mark in the note zone and a second reference mark outside the note zone. The document is marked up to indicate a note associated with the first and second reference marks.
    Type: Application
    Filed: August 13, 2007
    Publication date: February 19, 2009
    Inventor: Herve Dejean
  • Publication number: 20080114757
    Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.
    Type: Application
    Filed: November 15, 2006
    Publication date: May 15, 2008
    Inventors: Herve Dejean, Jean-Luc Meunier
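
Illustrative note: the abstracts for patent 7991709 and publication 20090192956 above describe enumerating text fragments into a sequence of terms and identifying a longest increasing sub-sequence among them. The Python sketch below shows only that generic step, using a standard binary-search approach. It is not taken from the patents and is not the patented method; the function name, the integer-only terms, and the sample data are assumptions made for illustration.

```python
from bisect import bisect_left


def longest_increasing_subsequence(terms):
    """Return the indices of one longest strictly increasing sub-sequence.

    `terms` is assumed to be a list of integers already extracted from
    text fragments (a simplification; the abstracts do not specify this).
    """
    tail_index = []   # tail_index[k]: index of the smallest tail of a run of length k + 1
    tail_value = []   # values parallel to tail_index, kept sorted for bisect
    prev = [None] * len(terms)  # back-pointers used to rebuild the sub-sequence

    for i, value in enumerate(terms):
        k = bisect_left(tail_value, value)  # first tail that is >= value
        if k > 0:
            prev[i] = tail_index[k - 1]
        if k == len(tail_value):
            tail_index.append(i)
            tail_value.append(value)
        else:
            tail_index[k] = i
            tail_value[k] = value

    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_index[-1] if tail_index else None
    while i is not None:
        result.append(i)
        i = prev[i]
    return result[::-1]


# Hypothetical terms: section numbers 1..5 interleaved with unrelated numbers.
sample = [1, 2, 1999, 3, 4, 12, 5]
picked = longest_increasing_subsequence(sample)
print([sample[i] for i in picked])  # prints [1, 2, 3, 4, 5]
```

In this sample the unrelated numbers 1999 and 12 are skipped, leaving the run 1 to 5. How terms are actually derived from fragments, and how the resulting sub-sequences are turned into tags, is described only at the level of the abstracts and is not modeled here.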