Patents by Inventor Herve Dejean

Herve Dejean has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20080077847
    Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.
    Type: Application
    Filed: September 27, 2006
    Publication date: March 27, 2008
    Inventor: Herve Dejean
  • Publication number: 20080065671
    Abstract: A document (10) includes one or more organizational tables (40). Each organizational table includes a substantially contiguous sub-set of text fragments of the document identified as entries of the organizational table, and each entry has an associated linked text fragment. An organizational tables scorer (42) assigns a score to each of the one or more organizational tables respective to at least one object type based on a scoring criterion for that object type. An organizational tables labeler (44) assigns a table type label to each of the one or more organizational tables based on the scores.
    Type: Application
    Filed: September 7, 2006
    Publication date: March 13, 2008
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Patent number: 7296223
    Abstract: A method for creating a structured document, wherein a structured document comprises a plurality of content elements wrapped in pairs of tags, includes parsing a document of a particular type containing content into a plurality of content elements; and for each content element, suggesting an optimal tag according to a tag suggestion procedure. The tag suggestion procedure includes providing sample data which has been converted into a structured sample document; deriving a set of tags from the structured sample document; evaluating the set of tags according to tag suggestion criteria to determine an optimal tag for the content element. The optimal tag may be a single tag or a pattern of tags which maximizes a similarity function with patterns found in the sample data.
    Type: Grant
    Filed: June 27, 2003
    Date of Patent: November 13, 2007
    Assignee: Xerox Corporation
    Inventors: Boris Chidlovskii, Hervé Déjean
  • Publication number: 20070196015
    Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.
    Type: Application
    Filed: February 23, 2006
    Publication date: August 23, 2007
    Inventors: Jean-Luc Meunier, Herve Dejean
  • Publication number: 20070133874
    Abstract: In a system for updating a contacts database (42, 46), a portable imager (12) acquires a digital business card image (10). An image segmenter (16) extracts text image segments from the digital business card image. An optical character recognizer (OCR) (26) generates one or more textual content candidates for each text image segment. A scoring processor (36) scores each textual content candidate based on results of database queries respective to the textual content candidates. A content selector (38) selects a textual content candidate for each text image segment based at least on the assigned scores. An interface (50) is configured to update the contacts list based on the selected textual content candidates.
    Type: Application
    Filed: December 12, 2005
    Publication date: June 14, 2007
    Inventors: Marco Bressan, Herve Dejean, Christopher Dance
  • Publication number: 20070094201
    Abstract: In a rule induction method, an overbroad candidate rule is selected for categorizing a node to be categorized. The candidate rule is specialized by: (i) adding a rule node corresponding to a node level of structured training examples; (ii) including in a rule node a rule pertaining to an attribute of at least one node of the corresponding node level to produce a specialized candidate rule; and (iii) evaluating the specialized candidate rule respective to the structured training examples.
    Type: Application
    Filed: September 23, 2005
    Publication date: April 26, 2007
    Inventor: Herve Dejean
  • Publication number: 20070061713
    Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
    Type: Application
    Filed: November 13, 2006
    Publication date: March 15, 2007
    Applicant: Xerox Corporation
    Inventors: Boris Chidlovskii, Herve Dejean
  • Publication number: 20070055933
    Abstract: To correct at least one extraneous or missing space in a document, weights are assigned to tokens contained in a dictionary. Each token is defined by an ordered sequence of non-space symbols. The weights are assigned based on at least one of a token length and frequency of occurrence of the token in the document. Corrected text is generated from text of the document by applying an ordered sequence of symbol-level transformations selected from a group of symbol-level transformations including at least (i) deleting a space, (ii) inserting a space, and (iii) copying a symbol. The ordered sequence of symbol-level transformations is optimized respective to an objective function dependent upon the weights of tokens of the corrected text.
    Type: Application
    Filed: September 2, 2005
    Publication date: March 8, 2007
    Inventors: Herve Dejean, Andre Kempe
  • Patent number: 7165216
    Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
    Type: Grant
    Filed: January 14, 2004
    Date of Patent: January 16, 2007
    Assignee: Xerox Corporation
    Inventors: Boris Chidlovskii, Herve Dejean
  • Publication number: 20060248070
    Abstract: A document is organized as a plurality of nodes associated with a table of contents. The nodes are clustered into a plurality of clusters based on a similarity criterion. One of the clusters is identified as corresponding to a highest or lowest level of the table of contents based on a selection criterion. The highest or lowest level is assigned to the nodes belonging to the identified cluster. The identifying and assigning are repeated to assign levels to the nodes belonging to each next highest or lowest level of the table of contents. The repeated identifying is based on the selection criteria applied disregarding nodes that have already been assigned a level. The document is structured based at least in part on the levels assigned to the table of contents nodes.
    Type: Application
    Filed: April 27, 2005
    Publication date: November 2, 2006
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20060156226
    Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line, is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
    Type: Application
    Filed: January 10, 2005
    Publication date: July 13, 2006
    Inventors: Herve Dejean, Jean-Luc Meunier
  • Publication number: 20060155703
    Abstract: In a method for identifying a table of contents in a document, an ordered sequence of text fragments is derived from the document. A table of contents is selected as a contiguous sub-sequence of the ordered sequence of text fragments satisfying the criteria: (i) entries defined by text fragments of the table of contents each have a link to a target text fragment having textual similarity with the entry; (ii) no target text fragment lies within the table of contents; and (iii) the target text fragments have an ascending ordering corresponding to an ascending ordering of the entries defining the target text fragments.
    Type: Application
    Filed: January 10, 2005
    Publication date: July 13, 2006
    Inventors: Herve Dejean, Jean-Luc Meunier, Olivier Fambon
  • Publication number: 20060155700
    Abstract: A method and apparatus is provided for converting a document in a first format essentially comprising a flat layout structure into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to common textual properties organizational content common in documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format and the document content elements are organized by hierarchical dependency from the predetermined categories wherein the organized document elements comprise the desired structured document format.
    Type: Application
    Filed: January 10, 2005
    Publication date: July 13, 2006
    Inventors: Herve Dejean, Veronika Lux, Sandrine Ribeau
  • Publication number: 20060136352
    Abstract: String replacement is performed in text using linguistic processing. The linguistic processing identifies the existence of direct or indirect links between the string to be replaced and other strings in the text. Morphological, syntactic, anaphoric, or semantic inconsistencies, which are introduced in strings with the identified direct or indirect links to the string that is to be replaced, are detected and corrected.
    Type: Application
    Filed: December 17, 2004
    Publication date: June 22, 2006
    Inventors: Caroline Brun, Herve Dejean, Caroline Hagege
  • Publication number: 20060009963
    Abstract: Various methods formulated using a geometric interpretation for identifying bilingual pairs in comparable corpora using a bilingual dictionary are disclosed. The methods may be used separately or in combination to compute the similarity between bilingual pairs.
    Type: Application
    Filed: November 1, 2004
    Publication date: January 12, 2006
    Inventors: Eric Gaussier, Jean-Michel Renders, Herve Dejean, Cyril Goutte, Irina Matveeva
  • Publication number: 20050154979
    Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
    Type: Application
    Filed: January 14, 2004
    Publication date: July 14, 2005
    Applicant: XEROX CORPORATION
    Inventors: Boris Chidlovskii, Herve Dejean
  • Publication number: 20040268236
    Abstract: A method for creating a structured document, wherein a structured document comprises a plurality of content elements wrapped in pairs of tags, includes parsing a document of a particular type containing content into a plurality of content elements; and for each content element, suggesting an optimal tag according to a tag suggestion procedure. The tag suggestion procedure includes providing sample data which has been converted into a structured sample document; deriving a set of tags from the structured sample document; evaluating the set of tags according to tag suggestion criteria to determine an optimal tag for the content element. The optimal tag may be a single tag or a pattern of tags which maximizes a similarity function with patterns found in the sample data.
    Type: Application
    Filed: June 27, 2003
    Publication date: December 30, 2004
    Applicant: Xerox Corporation
    Inventors: Boris Chidlovskii, Herve Dejean
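
Illustrative sketch (not taken from any of the filings): the abstract of publication 20070055933 describes correcting extraneous or missing spaces by weighting dictionary tokens and optimizing a sequence of space deletions, space insertions, and symbol copies. The Python below is one simplified way such an optimization could look. It deletes every space, then uses dynamic programming to re-insert spaces so that the total token weight is maximal. The function names, the weight formula (length squared times one plus the in-document count), and the penalty for out-of-dictionary runs are all assumptions made for this example.

```python
from collections import Counter


def token_weights(dictionary, document_text):
    """Assumed weighting: longer and more frequent tokens score higher."""
    counts = Counter(document_text.split())
    return {tok: len(tok) ** 2 * (1 + counts[tok]) for tok in dictionary}


def correct_spaces(text, weights):
    """Re-segment text so that the summed token weight is maximal.

    Simplification: every space is deleted first, then spaces are
    re-inserted; all other symbols are copied unchanged.
    """
    symbols = text.replace(" ", "")
    n = len(symbols)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for symbols[:i]
    back = [0] * (n + 1)              # back[i]: start of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = symbols[start:end]
            if piece in weights:
                score = weights[piece]
            else:
                # Hypothetical penalty: unknown runs are allowed but cost
                # something, so dictionary tokens are preferred.
                score = -(1 + len(piece))
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers to recover the chosen tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(symbols[back[pos]:pos])
        pos = back[pos]
    return " ".join(reversed(tokens))


if __name__ == "__main__":
    noisy = "the tab le ofcontents is det ected"
    vocab = {"the", "table", "of", "contents", "is", "detected"}
    print(correct_spaces(noisy, token_weights(vocab, noisy)))
```

Because this sketch discards the original spacing entirely, it cannot distinguish cases such as "in to" and "into" except through the weights; a fuller treatment would presumably attach costs to each individual transformation, as the abstract's wording suggests.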