Patents by Inventor Herve Dejean

Herve Dejean has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Captions detector

Publication number: 20080077847

Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.
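Illustration (not from the patent text): a minimal sketch of the signature idea in the abstract above. It assumes that a signature is the pair (font, size), that nearness to an object of interest is already computed, and that "a substantial number" means a fixed threshold. The names Fragment, signature, detect_captions and min_count are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    text: str
    font: str
    size: float
    near_object: bool  # assumed to be precomputed from the page layout


def signature(fragment):
    # Assumed representation: the fragment's value for (font, size).
    return (fragment.font, fragment.size)


def detect_captions(fragments, min_count=2):
    # Count how often each signature occurs near an object of interest.
    near_counts = Counter(signature(f) for f in fragments if f.near_object)
    # A caption signature is one seen near objects at least min_count times.
    caption_signatures = {s for s, n in near_counts.items() if n >= min_count}
    # Captions are all fragments carrying a caption signature.
    return [f for f in fragments if signature(f) in caption_signatures]
```

In this sketch a fragment that is not itself near an object is still reported as a caption when it shares a caption signature, which matches the last sentence of the abstract.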

Type: Application

Filed: September 27, 2006

Publication date: March 27, 2008

Inventor: Herve Dejean

Methods and apparatuses for detecting and labeling organizational tables in a document

Publication number: 20080065671

Abstract: A document (10) includes one or more organizational tables (40). Each organizational table includes a substantially contiguous sub-set of text fragments of the document identified as entries of the organizational table, and each entry has an associated linked text fragment. An organizational tables scorer (42) assigns a score to each of the one or more organizational tables respective to at least one object type based on a scoring criterion for that object type. An organizational tables labeler (44) assigns a table type label to each of the one or more organizational tables based on the scores.

Type: Application

Filed: September 7, 2006

Publication date: March 13, 2008

Inventors: Herve Dejean, Jean-Luc Meunier

System and method for structured document authoring

Patent number: 7296223

Abstract: A method for creating a structured document, wherein a structured document comprises a plurality of content elements wrapped in pairs of tags, includes parsing a document of a particular type containing content into a plurality of content elements; and for each content element, suggesting an optimal tag according to a tag suggestion procedure. The tag suggestion procedure includes providing sample data which has been converted into a structured sample document; deriving a set of tags from the structured sample document; evaluating the set of tags according to tag suggestion criteria to determine an optimal tag for the content element. The optimal tag may be a single tag or a pattern of tags which maximizes a similarity function with patterns found in the sample data.

Type: Grant

Filed: June 27, 2003

Date of Patent: November 13, 2007

Assignee: Xerox Corporation

Inventors: Boris Chidlovskii, Hervé Déjean

Table of contents extraction with improved robustness

Publication number: 20070196015

Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.

Type: Application

Filed: February 23, 2006

Publication date: August 23, 2007

Inventors: Jean-Luc Meunier, Herve Dejean

Personal information retrieval using knowledge bases for optical character recognition correction

Publication number: 20070133874

Abstract: In a system for updating a contacts database (42, 46), a portable imager (12) acquires a digital business card image (10). An image segmenter (16) extracts text image segments from the digital business card image. An optical character recognizer (OCR) (26) generates one or more textual content candidates for each text image segment. A scoring processor (36) scores each textual content candidate based on results of database queries respective to the textual content candidates. A content selector (38) selects a textual content candidate for each text image segment based at least on the assigned scores. An interface (50) is configured to update the contacts list based on the selected textual content candidates.

Type: Application

Filed: December 12, 2005

Publication date: June 14, 2007

Inventors: Marco Bressan, Herve Dejean, Christopher Dance

XML-based architecture for rule induction system

Publication number: 20070094201

Abstract: In a rule induction method, an overbroad candidate rule is selected for categorizing a node to be categorized. The candidate rule is specialized by: (i) adding a rule node corresponding to a node level of structured training examples; (ii) including in a rule node a rule pertaining to an attribute of at least one node of the corresponding node level to produce a specialized candidate rule; and (iii) evaluating the specialized candidate rule respective to the structured training examples.

Type: Application

Filed: September 23, 2005

Publication date: April 26, 2007

Inventor: Herve Dejean

Systems and methods for converting legacy and proprietary documents into extended mark-up language format

Publication number: 20070061713

Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.

Type: Application

Filed: November 13, 2006

Publication date: March 15, 2007

Applicant: Xerox Corporation

Inventors: Boris Chidlovskii, Herve Dejean

Text correction for PDF converters

Publication number: 20070055933

Abstract: To correct at least one extraneous or missing space in a document, weights are assigned to tokens contained in a dictionary. Each token is defined by an ordered sequence of non-space symbols. The weights are assigned based on at least one of a token length and frequency of occurrence of the token in the document. Corrected text is generated from text of the document by applying an ordered sequence of symbol-level transformations selected from a group of symbol-level transformations including at least (i) deleting a space, (ii) inserting a space, and (iii) copying a symbol. The ordered sequence of symbol-level transformations is optimized respective to an objective function dependent upon the weights of tokens of the corrected text.
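Illustration (not from the patent text): a minimal dynamic-programming sketch of the optimization described in the abstract above. It assumes token weights are supplied by the caller, that tokens missing from the dictionary receive a fixed negative weight, and that each inserted or deleted space costs a small penalty. The function name and the parameters unknown_weight and edit_penalty are hypothetical; copying a symbol is free.

```python
def correct_spaces(text, weights, unknown_weight=-1.0, edit_penalty=0.5):
    # Work on the non-space symbols; spacing is decided by the optimization.
    symbols = text.replace(" ", "")
    n = len(symbols)
    if n == 0:
        return ""
    # had_space[k] == 1 when the input had a space just before symbols[k].
    had_space = [0] * (n + 1)
    k = 0
    for ch in text:
        if ch == " ":
            if 0 < k < n:
                had_space[k] = 1
        else:
            k += 1
    # prefix[k] = number of original spaces at boundaries 1..k.
    prefix = [0] * (n + 1)
    for k in range(1, n + 1):
        prefix[k] = prefix[k - 1] + had_space[k]
    # best[j] = best objective value for symbols[:j] with a token ending at j.
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            token = symbols[i:j]
            deleted = prefix[j - 1] - prefix[i]  # spaces deleted inside token
            inserted = 1 if j < n and not had_space[j] else 0  # space added at j
            score = (best[i] + weights.get(token, unknown_weight)
                     - edit_penalty * (deleted + inserted))
            if score > best[j]:
                best[j], back[j] = score, i
    # Follow back-pointers to recover the chosen tokens.
    tokens = []
    j = n
    while j > 0:
        tokens.append(symbols[back[j]:j])
        j = back[j]
    return " ".join(reversed(tokens))
```

For example, with weights equal to token length for the dictionary {"the", "document", "is", "broken"}, the input "the docu ment isbroken" is corrected to "the document is broken". Leading, trailing and repeated spaces are collapsed in this simplified version.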

Type: Application

Filed: September 2, 2005

Publication date: March 8, 2007

Inventors: Herve Dejean, Andre Kempe

Systems and methods for converting legacy and proprietary documents into extended mark-up language format

Patent number: 7165216

Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.

Type: Grant

Filed: January 14, 2004

Date of Patent: January 16, 2007

Assignee: Xerox Corporation

Inventors: Boris Chidlovskii, Herve Dejean

Structuring document based on table of contents

Publication number: 20060248070

Abstract: A document is organized as a plurality of nodes associated with a table of contents. The nodes are clustered into a plurality of clusters based on a similarity criterion. One of the clusters is identified as corresponding to a highest or lowest level of the table of contents based on a selection criterion. The highest or lowest level is assigned to the nodes belonging to the identified cluster. The identifying and assigning are repeated to assign levels to the nodes belonging to each next highest or lowest level of the table of contents. The repeated identifying is based on the selection criteria applied disregarding nodes that have already been assigned a level. The document is structured based at least in part on the levels assigned to the table of contents nodes.

Type: Application

Filed: April 27, 2005

Publication date: November 2, 2006

Inventors: Herve Dejean, Jean-Luc Meunier

Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents

Publication number: 20060156226

Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.

Type: Application

Filed: January 10, 2005

Publication date: July 13, 2006

Inventors: Herve Dejean, Jean-Luc Meunier

Method and apparatus for detecting a table of contents and reference determination

Publication number: 20060155703

Abstract: In a method for identifying a table of contents in a document, an ordered sequence of text fragments is derived from the document. A table of contents is selected as a contiguous sub-sequence of the ordered sequence of text fragments satisfying the criteria: (i) entries defined by text fragments of the table of contents each have a link to a target text fragment having textual similarity with the entry; (ii) no target text fragment lies within the table of contents; and (iii) the target text fragments have an ascending ordering corresponding to an ascending ordering of the entries defining the target text fragments.
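Illustration (not from the patent text): a minimal sketch that checks the three criteria from the abstract above for one candidate range of fragments. It assumes that fragments are addressed by their index in the ordered sequence, that every fragment in the candidate range is an entry, and that the links (built beforehand from textual similarity) are given as a mapping from entry index to target index. The name is_valid_toc and the strict-ascending reading of criterion (iii) are assumptions.

```python
def is_valid_toc(start, end, links):
    """Check a candidate table of contents covering fragments start..end-1.

    links maps an entry's index to the index of its target fragment.
    """
    targets = []
    for entry in range(start, end):
        # (i) every entry must have a link to a target fragment.
        if entry not in links:
            return False
        targets.append(links[entry])
    # (ii) no target may lie inside the table of contents itself.
    if any(start <= t < end for t in targets):
        return False
    # (iii) targets must ascend in the same order as the entries.
    return all(a < b for a, b in zip(targets, targets[1:]))
```

A full detector would, under these assumptions, try many candidate ranges and keep one that passes this check; how candidates are generated and ranked is not covered here.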

Type: Application

Filed: January 10, 2005

Publication date: July 13, 2006

Inventors: Herve Dejean, Jean-Luc Meunier, Olivier Fambon

Method and apparatus for structuring documents based on layout, content and collection

Publication number: 20060155700

Abstract: A method and apparatus is provided for converting a document in a first format essentially comprising a flat layout structure into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to common textual properties organizational content common in documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format and the document content elements are organized by hierarchical dependency from the predetermined categories wherein the organized document elements comprise the desired structured document format.

Type: Application

Filed: January 10, 2005

Publication date: July 13, 2006

Inventors: Herve Dejean, Veronika Lux, Sandrine Ribeau

Smart string replacement

Publication number: 20060136352

Abstract: String replacement is performed in text using linguistic processing. The linguistic processing identifies the existence of direct or indirect links between the string to be replaced and other strings in the text. Morphological, syntactic, anaphoric, or semantic inconsistencies, which are introduced in strings with the identified direct or indirect links to the string that is to be replaced are detected and corrected.

Type: Application

Filed: December 17, 2004

Publication date: June 22, 2006

Inventors: Caroline Brun, Herve Dejean, Caroline Hagege

Method and apparatus for identifying bilingual lexicons in comparable corpora

Publication number: 20060009963

Abstract: Various methods formulated using a geometric interpretation for identifying bilingual pairs in comparable corpora using a bilingual dictionary are disclosed. The methods may be used separately or in combination to compute the similarity between bilingual pairs.

Type: Application

Filed: November 1, 2004

Publication date: January 12, 2006

Inventors: Eric Gaussier, Jean-Michel Renders, Herve Dejean, Cyril Goutte, Irina Matveeva

Systems and methods for converting legacy and proprietary documents into extended mark-up language format

Publication number: 20050154979

Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.

Type: Application

Filed: January 14, 2004

Publication date: July 14, 2005

Applicant: Xerox Corporation

Inventors: Boris Chidlovskii, Herve Dejean

System and method for structured document authoring

Publication number: 20040268236

Abstract: A method for creating a structured document, wherein a structured document comprises a plurality of content elements wrapped in pairs of tags, includes parsing a document of a particular type containing content into a plurality of content elements; and for each content element, suggesting an optimal tag according to a tag suggestion procedure. The tag suggestion procedure includes providing sample data which has been converted into a structured sample document; deriving a set of tags from the structured sample document; evaluating the set of tags according to tag suggestion criteria to determine an optimal tag for the content element. The optimal tag may be a single tag or a pattern of tags which maximizes a similarity function with patterns found in the sample data.

Type: Application

Filed: June 27, 2003

Publication date: December 30, 2004

Applicant: Xerox Corporation

Inventors: Boris Chidlovskii, Herve Dejean
