Patents by Inventor Hervé Déjéan
Hervé Déjéan has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20110276874
Abstract: A computer-implemented method and system for generation of page templates are provided. The method includes providing a document in computer memory. Using a computer processor, page elements within the document are identified and labeled. For each page of the document, a set of geometric relations between pairs of page elements co-occurring on the page is computed, and the set of geometric relations is associated with the page. The method also includes generating a set of page template candidates based at least in part on the computed geometric relations, selecting page templates from the set of page template candidates, and outputting the selected page templates.
Type: Application
Filed: May 4, 2010
Publication date: November 10, 2011
Applicant: Xerox Corporation
Inventor: Hervé Déjean
-
Patent number: 8023740
Abstract: To perform notes detection, candidate reference marks are identified in a document. A starting note zone is identified in the document. A pair of similar reference marks is identified from the candidate reference marks including a first reference mark in the note zone and a second reference mark outside the note zone. The document is marked up to indicate a note associated with the first and second reference marks.
Type: Grant
Filed: August 13, 2007
Date of Patent: September 20, 2011
Assignee: Xerox Corporation
Inventor: Hervé Déjean
-
Patent number: 7991709
Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
Type: Grant
Filed: January 28, 2008
Date of Patent: August 2, 2011
Assignee: Xerox Corporation
Inventors: Herve Dejean, Jean-Luc Meunier
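The "longest increasing sub-sequence" step named in this abstract can be illustrated with a short Python sketch. It is not taken from the patent: it assumes the enumerated terms are already plain integers, and the function name `longest_increasing_subsequence` is hypothetical. It uses the standard patience-sorting technique with back-pointers.

```python
from bisect import bisect_left


def longest_increasing_subsequence(terms):
    """Return the indices of one longest strictly increasing sub-sequence."""
    tails = []       # tails[k]: smallest last value of any increasing run of length k + 1
    tails_idx = []   # index in `terms` of that last value
    prev = [None] * len(terms)  # back-pointers used to rebuild the sub-sequence
    for i, value in enumerate(terms):
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[k] = value
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else None
    # Walk the back-pointers from the end of the longest run.
    result = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        result.append(i)
        i = prev[i]
    return result[::-1]


# Assumed input: numbers pulled from successive text fragments, with noise (7, 1).
terms = [1, 2, 7, 3, 1, 4, 5]
print(longest_increasing_subsequence(terms))  # [0, 1, 3, 5, 6]
```

Returning indices rather than values makes it straightforward to map the selected terms back to the text fragments that would be tagged.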
-
Publication number: 20110145701
Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
Type: Application
Filed: February 23, 2011
Publication date: June 16, 2011
Applicant: XEROX CORPORATION
Inventors: Hervé Déjean, Jean-Luc Meunier
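As a rough illustration of "low textual variability", the Python sketch below scores each line position near the top and bottom of the pages by the share of distinct texts found there. Everything in it is an assumption made for illustration rather than the patented procedure: the page representation (a list of line strings), the digit normalization, the threshold, and the function names.

```python
import re


def normalize(text):
    # Assumption: collapse digit runs so that "Page 3" and "Page 4" compare equal.
    return re.sub(r"\d+", "#", text.strip())


def variability(pages, position):
    """Distinct normalized texts at a line position, divided by the pages that have it."""
    texts = [normalize(p[position]) for p in pages if -len(p) <= position < len(p)]
    return len(set(texts)) / len(texts) if texts else 1.0


def header_footer_positions(pages, max_lines=2, threshold=0.5):
    """Return (header positions, footer positions) whose variability is low."""
    headers = [i for i in range(max_lines) if variability(pages, i) <= threshold]
    footers = [-(i + 1) for i in range(max_lines)
               if variability(pages, -(i + 1)) <= threshold]
    return headers, footers


pages = [
    ["My Report", "Alpha text", "more a", "Page 1"],
    ["My Report", "Beta text", "more b", "Page 2"],
    ["My Report", "Gamma text", "more c", "Page 3"],
]
print(header_footer_positions(pages))  # ([0], [-1])
```

Negative positions count from the bottom of each page, so `-1` is the last line.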
-
Patent number: 7937653
Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
Type: Grant
Filed: January 10, 2005
Date of Patent: May 3, 2011
Assignee: Xerox Corporation
Inventors: Hervé Déjean, Jean-Luc Meunier
-
Patent number: 7852499
Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.
Type: Grant
Filed: September 27, 2006
Date of Patent: December 14, 2010
Assignee: Xerox Corporation
Inventor: Hervé Déjean
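The signature idea in this abstract can be sketched in Python as follows. The sketch makes several assumptions that the abstract does not specify: the signature is taken to be the pair (font, size), proximity to an object of interest is assumed to be precomputed as a `near_object` flag, and "a substantial number" is approximated with a hypothetical minimum count and ratio.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    text: str
    font: str
    size: float
    near_object: bool  # assumed to come from a separate layout-proximity test


def signature(fragment):
    # Assumption: the fragment representation is just (font, size).
    return (fragment.font, fragment.size)


def detect_captions(fragments, min_count=2, min_ratio=0.8):
    """Return the fragments whose signature mostly occurs near objects of interest."""
    total = Counter(signature(f) for f in fragments)
    near = Counter(signature(f) for f in fragments if f.near_object)
    caption_signatures = {
        sig for sig, count in near.items()
        if count >= min_count and count / total[sig] >= min_ratio
    }
    return [f for f in fragments if signature(f) in caption_signatures]


fragments = [
    Fragment("Body paragraph one", "Times", 10.0, False),
    Fragment("Body paragraph two", "Times", 10.0, True),
    Fragment("Body paragraph three", "Times", 10.0, False),
    Fragment("Figure 1: Overview", "Arial", 8.0, True),
    Fragment("Figure 2: Details", "Arial", 8.0, True),
]
print([f.text for f in detect_captions(fragments)])
# ['Figure 1: Overview', 'Figure 2: Details']
```

The ratio test is a design choice added here so that body text, which also happens to sit next to a figure now and then, is not mistaken for a caption style.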
-
Publication number: 20100306260
Abstract: Numbered sequences detection includes (i) extracting one or more numbered item token patterns from a document comprising an ordered sequence of text units, each numbered item token pattern including an incremental portion and a fixed portion that matches at least one text unit of the document and (ii) identifying at least one numbered sequence in the document conforming with a matching numbered item token pattern of the extracted one or more numbered item token patterns. The identified at least one numbered sequence comprises an ordered sub-sequence of text units of the document that match the matching numbered item token pattern. The detection may further comprise determining that a second type of numbered sequence nests in the document between consecutive text units belonging to a numbered sequence of a first type, and optimizing one or more numbered sequences of the second type based on information provided by the determining.
Type: Application
Filed: May 29, 2009
Publication date: December 2, 2010
Applicant: Xerox Corporation
Inventor: Herve Dejean
-
Patent number: 7827484
Abstract: To correct at least one extraneous or missing space in a document, weights are assigned to tokens contained in a dictionary. Each token is defined by an ordered sequence of non-space symbols. The weights are assigned based on at least one of a token length and frequency of occurrence of the token in the document. Corrected text is generated from text of the document by applying an ordered sequence of symbol-level transformations selected from a group of symbol-level transformations including at least (i) deleting a space, (ii) inserting a space, and (iii) copying a symbol. The ordered sequence of symbol-level transformations is optimized respective to an objective function dependent upon the weights of tokens of the corrected text.
Type: Grant
Filed: September 2, 2005
Date of Patent: November 2, 2010
Assignee: Xerox Corporation
Inventors: Hervé Déjean, André Kempe
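One simplified way to picture this kind of optimization is a dynamic program over the gaps between non-space symbols, where each gap keeps, gains, or loses a space. The Python sketch below is an illustration under stated assumptions, not the patented method: the objective (squared length for dictionary tokens, zero for unknown ones, minus a fixed penalty per inserted or deleted space) and the name `correct_spaces` are hypothetical.

```python
def correct_spaces(text, dictionary, edit_penalty=1.0):
    """Re-space text to maximise token weights minus a penalty per space edit."""
    chars = text.replace(" ", "")
    n = len(chars)
    # had_space holds k when a space preceded chars[k] in the input.
    had_space, k = set(), 0
    for ch in text:
        if ch == " ":
            had_space.add(k)
        else:
            k += 1
    # spaces_upto[g]: number of original spaces at gaps 1..g.
    spaces_upto = [0] * (n + 1)
    for g in range(1, n + 1):
        spaces_upto[g] = spaces_upto[g - 1] + (g in had_space)

    def weight(token):
        # Assumption: known tokens score by squared length, unknown tokens score 0.
        return len(token) ** 2 if token in dictionary else 0

    best = [float("-inf")] * (n + 1)  # best[j]: best score for chars[:j]
    back = [0] * (n + 1)              # back[j]: start of the last token in that solution
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            deleted = spaces_upto[j - 1] - spaces_upto[i]  # spaces inside the token
            inserted = 1 if i > 0 and i not in had_space else 0
            score = best[i] + weight(chars[i:j]) - edit_penalty * (deleted + inserted)
            if score > best[j]:
                best[j], back[j] = score, i
    tokens, j = [], n
    while j > 0:
        tokens.append(chars[back[j]:j])
        j = back[j]
    return " ".join(reversed(tokens))


dictionary = {"the", "document", "is", "here", "he"}
print(correct_spaces("the doc ument is he re", dictionary))  # the document is here
print(correct_spaces("thedocument is here", dictionary))     # the document is here
```

The quadratic loop is adequate for line-sized inputs; a real system would likely bound the token length and use frequency-based weights as the abstract suggests.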
-
Patent number: 7826665
Abstract: In a system for updating a contacts database (42, 46), a portable imager (12) acquires a digital business card image (10). An image segmenter (16) extracts text image segments from the digital business card image. An optical character recognizer (OCR) (26) generates one or more textual content candidates for each text image segment. A scoring processor (36) scores each textual content candidate based on results of database queries respective to the textual content candidates. A content selector (38) selects a textual content candidate for each text image segment based at least on the assigned scores. An interface (50) is configured to update the contacts list based on the selected textual content candidates.
Type: Grant
Filed: December 12, 2005
Date of Patent: November 2, 2010
Assignee: Xerox Corporation
Inventors: Marco Bressan, Hervé Dejean, Christopher R. Dance
-
Patent number: 7797622
Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.
Type: Grant
Filed: November 15, 2006
Date of Patent: September 14, 2010
Assignee: Xerox Corporation
Inventors: Hervé Déjean, Jean-Luc Meunier
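The sequence-finding step can be pictured with a small Python sketch that tracks runs of numbers increasing by one from page to page. It is a simplification under explicit assumptions: only Arabic numerals are considered (the abstract allows other numbering schemes), each page is a list of fragment strings, and the function name and `min_length` parameter are hypothetical. Selecting a covering subset of sequences is not shown.

```python
import re


def page_number_sequences(pages, min_length=3):
    """Return (start page index, numbers) for runs that increase by 1 on each page."""
    finished = []
    active = {}  # expected next value -> (start page index, values so far)
    for index, fragments in enumerate(pages):
        numbers = {int(f) for f in fragments if re.fullmatch(r"\d+", f.strip())}
        next_active = {}
        for value in numbers:
            # Continue a run that expected this value, or start a new run here.
            start, run = active.pop(value, (index, []))
            next_active[value + 1] = (start, run + [value])
        finished.extend(active.values())  # runs that were not continued on this page
        active = next_active
    finished.extend(active.values())
    return [(start, run) for start, run in finished if len(run) >= min_length]


pages = [
    ["Title", "1"],
    ["text", "2", "2006"],
    ["3", "more text"],
    ["no number here"],
    ["10"],
    ["11"],
    ["12"],
]
print(page_number_sequences(pages))  # [(0, [1, 2, 3]), (4, [10, 11, 12])]
```

The stray "2006" starts a run of its own but is dropped by the length filter, which is the kind of noise a sequence-based approach is meant to tolerate.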
-
Patent number: 7788085
Abstract: String replacement is performed in text using linguistic processing. The linguistic processing identifies the existence of direct or indirect links between the string to be replaced and other strings in the text. Morphological, syntactic, anaphoric, or semantic inconsistencies, which are introduced in strings with the identified direct or indirect links to the string that is to be replaced are detected and corrected.
Type: Grant
Filed: December 17, 2004
Date of Patent: August 31, 2010
Assignee: Xerox Corporation
Inventors: Caroline Brun, Herve Dejean, Caroline Hagege
-
Patent number: 7743327
Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.
Type: Grant
Filed: February 23, 2006
Date of Patent: June 22, 2010
Assignee: Xerox Corporation
Inventors: Jean-Luc Meunier, Hervé Déjean
-
Patent number: 7730396
Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
Type: Grant
Filed: November 13, 2006
Date of Patent: June 1, 2010
Assignee: Xerox Corporation
Inventors: Boris Chidlovskii, Hervé Dejean
-
Publication number: 20100107045
Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.
Type: Application
Filed: October 27, 2008
Publication date: April 29, 2010
Applicant: XEROX CORPORATION
Inventors: Katja Filippova, Herve Dejean
-
Patent number: 7693848
Abstract: A method and apparatus is provided for converting a document in a first format essentially comprising a flat layout structure into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to common textual properties organizational content common in documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format and the document content elements are organized by hierarchical dependency from the predetermined categories wherein the organized document elements comprise the desired structured document format.
Type: Grant
Filed: January 10, 2005
Date of Patent: April 6, 2010
Assignee: Xerox Corporation
Inventors: Hervé Déjean, Veronika Lux, Sandrine Ribeau
-
Patent number: 7620539
Abstract: Various methods formulated using a geometric interpretation for identifying bilingual pairs in comparable corpora using a bilingual dictionary are disclosed. The methods may be used separately or in combination to compute the similarity between bilingual pairs.
Type: Grant
Filed: November 1, 2004
Date of Patent: November 17, 2009
Assignee: Xerox Corporation
Inventors: Eric Gaussier, Jean-Michel Renders, Herve Dejean, Cyril Goutte, Irina Matveeva
-
Publication number: 20090192956
Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
Type: Application
Filed: January 28, 2008
Publication date: July 30, 2009
Applicant: XEROX CORPORATION
Inventors: Herve Dejean, Jean-Luc Meunier
-
Publication number: 20090110268
Abstract: An initial organizational table for a document is determined based on textual similarity between entries of the organizational table and target text fragments and not taking into account text formatting. A classifier is trained to identify text fragment pairs consisting of entries of the organizational table and corresponding target text fragments based at least in part on text formatting features. The training employs a training set of examples annotated based on the initial organizational table. The initial organizational table is updated using the trained classifier.
Type: Application
Filed: October 25, 2007
Publication date: April 30, 2009
Applicant: XEROX CORPORATION
Inventors: Herve Dejean, Jean-Luc Meunier
-
Publication number: 20090046918
Abstract: To perform notes detection, candidate reference marks are identified in a document. A starting note zone is identified in the document. A pair of similar reference marks is identified from the candidate reference marks including a first reference mark in the note zone and a second reference mark outside the note zone. The document is marked up to indicate a note associated with the first and second reference marks.
Type: Application
Filed: August 13, 2007
Publication date: February 19, 2009
Inventor: Herve Dejean
-
Publication number: 20080114757
Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.
Type: Application
Filed: November 15, 2006
Publication date: May 15, 2008
Inventors: Herve Dejean, Jean-Luc Meunier