Patents by Inventor Boris Chidlovskii

Boris Chidlovskii has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20110106732
    Abstract: Systems and methods are described that facilitate categorizing a group of linked web pages. A plurality of web pages each contains at least one link to another page within the group. A feature analyzer evaluates features associated with the one or more web pages to identify content, layout, links and/or metadata associated with the one or more web pages and identifies features that are labeled and features that are unlabeled. A graphing component creates a vector associated with each web page feature wherein vectors for unlabeled features are determined by their graphical proximity to features that are labeled. A co-training component receives the graph of vectors from the graphing component and leverages the disparate web page features to categorize each aspect of each feature of the page. A page categorizer receives aspect categorization information from the co-training component and categorizes the web page based at least upon this information.
    Type: Application
    Filed: October 29, 2009
    Publication date: May 5, 2011
    Applicant: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Publication number: 20110103682
    Abstract: A classification apparatus, method, and computer program product for multi-modality classification are disclosed. For each of a plurality of modalities, the method includes extracting features from objects in a set of objects. The objects include electronic mail messages. A representation of each object for that modality is generated, based on its extracted features. At least one of the plurality of modalities is a social network modality in which social network features are extracted from a social network implicit in the set of electronic mail messages. A classifier system is trained based on class labels of a subset of the set of objects and on the representations generated for each of the modalities. With the trained classifier system, labels are predicted for unlabeled objects in the set of objects.
    Type: Application
    Filed: October 29, 2009
    Publication date: May 5, 2011
    Applicant: Xerox Corporation
    Inventors: Boris Chidlovskii, Matthijs Hovelynck
  • Patent number: 7890438
    Abstract: A document annotation method includes modeling data elements of an input document and dependencies between the data elements as a dependency network. Static features of at least some of the data elements are defined, each expressing a relationship between a characteristic of the data element and its label. Dynamic features are defined which define links between an element and labels of the element and of a second element. Parameters of a collective probabilistic model for the document are learned, each expressing a conditional probability that a first data element should be labeled with information derived from a label of a neighbor data element linked to the first data element by a dynamic feature. The learning includes decomposing a globally trained model into a set of local learning models. The local learning models each employ static features to generate estimations of the neighbor element labels for at least one of the data elements.
    Type: Grant
    Filed: December 12, 2007
    Date of Patent: February 15, 2011
    Assignee: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Patent number: 7882119
    Abstract: A method for aligning documents which may be in different XML formats includes inputting source and target leaves of source and target documents in first and second tree-structured formats and assigning a cost to each of a plurality of matches. Each match may include a source leaf and a target leaf or be an unmatched source or target leaf. Matches are identified for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches. From the identified matches, groups of two or more matches are identified which have a leaf in common. From the groups, probable matches are identified in which more than one target leaf is matched with at least one source leaf or more than one source leaf is matched with a target leaf. An alignment between leaves of the target document and leaves of the source document is output which includes the probable matches.
    Type: Grant
    Filed: December 22, 2005
    Date of Patent: February 1, 2011
    Assignee: Xerox Corporation
    Inventors: Andre Bergholz, Boris Chidlovskii
  • Publication number: 20110022599
    Abstract: A computer-based method and a system for indexing, querying, and ranking documents based on layout are provided. The method includes providing a plurality of documents to computer memory, extracting layout blocks from the provided documents, clustering the layout blocks into a plurality of layout block clusters, computing a representative block for each of the layout block clusters, generating a document index for each provided document based on the layout blocks of the document and the computed representative blocks, clustering the created document indexes into a plurality of document index clusters, and generating a representative cluster index for each of the document index clusters. The indexes generated, together with the representative blocks and document index clusters, can be stored and used for retrieval of documents responsive to a layout query.
    Type: Application
    Filed: September 9, 2009
    Publication date: January 27, 2011
    Applicant: Xerox Corporation
    Inventors: Boris Chidlovskii, Loïc M. Lecerf
  • Publication number: 20100306141
    Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.
    Type: Application
    Filed: June 3, 2010
    Publication date: December 2, 2010
    Applicant: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Publication number: 20100205181
    Abstract: A computer-performed method models a spatial index having n spatial regions defined in a multidimensional space using a tree-based model representing an infinite number of arrangements of n spatial regions in the multidimensional space allowable by the spatial index using a finite number of tree representations, computes an average retrieval complexity measure for content retrieval using the spatial index based on the tree-based model, and provides a spatial index recommendation based on the average retrieval complexity measure. In some embodiments, a spatial index selection module selects the spatial index based on average retrieval complexity measures for candidate spatial indices that are functionally dependent upon a number of spatial regions to be defined by the spatial index.
    Type: Application
    Filed: February 9, 2009
    Publication date: August 12, 2010
    Applicant: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Patent number: 7756800
    Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.
    Type: Grant
    Filed: December 14, 2006
    Date of Patent: July 13, 2010
    Assignee: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Publication number: 20100150448
    Abstract: Aspects of the exemplary embodiment relate to a method and apparatus for automatically identifying features that are suitable for use by a classifier in assigning class labels to text sequences extracted from noisy documents. The exemplary method includes receiving a dataset of text sequences, automatically identifying a set of patterns in the text sequences, and filtering the patterns to generate a set of features. The filtering includes at least one of filtering out redundant patterns and filtering out irrelevant patterns. The method further includes outputting at least some of the features in the set of features, optionally after fusing features which are determined not to affect the classifier's accuracy if they are merged.
    Type: Application
    Filed: December 17, 2008
    Publication date: June 17, 2010
    Applicant: Xerox Corporation
    Inventors: Loic Lecerf, Boris Chidlovskii
  • Patent number: 7730396
    Abstract: A system and method convert legacy and proprietary documents into Extensible Markup Language (XML) format, treating the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components, namely path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
    Type: Grant
    Filed: November 13, 2006
    Date of Patent: June 1, 2010
    Assignee: Xerox Corporation
    Inventors: Boris Chidlovskii, Hervé Dejean
  • Publication number: 20090271338
    Abstract: In a feature filtering approach, a set of relevant features and a set of training objects classified respective to a set of classes are provided. A candidate feature and a second feature are selected from the set of relevant features. An approximate Markov blanket criterion is computed that is indicative of whether the candidate feature is redundant in view of the second feature. The approximate Markov blanket criterion includes at least one dependency on less than the entire set of classes. An optimized set of relevant features is defined, consisting of a sub-set of the set of relevant features from which features indicated as redundant by the selecting and computing are removed.
    Type: Application
    Filed: April 23, 2008
    Publication date: October 29, 2009
    Applicant: Xerox Corporation
    Inventors: Boris Chidlovskii, Loic Lecerf
  • Publication number: 20090157572
    Abstract: A document annotation method includes modeling data elements of an input document and dependencies between the data elements as a dependency network. Static features of at least some of the data elements are defined, each expressing a relationship between a characteristic of the data element and its label. Dynamic features are defined which define links between an element and labels of the element and of a second element. Parameters of a collective probabilistic model for the document are learned, each expressing a conditional probability that a first data element should be labeled with information derived from a label of a neighbor data element linked to the first data element by a dynamic feature. The learning includes decomposing a globally trained model into a set of local learning models. The local learning models each employ static features to generate estimations of the neighbor element labels for at least one of the data elements.
    Type: Application
    Filed: December 12, 2007
    Publication date: June 18, 2009
    Inventor: Boris Chidlovskii
  • Publication number: 20090018995
    Abstract: A clustering system includes a visual mapping sub-system configured to display an N-dimensional to two- or three-dimensional mapping of items to be clustered, where N is greater than three, the mapping having mapping parameters for the N-dimensions. A user interface sub-system is configured to receive user inputted values for the mapping parameters, user inputted values selecting whether selected mapping parameters are fixed or adjustable, and user inputted values associating selected items with selected groups. An adjustment sub-system is configured to adjust the adjustable mapping parameters, without adjusting any fixed mapping parameters, to improve a measure of distinctness of one or more groups of items in the two- or three-dimensional mapping.
    Type: Application
    Filed: July 13, 2007
    Publication date: January 15, 2009
    Inventors: Boris Chidlovskii, Loic Lecerf
  • Patent number: 7440951
    Abstract: A method of information extraction from a Web page using a broken wrapper includes using the wrapper to extract strings from the Web page parsed in forward direction; analyzing the extracted strings according to a set of rules for assigning labels associated with the wrapper; assigning labels to those strings which satisfy the label rules; classifying the extracted strings based on content features of the labeled extracted strings; and validating those labeled extracted strings which satisfy the label rules within some threshold value.
    Type: Grant
    Filed: December 5, 2005
    Date of Patent: October 21, 2008
    Assignee: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Patent number: 7440974
    Abstract: A method of information extraction from a Web page using an initial wrapper which has become partially inoperative, wherein the initial wrapper comprises an initial set of rules for extracting information and for assigning labels from a wrapper set of labels to the extracted information, includes using the initial set of rules to extract strings from the Web page parsed in forward direction; analyzing the extracted strings according to the initial set of rules for assigning labels associated with the wrapper; assigning labels to those strings which satisfy the label rules; using the initial set of rules to extract strings from the Web page in backward (opposite) direction; analyzing the extracted strings according to the set of rules for assigning labels associated with the wrapper; and assigning labels to those unlabeled strings which satisfy the label rules.
    Type: Grant
    Filed: December 5, 2005
    Date of Patent: October 21, 2008
    Assignee: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Patent number: 7440967
    Abstract: A method for converting a legacy document into an XML document includes decomposing the conversion process into a plurality of individual conversion tasks. A legacy document is decomposed into a plurality of document portions. A target XML schema including a plurality of schema components is provided. Local schemas are generated from the target XML schema, wherein each local schema includes at least one of the schema components in the target XML schema. A plurality of conversion tasks is generated by associating a local schema and an applicable document portion, wherein each conversion task associates data from the applicable document portion with the applicable schema component in the local schema. For each conversion task, a conversion method is selected and the conversion method is performed on the applicable document portion and local schema. Finally, the results of all the individual conversion tasks are assembled into a target XML document.
    Type: Grant
    Filed: November 10, 2004
    Date of Patent: October 21, 2008
    Assignee: Xerox Corporation
    Inventor: Boris Chidlovskii
  • Publication number: 20080147574
    Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.
    Type: Application
    Filed: December 14, 2006
    Publication date: June 19, 2008
    Inventor: Boris Chidlovskii
  • Patent number: 7296223
    Abstract: A method for creating a structured document, wherein a structured document comprises a plurality of content elements wrapped in pairs of tags, includes parsing a document of a particular type containing content into a plurality of content elements; and for each content element, suggesting an optimal tag according to a tag suggestion procedure. The tag suggestion procedure includes providing sample data which has been converted into a structured sample document; deriving a set of tags from the structured sample document; and evaluating the set of tags according to tag suggestion criteria to determine an optimal tag for the content element. The optimal tag may be a single tag or a pattern of tags which maximizes a similarity function with patterns found in the sample data.
    Type: Grant
    Filed: June 27, 2003
    Date of Patent: November 13, 2007
    Assignee: Xerox Corporation
    Inventors: Boris Chidlovskii, Hervé Déjean
  • Publication number: 20070150801
    Abstract: A document annotation system 10 includes a graphical user interface 22 used by an annotator 30 to annotate documents. An active learning component 24 trains an annotation model and proposes annotations to documents based on the annotation model. A request handler 26, 32, 34, 42 conveys annotation requests from the graphical user interface 22 to the active learning component 24, conveys proposed annotations from the active learning component 24 to the graphical user interface 22, and selectably conveys evaluation requests from the graphical user interface 22 to a domain expert 40. During annotation, at least some low probability proposed annotations are presented to the annotator 30 by the graphical user interface 22. The presented low probability proposed annotations enhance training of the annotation model by the active learning component 24.
    Type: Application
    Filed: December 23, 2005
    Publication date: June 28, 2007
    Inventors: Boris Chidlovskii, Thierry Jacquin
  • Publication number: 20070150443
    Abstract: A method for aligning documents which may be in different XML formats includes inputting source and target leaves of source and target documents in first and second tree-structured formats and assigning a cost to each of a plurality of matches. Each match may include a source leaf and a target leaf or be an unmatched source or target leaf. Matches are identified for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches. From the identified matches, groups of two or more matches are identified which have a leaf in common. From the groups, probable matches are identified in which more than one target leaf is matched with at least one source leaf or more than one source leaf is matched with a target leaf. An alignment between leaves of the target document and leaves of the source document is output which includes the probable matches.
    Type: Application
    Filed: December 22, 2005
    Publication date: June 28, 2007
    Inventors: Andre Bergholz, Boris Chidlovskii
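
    Illustrative sketch (not taken from the patent text): the leaf alignment described in the abstract above can be approximated in a few lines of Python. Everything named here is an assumption made for illustration: the word-overlap cost in leaf_cost, the UNMATCHED_COST penalty, and the simplification that each leaf independently picks its cheapest option instead of solving a global minimum-cost problem. The sketch only shows how matches that share a leaf can be grouped into one-to-many or many-to-one candidates.

```python
from collections import defaultdict

UNMATCHED_COST = 0.8  # assumed penalty for leaving a leaf unmatched


def leaf_cost(a: str, b: str) -> float:
    """Hypothetical match cost: 1 minus the Jaccard overlap of the word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 1.0
    return 1.0 - len(wa & wb) / len(wa | wb)


def cheapest_matches(source, target):
    """Give every leaf its cheapest option: a partner leaf, or None (unmatched).

    This per-leaf choice is a simplification; it guarantees that each leaf
    appears in at least one match but does not minimize the total cost globally.
    """
    matches = set()
    for s in source:
        best = min(target, key=lambda t: leaf_cost(s, t), default=None)
        if best is not None and leaf_cost(s, best) < UNMATCHED_COST:
            matches.add((s, best))
        else:
            matches.add((s, None))
    for t in target:
        best = min(source, key=lambda s: leaf_cost(s, t), default=None)
        if best is not None and leaf_cost(best, t) < UNMATCHED_COST:
            matches.add((best, t))
        else:
            matches.add((None, t))
    return matches


def group_by_shared_leaf(matches):
    """Collect matches that share a leaf into 1-to-N and N-to-1 candidates."""
    by_source, by_target = defaultdict(set), defaultdict(set)
    for s, t in matches:
        if s is not None and t is not None:
            by_source[s].add(t)
            by_target[t].add(s)
    one_to_many = {s: sorted(ts) for s, ts in by_source.items() if len(ts) > 1}
    many_to_one = {t: sorted(ss) for t, ss in by_target.items() if len(ss) > 1}
    return one_to_many, many_to_one


# Toy leaves: the source stores a full name, the target splits it in two.
source = ["John Smith", "Paris France"]
target = ["John", "Smith", "Paris France"]
matches = cheapest_matches(source, target)
one_to_many, many_to_one = group_by_shared_leaf(matches)
print(one_to_many)  # {'John Smith': ['John', 'Smith']}
print(many_to_one)  # {}
```

    In this toy input the source leaf "John Smith" is matched with both target leaves "John" and "Smith", which is the kind of one-to-many probable match the abstract refers to. A fuller implementation would replace the per-leaf choice with a true minimum-cost optimization over all matches.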