Patents by Inventor Boris Chidlovskii
Boris Chidlovskii has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20110106732
Abstract: Systems and methods are described that facilitate categorizing a group of linked web pages. A plurality of web pages each contains at least one link to another page within the group. A feature analyzer evaluates features associated with the one or more web pages to identify content, layout, links and/or metadata associated with the one or more web pages and identifies features that are labeled and features that are unlabeled. A graphing component creates a vector associated with each web page feature wherein vectors for unlabeled features are determined by their graphical proximity to features that are labeled. A co-training component receives the graph of vectors from the graphing component and leverages the disparate web page features to categorize each aspect of each feature of the page. A page categorizer receives aspect categorization information from the co-training component and categorizes the web page based at least upon this information.
Type: Application
Filed: October 29, 2009
Publication date: May 5, 2011
Applicant: Xerox Corporation
Inventor: Boris Chidlovskii
-
Publication number: 20110103682
Abstract: A classification apparatus, method, and computer program product for multi-modality classification are disclosed. For each of a plurality of modalities, the method includes extracting features from objects in a set of objects. The objects include electronic mail messages. A representation of each object for that modality is generated, based on its extracted features. At least one of the plurality of modalities is a social network modality in which social network features are extracted from a social network implicit in the set of electronic mail messages. A classifier system is trained based on class labels of a subset of the set of objects and on the representations generated for each of the modalities. With the trained classifier system, labels are predicted for unlabeled objects in the set of objects.
Type: Application
Filed: October 29, 2009
Publication date: May 5, 2011
Applicant: Xerox Corporation
Inventors: Boris CHIDLOVSKII, Matthijs HOVELYNCK
-
Patent number: 7890438
Abstract: A document annotation method includes modeling data elements of an input document and dependencies between the data elements as a dependency network. Static features of at least some of the data elements are defined, each expressing a relationship between a characteristic of the data element and its label. Dynamic features are defined which define links between an element and labels of the element and of a second element. Parameters of a collective probabilistic model for the document are learned, each expressing a conditional probability that a first data element should be labeled with information derived from a label of a neighbor data element linked to the first data element by a dynamic feature. The learning includes decomposing a globally trained model into a set of local learning models. The local learning models each employ static features to generate estimations of the neighbor element labels for at least one of the data elements.
Type: Grant
Filed: December 12, 2007
Date of Patent: February 15, 2011
Assignee: Xerox Corporation
Inventor: Boris Chidlovskii
-
Patent number: 7882119
Abstract: A method for aligning documents which may be in different XML formats includes inputting source and target leaves of source and target documents in first and second tree structured formats and assigning a cost to each of a plurality of matches. Each match may include a source leaf and a target leaf or be an unmatched source or target leaf. Matches are identified for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches. From the identified matches, groups of two or more matches are identified which have a leaf in common. From the groups, probable matches are identified in which more than one target leaf is matched with at least one source leaf or more than one source leaf is matched with a target leaf. An alignment between leaves of the target document and leaves of the source document is output which includes the probable matches.
Type: Grant
Filed: December 22, 2005
Date of Patent: February 1, 2011
Assignee: Xerox Corporation
Inventors: Andre Bergholz, Boris Chidlovskii
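The matching step in this abstract can be illustrated with a minimal sketch. It is not the patented procedure: it simply brute-forces, for a handful of leaves, the cheapest set of matches in which every leaf appears at least once, then groups matches that share a leaf. The function names, the leaf names, and all cost values are hypothetical.

```python
from itertools import combinations, product

def align_leaves(source, target, pair_cost, unmatched_cost):
    """Brute-force the cheapest set of matches that covers every leaf.

    Exponential in the number of candidates, so only usable on tiny inputs.
    """
    # Candidate matches: every (source, target) pair plus "unmatched" options.
    candidates = [((s, t), pair_cost(s, t)) for s, t in product(source, target)]
    candidates += [((s, None), unmatched_cost) for s in source]
    candidates += [((None, t), unmatched_cost) for t in target]

    all_leaves = {("src", s) for s in source} | {("tgt", t) for t in target}
    best_matches, best_cost = [], float("inf")
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            covered = set()
            for (s, t), _ in subset:
                if s is not None:
                    covered.add(("src", s))
                if t is not None:
                    covered.add(("tgt", t))
            cost = sum(c for _, c in subset)
            if covered == all_leaves and cost < best_cost:
                best_matches, best_cost = [m for m, _ in subset], cost
    return best_matches, best_cost

def one_to_many(matches):
    """Group matched pairs by shared leaf; keep leaves with two or more partners."""
    by_source, by_target = {}, {}
    for s, t in matches:
        if s is not None and t is not None:
            by_source.setdefault(s, set()).add(t)
            by_target.setdefault(t, set()).add(s)
    return ({s: ts for s, ts in by_source.items() if len(ts) > 1},
            {t: ss for t, ss in by_target.items() if len(ss) > 1})

# Hypothetical costs: low cost means the two leaves look alike.
COSTS = {("title", "title"): 0, ("author", "creator"): 1, ("author", "name"): 1}

def cost(s, t):
    return COSTS.get((s, t), 5)

matches, total = align_leaves(["title", "author"],
                              ["title", "creator", "name"], cost, 3)
src_groups, tgt_groups = one_to_many(matches)
```

In this toy input the source leaf "author" ends up matched with both "creator" and "name", which is the kind of one-to-many group the abstract calls a probable match. A practical implementation would replace the exhaustive search with a proper minimum-cost matching algorithm.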
-
Publication number: 20110022599
Abstract: A computer-based method and a system for indexing, querying, and ranking documents based on layout are provided. The method includes providing a plurality of documents to computer memory, extracting layout blocks from the provided documents, clustering the layout blocks into a plurality of layout block clusters, computing a representative block for each of the layout block clusters, generating a document index for each provided document based on the layout blocks of the document and the computed representative blocks, clustering the created document indexes into a plurality of document index clusters, and generating a representative cluster index for each of the document index clusters. The indexes generated, together with the representative blocks and document index clusters, can be stored and used for retrieval of documents responsive to a layout query.
Type: Application
Filed: September 9, 2009
Publication date: January 27, 2011
Applicant: Xerox Corporation
Inventors: Boris Chidlovskii, Loïc M. Lecerf
-
Publication number: 20100306141
Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.
Type: Application
Filed: June 3, 2010
Publication date: December 2, 2010
Applicant: Xerox Corporation
Inventor: Boris Chidlovskii
-
Publication number: 20100205181
Abstract: A computer performed method models a spatial index having n spatial regions defined in a multidimensional space using a tree-based model representing an infinite number of arrangements of n spatial regions in the multidimensional space allowable by the spatial index using a finite number of tree representations, computes an average retrieval complexity measure for content retrieval using the spatial index based on the tree based model, and provides a spatial index recommendation based on the average retrieval complexity measure. In some embodiments a spatial index selection module selects the spatial index based on average retrieval complexity measures for candidate spatial indices that are functionally dependent upon a number of spatial regions to be defined by the spatial index.
Type: Application
Filed: February 9, 2009
Publication date: August 12, 2010
Applicant: Xerox Corporation
Inventor: Boris Chidlovskii
-
Patent number: 7756800
Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.
Type: Grant
Filed: December 14, 2006
Date of Patent: July 13, 2010
Assignee: Xerox Corporation
Inventor: Boris Chidlovskii
-
Publication number: 20100150448
Abstract: Aspects of the exemplary embodiment relate to a method and apparatus for automatically identifying features that are suitable for use by a classifier in assigning class labels to text sequences extracted from noisy documents. The exemplary method includes receiving a dataset of text sequences, automatically identifying a set of patterns in the text sequences, and filtering the patterns to generate a set of features. The filtering includes at least one of filtering out redundant patterns and filtering out irrelevant patterns. The method further includes outputting at least some of the features in the set of features, optionally after fusing features which are determined not to affect the classifier's accuracy if they are merged.
Type: Application
Filed: December 17, 2008
Publication date: June 17, 2010
Applicant: Xerox Corporation
Inventors: Loic Lecerf, Boris Chidlovskii
-
Patent number: 7730396
Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.
Type: Grant
Filed: November 13, 2006
Date of Patent: June 1, 2010
Assignee: Xerox Corporation
Inventors: Boris Chidlovskii, Hervé Dejean
-
Publication number: 20090271338
Abstract: In a feature filtering approach, a set of relevant features and a set of training objects classified respective to a set of classes are provided. A candidate feature and a second feature are selected from the set of relevant features. An approximate Markov blanket criterion is computed that is indicative of whether the candidate feature is redundant in view of the second feature. The approximate Markov blanket criterion includes at least one dependency on less than the entire set of classes. An optimized set of relevant features is defined, consisting of a sub-set of the set of relevant features from which features indicated as redundant by the selecting and computing are removed.
Type: Application
Filed: April 23, 2008
Publication date: October 29, 2009
Applicant: XEROX CORPORATION
Inventors: Boris Chidlovskii, Loic Lecerf
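As a rough illustration of redundancy filtering with an approximate Markov blanket, the sketch below uses the commonly cited symmetric-uncertainty test: a candidate is treated as redundant when another feature is at least as informative about the labels and is at least as correlated with the candidate as the candidate is with the labels. This is an assumption made for illustration. It scores against the full label set, so it does not reproduce the class-restricted criterion the abstract describes, and all function and feature names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """Normalized mutual information in [0, 1] between two discrete sequences."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))
    return 2 * (hx + hy - hxy) / (hx + hy)

def is_redundant(candidate, other, labels):
    """Approximate Markov blanket test: does `other` subsume `candidate`?"""
    su_candidate = symmetric_uncertainty(candidate, labels)
    return (symmetric_uncertainty(other, labels) >= su_candidate
            and symmetric_uncertainty(candidate, other) >= su_candidate)

def filter_features(features, labels):
    """Keep features, most relevant first, that no kept feature makes redundant."""
    ranked = sorted(features,
                    key=lambda name: symmetric_uncertainty(features[name], labels),
                    reverse=True)
    kept = []
    for name in ranked:
        if not any(is_redundant(features[name], features[k], labels) for k in kept):
            kept.append(name)
    return kept

# Toy data: "x_copy" duplicates "x"; "y" carries complementary information.
labels = [0, 1, 2, 3]
features = {"x": [0, 0, 1, 1], "x_copy": [0, 0, 1, 1], "y": [0, 1, 0, 1]}
kept = filter_features(features, labels)
```

Here the duplicate feature is dropped while the complementary one survives. The sketch assumes the input features have already been screened for relevance, as the abstract does; a feature with zero relevance would otherwise be discarded by the same test.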
-
Publication number: 20090157572
Abstract: A document annotation method includes modeling data elements of an input document and dependencies between the data elements as a dependency network. Static features of at least some of the data elements are defined, each expressing a relationship between a characteristic of the data element and its label. Dynamic features are defined which define links between an element and labels of the element and of a second element. Parameters of a collective probabilistic model for the document are learned, each expressing a conditional probability that a first data element should be labeled with information derived from a label of a neighbor data element linked to the first data element by a dynamic feature. The learning includes decomposing a globally trained model into a set of local learning models. The local learning models each employ static features to generate estimations of the neighbor element labels for at least one of the data elements.
Type: Application
Filed: December 12, 2007
Publication date: June 18, 2009
Inventor: Boris Chidlovskii
-
Publication number: 20090018995
Abstract: A clustering system includes a visual mapping sub-system configured to display an N-dimensional to two- or three-dimensional mapping of items to be clustered, where N is greater than three, the mapping having mapping parameters for the N-dimensions. A user interface sub-system is configured to receive user inputted values for the mapping parameters, user inputted values selecting whether selected mapping parameters are fixed or adjustable, and user inputted values associating selected items with selected groups. An adjustment sub-system is configured to adjust the adjustable mapping parameters, without adjusting any fixed mapping parameters, to improve a measure of distinctness of one or more groups of items in the two- or three-dimensional mapping.
Type: Application
Filed: July 13, 2007
Publication date: January 15, 2009
Inventors: Boris Chidlovskii, Loic Lecerf
-
Patent number: 7440951
Abstract: A method of information extraction from a Web page using a broken wrapper includes using the wrapper to extract strings from the Web page parsed in forward direction; analyzing the extracted strings according to a set of rules for assigning labels associated with the wrapper; assigning labels to those strings which satisfy the label rules; classifying the extracted strings based on content features of the labeled extracted strings; and validating those labeled extracted strings which satisfy the label rules within some threshold value.
Type: Grant
Filed: December 5, 2005
Date of Patent: October 21, 2008
Assignee: Xerox Corporation
Inventor: Boris Chidlovskii
-
Patent number: 7440974
Abstract: A method of information extraction from a Web page using an initial wrapper which has become partially inoperative, wherein the initial wrapper comprises an initial set of rules for extracting information and for assigning labels from a wrapper set of labels to the extracted information, includes using the initial set of rules to extract strings from the Web page parsed in forward direction; analyzing the extracted strings according to the initial set of rules for assigning labels associated with the wrapper; assigning labels to those strings which satisfy the label rules; using the initial set of rules to extract strings from the Web page in backward/(opposite) direction; analyzing the extracted strings according to the set of rules for assigning labels associated with the wrappers; and assigning labels to those unlabeled strings which satisfy the label rules.
Type: Grant
Filed: December 5, 2005
Date of Patent: October 21, 2008
Assignee: Xerox Corporation
Inventor: Boris Chidlovskii
-
Patent number: 7440967
Abstract: A method for converting a legacy document into an XML document, includes decomposing the conversion process into a plurality of individual conversion tasks. A legacy document is decomposed into a plurality of document portions. A target XML schema including a plurality of schema components is provided. Local schema are generated from the target XML schema, wherein each local schema includes at least one of the schema components in the target XML schema. A plurality of conversion tasks is generated by associating a local schema and an applicable document portion, wherein each conversion task associates data from the applicable document portion with the applicable schema component in the local schema. For each conversion task, a conversion method is selected and the conversion method is performed on the applicable document portion and local schema. Finally, the results of all the individual conversion tasks are assembled into a target XML document.
Type: Grant
Filed: November 10, 2004
Date of Patent: October 21, 2008
Assignee: Xerox Corporation
Inventor: Boris Chidlovskii
-
Publication number: 20080147574
Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.
Type: Application
Filed: December 14, 2006
Publication date: June 19, 2008
Inventor: Boris Chidlovskii
-
Patent number: 7296223
Abstract: A method for creating a structured document, wherein a structured document comprises a plurality of content elements wrapped in pairs of tags, includes parsing a document of a particular type containing content into a plurality of content elements; and for each content element, suggesting an optimal tag according to a tag suggestion procedure. The tag suggestion procedure includes providing sample data which has been converted into a structured sample document; deriving a set of tags from the structured sample document; evaluating the set of tags according to tag suggestion criteria to determine an optimal tag for the content element. The optimal tag may be a single tag or a pattern of tags which maximizes a similarity function with patterns found in the sample data.
Type: Grant
Filed: June 27, 2003
Date of Patent: November 13, 2007
Assignee: Xerox Corporation
Inventors: Boris Chidlovskii, Hervé Déjean
-
Publication number: 20070150801
Abstract: A document annotation system 10 includes a graphical user interface 22 used by an annotator 30 to annotate documents. An active learning component 24 trains an annotation model and proposes annotations to documents based on the annotation model. A request handler 26, 32, 34, 42 conveys annotation requests from the graphical user interface 22 to the active learning component 24, conveys proposed annotations from the active learning component 24 to the graphical user interface 22, and selectably conveys evaluation requests from the graphical user interface 22 to a domain expert 40. During annotation, at least some low probability proposed annotations are presented to the annotator 30 by the graphical user interface 22. The presented low probability proposed annotations enhance training of the annotation model by the active learning component 24.
Type: Application
Filed: December 23, 2005
Publication date: June 28, 2007
Inventors: Boris Chidlovskii, Thierry Jacquin
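The idea of surfacing low probability proposals to the annotator resembles least-confidence sampling in active learning. The following is a minimal sketch under that assumption, not the system described in the abstract; the Proposal type, the select_for_review function, and the sample values are all hypothetical.

```python
from typing import List, NamedTuple

class Proposal(NamedTuple):
    element: str        # the document element being annotated
    label: str          # the label proposed by the model
    probability: float  # the model's probability for that label

def select_for_review(proposals: List[Proposal], budget: int) -> List[Proposal]:
    """Return the `budget` least-confident proposals, lowest probability first."""
    return sorted(proposals, key=lambda p: p.probability)[:budget]

proposals = [
    Proposal("line 1", "title", 0.97),
    Proposal("line 2", "author", 0.41),
    Proposal("line 3", "abstract", 0.88),
    Proposal("line 4", "date", 0.52),
]
to_review = select_for_review(proposals, budget=2)
```

The annotator's corrections on the selected items would then be fed back as training data, which is where such a loop gets its benefit: the model is corrected where it is least sure.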
-
Publication number: 20070150443
Abstract: A method for aligning documents which may be in different XML formats includes inputting source and target leaves of source and target documents in first and second tree structured formats and assigning a cost to each of a plurality of matches. Each match may include a source leaf and a target leaf or be an unmatched source or target leaf. Matches are identified for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches. From the identified matches, groups of two or more matches are identified which have a leaf in common. From the groups, probable matches are identified in which more than one target leaf is matched with at least one source leaf or more than one source leaf is matched with a target leaf. An alignment between leaves of the target document and leaves of the source document is output which includes the probable matches.
Type: Application
Filed: December 22, 2005
Publication date: June 28, 2007
Inventors: Andre Bergholz, Boris Chidlovskii