Patents by Inventor Boris Chidlovskii

Boris Chidlovskii has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

METHOD FOR CATEGORIZING LINKED DOCUMENTS BY CO-TRAINED LABEL EXPANSION

Publication number: 20110106732

Abstract: Systems and methods are described that facilitate categorizing a group of linked web pages. A plurality of web pages each contains at least one link to another page within the group. A feature analyzer evaluates features associated with the one or more web pages to identify content, layout, links and/or metadata associated with the one or more web pages and identifies features that are labeled and features that are unlabeled. A graphing component creates a vector associated with each web page feature wherein vectors for unlabeled features are determined by their graphical proximity to features that are labeled. A co-training component receives the graph of vectors from the graphing component and leverages the disparate web page features to categorize each aspect of each feature of the page. A page categorizer receives aspect categorization information from the co-training component and categorizes the web page based at least upon this information.

Type: Application

Filed: October 29, 2009

Publication date: May 5, 2011

Applicant: Xerox Corporation

Inventor: Boris Chidlovskii
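
The abstract of publication 20110106732 says that vectors for unlabeled features are determined by their proximity in the graph to labeled features. The following is a minimal sketch of that one step only, written as plain iterative label propagation over page links. It is not the patented co-training method: the function name `propagate_labels`, the edge-list input, and the fixed iteration count are all assumptions made for illustration, and the separate content, layout, and metadata views are left out.

```python
def propagate_labels(edges, seed_labels, iterations=20):
    """Spread class scores from labeled pages to their unlabeled neighbors.

    edges: iterable of (page_a, page_b) links, treated as undirected (assumption).
    seed_labels: dict mapping some pages to a known class.
    """
    nodes = set(seed_labels)
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
        nodes.update((a, b))

    classes = sorted(set(seed_labels.values()))
    # One score per class for each page: 1.0 for a seed's own class, else 0.0.
    scores = {
        n: {c: 1.0 if seed_labels.get(n) == c else 0.0 for c in classes}
        for n in nodes
    }

    for _ in range(iterations):
        updated = {}
        for n in nodes:
            if n in seed_labels:
                updated[n] = scores[n]  # labeled pages stay clamped
                continue
            nbrs = neighbors.get(n, ())
            # An unlabeled page takes the average score of its neighbors.
            updated[n] = {
                c: sum(scores[m][c] for m in nbrs) / len(nbrs) if nbrs else 0.0
                for c in classes
            }
        scores = updated

    # Each page gets the class with the highest propagated score.
    return {n: max(classes, key=lambda c: scores[n][c]) for n in nodes}


links = [("a", "b"), ("b", "c"), ("c", "d")]
labels = propagate_labels(links, {"a": "sports", "d": "news"})
```

In this chain of four pages, the two unlabeled pages each end up with the class of the labeled page they sit closer to.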
MULTI-MODALITY CLASSIFICATION FOR ONE-CLASS CLASSIFICATION IN SOCIAL NETWORKS

Publication number: 20110103682

Abstract: A classification apparatus, method, and computer program product for multi-modality classification are disclosed. For each of a plurality of modalities, the method includes extracting features from objects in a set of objects. The objects include electronic mail messages. A representation of each object for that modality is generated, based on its extracted features. At least one of the plurality of modalities is a social network modality in which social network features are extracted from a social network implicit in the set of electronic mail messages. A classifier system is trained based on class labels of a subset of the set of objects and on the representations generated for each of the modalities. With the trained classifier system, labels are predicted for unlabeled objects in the set of objects.

Type: Application

Filed: October 29, 2009

Publication date: May 5, 2011

Applicant: Xerox Corporation

Inventors: Boris Chidlovskii, Matthijs Hovelynck
Stacked generalization learning for document annotation

Patent number: 7890438

Abstract: A document annotation method includes modeling data elements of an input document and dependencies between the data elements as a dependency network. Static features of at least some of the data elements are defined, each expressing a relationship between a characteristic of the data element and its label. Dynamic features are defined which define links between an element and labels of the element and of a second element. Parameters of a collective probabilistic model for the document are learned, each expressing a conditional probability that a first data element should be labeled with information derived from a label of a neighbor data element linked to the first data element by a dynamic feature. The learning includes decomposing a globally trained model into a set of local learning models. The local learning models each employ static features to generate estimations of the neighbor element labels for at least one of the data elements.

Type: Grant

Filed: December 12, 2007

Date of Patent: February 15, 2011

Assignee: Xerox Corporation

Inventor: Boris Chidlovskii
Document alignment systems for legacy document conversions

Patent number: 7882119

Abstract: A method for aligning documents which may be in different XML formats includes inputting source and target leaves of source and target documents in first and second tree-structured formats and assigning a cost to each of a plurality of matches. Each match may include a source leaf and a target leaf or be an unmatched source or target leaf. Matches are identified for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches. From the identified matches, groups of two or more matches are identified which have a leaf in common. From the groups, probable matches are identified in which more than one target leaf is matched with at least one source leaf or more than one source leaf is matched with a target leaf. An alignment between leaves of the target document and leaves of the source document is output which includes the probable matches.

Type: Grant

Filed: December 22, 2005

Date of Patent: February 1, 2011

Assignee: Xerox Corporation

Inventors: Andre Bergholz, Boris Chidlovskii
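
The abstract of patent 7882119 describes assigning a cost to each candidate match, including unmatched leaves, and keeping the set of matches with minimal total cost. Below is a simplified sketch of that idea, assuming a one-to-one assignment found by exhaustive search, which is only practical for a handful of leaves. The string-similarity cost, the `UNMATCHED_COST` value, and the function names are assumptions chosen for illustration. The later grouping step that yields many-to-one probable matches is not shown.

```python
from difflib import SequenceMatcher
from itertools import permutations

UNMATCHED_COST = 0.6  # assumed penalty for leaving a leaf unmatched


def match_cost(source_leaf, target_leaf):
    """Cost in [0, 1]: 0.0 for identical leaf text, higher for dissimilar text."""
    return 1.0 - SequenceMatcher(None, source_leaf, target_leaf).ratio()


def pair_cost(s, t):
    if s is None and t is None:
        return 0.0  # padding paired with padding costs nothing
    if s is None or t is None:
        return UNMATCHED_COST  # a real leaf left without a partner
    return match_cost(s, t)


def align(source, target):
    """Return the minimal-cost list of (source_leaf, target_leaf) matches.

    None on either side of a pair marks an unmatched leaf.
    """
    size = len(source) + len(target)
    # Pad both sides so that any leaf may stay unmatched.
    src = list(source) + [None] * len(target)
    tgt = list(target) + [None] * len(source)
    best = min(
        permutations(range(size)),
        key=lambda perm: sum(pair_cost(src[i], tgt[j]) for i, j in enumerate(perm)),
    )
    return [
        (src[i], tgt[j])
        for i, j in enumerate(best)
        if src[i] is not None or tgt[j] is not None
    ]


matches = align(["title", "author", "abstract"], ["author", "title"])
```

Here the two leaves with identical text pair up at zero cost and `"abstract"` is reported as unmatched. A real system would replace the exhaustive search with a proper assignment solver.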
SCALABLE INDEXING FOR LAYOUT BASED DOCUMENT RETRIEVAL AND RANKING

Publication number: 20110022599

Abstract: A computer-based method and a system for indexing, querying, and ranking documents based on layout are provided. The method includes providing a plurality of documents to computer memory, extracting layout blocks from the provided documents, clustering the layout blocks into a plurality of layout block clusters, computing a representative block for each of the layout block clusters, generating a document index for each provided document based on the layout blocks of the document and the computed representative blocks, clustering the created document indexes into a plurality of document index clusters, and generating a representative cluster index for each of the document index clusters. The indexes generated, together with the representative blocks and document index clusters, can be stored and used for retrieval of documents responsive to a layout query.

Type: Application

Filed: September 9, 2009

Publication date: January 27, 2011

Applicant: Xerox Corporation

Inventors: Boris Chidlovskii, Loïc M. Lecerf
Method for transforming data elements within a classification system based in part on input from a human annotator/expert

Publication number: 20100306141

Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.

Type: Application

Filed: June 3, 2010

Publication date: December 2, 2010

Applicant: Xerox Corporation

Inventor: Boris Chidlovskii
AVERAGE CASE ANALYSIS FOR EFFICIENT SPATIAL DATA STRUCTURES

Publication number: 20100205181

Abstract: A computer performed method models a spatial index having n spatial regions defined in a multidimensional space using a tree-based model representing an infinite number of arrangements of n spatial regions in the multidimensional space allowable by the spatial index using a finite number of tree representations, computes an average retrieval complexity measure for content retrieval using the spatial index based on the tree based model, and provides a spatial index recommendation based on the average retrieval complexity measure. In some embodiments a spatial index selection module selects the spatial index based on average retrieval complexity measures for candidate spatial indices that are functionally dependent upon a number of spatial regions to be defined by the spatial index.

Type: Application

Filed: February 9, 2009

Publication date: August 12, 2010

Applicant: Xerox Corporation

Inventor: Boris Chidlovskii
Method for transforming data elements within a classification system based in part on input from a human annotator/expert

Patent number: 7756800

Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.

Type: Grant

Filed: December 14, 2006

Date of Patent: July 13, 2010

Assignee: Xerox Corporation

Inventor: Boris Chidlovskii
METHOD OF FEATURE EXTRACTION FROM NOISY DOCUMENTS

Publication number: 20100150448

Abstract: Aspects of the exemplary embodiment relate to a method and apparatus for automatically identifying features that are suitable for use by a classifier in assigning class labels to text sequences extracted from noisy documents. The exemplary method includes receiving a dataset of text sequences, automatically identifying a set of patterns in the text sequences, and filtering the patterns to generate a set of features. The filtering includes at least one of filtering out redundant patterns and filtering out irrelevant patterns. The method further includes outputting at least some of the features in the set of features, optionally after fusing features which are determined not to affect the classifier's accuracy if they are merged.

Type: Application

Filed: December 17, 2008

Publication date: June 17, 2010

Applicant: Xerox Corporation

Inventors: Loic Lecerf, Boris Chidlovskii
Systems and methods for converting legacy and proprietary documents into extended mark-up language format

Patent number: 7730396

Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve decomposing the input document, labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.

Type: Grant

Filed: November 13, 2006

Date of Patent: June 1, 2010

Assignee: Xerox Corporation

Inventors: Boris Chidlovskii, Hervé Déjean
SCALABLE FEATURE SELECTION FOR MULTI-CLASS PROBLEMS

Publication number: 20090271338

Abstract: In a feature filtering approach, a set of relevant features and a set of training objects classified respective to a set of classes are provided. A candidate feature and a second feature are selected from the set of relevant features. An approximate Markov blanket criterion is computed that is indicative of whether the candidate feature is redundant in view of the second feature. The approximate Markov blanket criterion includes at least one dependency on less than the entire set of classes. An optimized set of relevant features is defined, consisting of a sub-set of the set of relevant features from which features indicated as redundant by the selecting and computing are removed.

Type: Application

Filed: April 23, 2008

Publication date: October 29, 2009

Applicant: Xerox Corporation

Inventors: Boris Chidlovskii, Loic Lecerf
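
The abstract of publication 20090271338 refers to an approximate Markov blanket criterion that indicates whether a candidate feature is redundant in view of a second feature. As a rough illustration, the sketch below uses the widely known symmetric-uncertainty form of that criterion from the FCBF feature-selection literature, computed over the full class variable. This is an assumption: the abstract states that the patented criterion includes a dependency on less than the entire set of classes, which is not modeled here. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())


def symmetric_uncertainty(xs, ys):
    """Mutual information normalized to [0, 1] by the two entropies."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * mutual_info / (hx + hy)


def is_redundant(candidate, other, labels):
    """True if `other` forms an approximate Markov blanket for `candidate`.

    FCBF-style test: `other` is at least as informative about the labels
    as `candidate`, and is at least as strongly related to `candidate`
    as `candidate` is to the labels.
    """
    su_candidate = symmetric_uncertainty(candidate, labels)
    su_other = symmetric_uncertainty(other, labels)
    return (
        su_other >= su_candidate
        and symmetric_uncertainty(candidate, other) >= su_candidate
    )


labels = [0, 0, 1, 1]
clean_feature = [0, 0, 1, 1]   # predicts the labels perfectly
noisy_feature = [0, 0, 1, 0]   # a degraded copy of clean_feature
noisy_is_redundant = is_redundant(noisy_feature, clean_feature, labels)
```

With these toy values the noisy copy is flagged as redundant given the clean feature, while the reverse test fails, so only the noisy feature would be removed.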
STACKED GENERALIZATION LEARNING FOR DOCUMENT ANNOTATION

Publication number: 20090157572

Abstract: A document annotation method includes modeling data elements of an input document and dependencies between the data elements as a dependency network. Static features of at least some of the data elements are defined, each expressing a relationship between a characteristic of the data element and its label. Dynamic features are defined which define links between an element and labels of the element and of a second element. Parameters of a collective probabilistic model for the document are learned, each expressing a conditional probability that a first data element should be labeled with information derived from a label of a neighbor data element linked to the first data element by a dynamic feature. The learning includes decomposing a globally trained model into a set of local learning models. The local learning models each employ static features to generate estimations of the neighbor element labels for at least one of the data elements.

Type: Application

Filed: December 12, 2007

Publication date: June 18, 2009

Inventor: Boris Chidlovskii
Semi-supervised visual clustering

Publication number: 20090018995

Abstract: A clustering system includes a visual mapping sub-system configured to display an N-dimensional to two- or three-dimensional mapping of items to be clustered, where N is greater than three, the mapping having mapping parameters for the N-dimensions. A user interface sub-system is configured to receive user inputted values for the mapping parameters, user inputted values selecting whether selected mapping parameters are fixed or adjustable, and user inputted values associating selected items with selected groups. An adjustment sub-system is configured to adjust the adjustable mapping parameters, without adjusting any fixed mapping parameters, to improve a measure of distinctness of one or more groups of items in the two- or three-dimensional mapping.

Type: Application

Filed: July 13, 2007

Publication date: January 15, 2009

Inventors: Boris Chidlovskii, Loic Lecerf
Method for automatic wrapper repair

Patent number: 7440951

Abstract: A method of information extraction from a Web page using a broken wrapper includes using the wrapper to extract strings from the Web page parsed in forward direction; analyzing the extracted strings according to a set of rules for assigning labels associated with the wrapper; assigning labels to those strings which satisfy the label rules; classifying the extracted strings based on content features of the labeled extracted strings; and validating those labeled extracted strings which satisfy the label rules within some threshold value.

Type: Grant

Filed: December 5, 2005

Date of Patent: October 21, 2008

Assignee: Xerox Corporation

Inventor: Boris Chidlovskii
Method for automatic wrapper repair

Patent number: 7440974

Abstract: A method of information extraction from a Web page using an initial wrapper which has become partially inoperative, wherein the initial wrapper comprises an initial set of rules for extracting information and for assigning labels from a wrapper set of labels to the extracted information, includes using the initial set of rules to extract strings from the Web page parsed in forward direction; analyzing the extracted strings according to the initial set of rules for assigning labels associated with the wrapper; assigning labels to those strings which satisfy the label rules; using the initial set of rules to extract strings from the Web page in the backward (opposite) direction; analyzing the extracted strings according to the set of rules for assigning labels associated with the wrapper; and assigning labels to those unlabeled strings which satisfy the label rules.

Type: Grant

Filed: December 5, 2005

Date of Patent: October 21, 2008

Assignee: Xerox Corporation

Inventor: Boris Chidlovskii
System and method for transforming legacy documents into XML documents

Patent number: 7440967

Abstract: A method for converting a legacy document into an XML document, includes decomposing the conversion process into a plurality of individual conversion tasks. A legacy document is decomposed into a plurality of document portions. A target XML schema including a plurality of schema components is provided. Local schema are generated from the target XML schema, wherein each local schema includes at least one of the schema components in the target XML schema. A plurality of conversion tasks is generated by associating a local schema and an applicable document portion, wherein each conversion task associates data from the applicable document portion with the applicable schema component in the local schema. For each conversion task, a conversion method is selected and the conversion method is performed on the applicable document portion and local schema. Finally, the results of all the individual conversion tasks are assembled into a target XML document.

Type: Grant

Filed: November 10, 2004

Date of Patent: October 21, 2008

Assignee: Xerox Corporation

Inventor: Boris Chidlovskii
Active learning methods for evolving a classifier

Publication number: 20080147574

Abstract: A method and system are provided for classifying data items such as a document based upon identification of element instances within the data item. A training set of classes is provided where each class is associated with one or more features indicative of accurate identification of an element instance within the data item. Upon the identification of the data item with the training set, a confidence factor is computed that the selected element instance is accurately identified. When a selected element instance has a low confidence factor, the associated features for the predicted class are changed by an annotator/expert so that the changed class definition of the new associated feature provides a higher confidence factor of accurate identification of element instances within the data item.

Type: Application

Filed: December 14, 2006

Publication date: June 19, 2008

Inventor: Boris Chidlovskii
System and method for structured document authoring

Patent number: 7296223

Abstract: A method for creating a structured document, wherein a structured document comprises a plurality of content elements wrapped in pairs of tags, includes parsing a document of a particular type containing content into a plurality of content elements; and for each content element, suggesting an optimal tag according to a tag suggestion procedure. The tag suggestion procedure includes providing sample data which has been converted into a structured sample document; deriving a set of tags from the structured sample document; evaluating the set of tags according to tag suggestion criteria to determine an optimal tag for the content element. The optimal tag may be a single tag or a pattern of tags which maximizes a similarity function with patterns found in the sample data.

Type: Grant

Filed: June 27, 2003

Date of Patent: November 13, 2007

Assignee: Xerox Corporation

Inventors: Boris Chidlovskii, Hervé Déjean
Interactive learning-based document annotation

Publication number: 20070150801

Abstract: A document annotation system 10 includes a graphical user interface 22 used by an annotator 30 to annotate documents. An active learning component 24 trains an annotation model and proposes annotations to documents based on the annotation model. A request handler 26, 32, 34, 42 conveys annotation requests from the graphical user interface 22 to the active learning component 24, conveys proposed annotations from the active learning component 24 to the graphical user interface 22, and selectably conveys evaluation requests from the graphical user interface 22 to a domain expert 40. During annotation, at least some low probability proposed annotations are presented to the annotator 30 by the graphical user interface 22. The presented low probability proposed annotations enhance training of the annotation model by the active learning component 24.

Type: Application

Filed: December 23, 2005

Publication date: June 28, 2007

Inventors: Boris Chidlovskii, Thierry Jacquin
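
The abstract of publication 20070150801 notes that at least some low-probability proposed annotations are presented to the annotator to enhance training. A minimal sketch of that selection step follows, assuming each proposal is a `(document_id, label, probability)` tuple. The function name `pick_for_review`, the threshold, and the limit are hypothetical values for illustration and do not come from the application.

```python
def pick_for_review(proposals, threshold=0.6, limit=3):
    """Return the least certain proposed annotations, lowest probability first.

    proposals: list of (document_id, label, probability) tuples (assumed shape).
    threshold: proposals below this probability count as low probability.
    limit: maximum number of proposals to show the annotator at once.
    """
    low_probability = [p for p in proposals if p[2] < threshold]
    return sorted(low_probability, key=lambda p: p[2])[:limit]


proposals = [
    ("d1", "title", 0.95),
    ("d2", "author", 0.40),
    ("d3", "date", 0.55),
    ("d4", "title", 0.20),
]
to_review = pick_for_review(proposals, threshold=0.6, limit=2)
```

With the sample data, the two proposals the model is least sure about (`d4`, then `d2`) are queued for the annotator, and the confident `d1` proposal is not shown.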
Document alignment systems for legacy document conversions

Publication number: 20070150443

Abstract: A method for aligning documents which may be in different XML formats includes inputting source and target leaves of source and target documents in first and second tree-structured formats and assigning a cost to each of a plurality of matches. Each match may include a source leaf and a target leaf or be an unmatched source or target leaf. Matches are identified for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches. From the identified matches, groups of two or more matches are identified which have a leaf in common. From the groups, probable matches are identified in which more than one target leaf is matched with at least one source leaf or more than one source leaf is matched with a target leaf. An alignment between leaves of the target document and leaves of the source document is output which includes the probable matches.

Type: Application

Filed: December 22, 2005

Publication date: June 28, 2007

Inventors: Andre Bergholz, Boris Chidlovskii
