Patents by Inventor Haixun Wang

Haixun Wang has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

SYSTEMS AND METHODS FOR SEQUENTIAL MODELING IN LESS THAN ONE SEQUENTIAL SCAN

Publication number: 20080052255

Abstract: Most recent research on scalable inductive learning over very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.

Type: Application

Filed: October 31, 2007

Publication date: February 28, 2008

Applicant: International Business Machines Corporation

Inventors: Wei Fan, Haixun Wang, Philip Yu
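The abstract above relies on Hoeffding's inequality to justify stopping before the whole dataset has been scanned. The following is a minimal sketch of the standard one-sided bound only, not the patented framework; the function names and the choice of a bounded statistic with range `value_range` are assumptions made for illustration.

```python
import math


def hoeffding_epsilon(value_range: float, n: int, delta: float) -> float:
    """One-sided Hoeffding bound: with probability at least 1 - delta, the
    true mean of a variable bounded in an interval of width value_range
    exceeds the mean of n independent samples by no more than this value."""
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("n must be positive and delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range: float, epsilon: float, delta: float) -> int:
    """Smallest n for which hoeffding_epsilon(value_range, n, delta) <= epsilon."""
    if epsilon <= 0.0 or not 0.0 < delta < 1.0:
        raise ValueError("epsilon must be positive and delta must lie in (0, 1)")
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))
```

Because the bound depends on the number of samples seen and not on the size of the dataset, a learner that only needs its estimate to be within `epsilon` can in principle stop once `samples_needed` examples have been read, which is the sense in which a scan can cover less than the full data.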

Systems and methods for sequential modeling in less than one sequential scan

Patent number: 7337161

Abstract: Most recent research on scalable inductive learning over very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.

Type: Grant

Filed: July 30, 2004

Date of Patent: February 26, 2008

Assignee: International Business Machines Corporation

Inventors: Wei Fan, Haixun Wang, Philip S. Yu

System and method for scalable processing of multi-way data stream correlations

Publication number: 20070288635

Abstract: A computer implemented method, apparatus, and computer usable program code for processing multi-way stream correlations. Stream data are received for correlation. A task is formed for continuously partitioning a multi-way stream correlation workload into smaller workload pieces. Each of the smaller workload pieces may be processed by a single host. The stream data are sent to different hosts for correlation processing.

Type: Application

Filed: May 4, 2006

Publication date: December 13, 2007

Applicant: International Business Machines Corporation

Inventors: Xiaohui Gu, Haixun Wang, Philip Yu

Index Structure for Supporting Structural XML Queries

Publication number: 20070271243

Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.

Type: Application

Filed: July 19, 2007

Publication date: November 22, 2007

Inventors: Wei Fan, Haixun Wang, Philip Yu
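The abstract above reduces XML querying to finding non-contiguous subsequence matches over structure-encoded sequences. The sketch below shows only that matching step in its simplest form. The `(symbol, prefix)` pair encoding is a hypothetical illustration, and the code does not build the ViST index or handle the `*` and `//` wildcards.

```python
from typing import Hashable, Sequence


def is_subsequence(query: Sequence[Hashable], document: Sequence[Hashable]) -> bool:
    """Return True if every element of query appears in document in the
    same relative order, with arbitrary gaps allowed between matches."""
    remaining = iter(document)
    # "item in remaining" advances the iterator past the first match, so
    # each later query element is searched for only after that position.
    return all(item in remaining for item in query)


# Hypothetical encoding: each node becomes (symbol, path of ancestor symbols),
# listed in preorder.
document = [
    ("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
    ("L", "PS"), ("v2", "PSL"),
]
query_in_order = [("P", ""), ("S", "P"), ("L", "PS")]
query_out_of_order = [("L", "PS"), ("S", "P")]
```

Here `query_in_order` matches `document` even though its elements are not adjacent there, while `query_out_of_order` does not, because the order of the encoded nodes is part of the match.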

SYSTEM AND METHOD OF MINING TIME-CHANGING DATA STREAMS USING A DYNAMIC RULE CLASSIFIER HAVING LOW GRANULARITY

Publication number: 20070260568

Abstract: A dynamic rule classifier for mining a data stream includes at least one window for viewing data contained in the data stream and a set of rules for mining the data. Rules are added and the set of rules is updated by algorithms when a drift in a concept within the data occurs, causing unacceptable drops in classification accuracy. The dynamic rule classifier is also implemented as a method and a computer program product.

Type: Application

Filed: April 21, 2006

Publication date: November 8, 2007

Applicant: International Business Machines Corporation

Inventors: Chang-shing Perng, Haixun Wang

Index structure for supporting structural XML queries

Patent number: 7287023

Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.

Type: Grant

Filed: November 26, 2003

Date of Patent: October 23, 2007

Assignee: International Business Machines Corporation

Inventors: Wei Fan, Haixun Wang, Philip Shi-Lung Yu

Space and time efficient XML graph labeling

Publication number: 20070230488

Abstract: There is provided a method for determining reachability between any two nodes within a graph. The inventive method utilizes a dual-labeling scheme. Initially, a spanning tree is defined for a group of nodes within a graph. Each node in the spanning tree is assigned a unique interval-based label that describes its dependency from an ancestor node. Non-tree labels are then assigned to each node in the spanning tree that is connected to another node in the spanning tree by a non-tree link. From these labels, reachability of any two nodes in the spanning tree is determined by using only the interval-based labels and the non-tree labels.

Type: Application

Filed: March 31, 2006

Publication date: October 4, 2007

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Philip Yu, Haixun Wang, Hao He
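The abstract above assigns each spanning-tree node an interval-based label. As a rough sketch of that half of the dual-labeling scheme only, the code below gives each node a half-open preorder interval and tests tree reachability by interval containment. The function names and the dictionary-of-children input are assumptions, and the non-tree labels described in the abstract are not modeled.

```python
from typing import Dict, List, Tuple


def label_tree(children: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Assign each node an interval [start, end) from a preorder traversal,
    so that a node's interval contains the intervals of all its descendants."""
    labels: Dict[str, Tuple[int, int]] = {}
    start: Dict[str, int] = {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        if visited:
            # All descendants have been numbered; close the interval.
            labels[node] = (start[node], counter)
        else:
            start[node] = counter
            counter += 1
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return labels


def tree_reaches(labels: Dict[str, Tuple[int, int]], u: str, v: str) -> bool:
    """True if v lies in the subtree rooted at u (a node reaches itself)."""
    u_start, u_end = labels[u]
    v_start, v_end = labels[v]
    return u_start <= v_start and v_end <= u_end


# Example tree: a -> b, a -> c, b -> d
example_labels = label_tree({"a": ["b", "c"], "b": ["d"]}, "a")
```

With these labels a reachability test along tree edges is a pair of integer comparisons. Answering queries that depend on non-tree links would need the additional labels the abstract describes.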

Methods and apparatus for mining attribute associations

Patent number: 7243100

Abstract: Attribute association discovery techniques that support relational-based data mining are disclosed. In one aspect of the invention, a technique for mining attribute associations in a relational data set comprises the following steps/operations. Multiple items are obtained from the relational data set. Then, attribute associations are discovered using: (i) multi-attribute mining templates formed from at least a portion of the multiple items; and (ii) one or more mining preferences specified by a user. The invention provides a novel architecture for the mining search space so as to exploit the inter-relationships among patterns of different templates. The framework is relational-sensitive and supports interactive and online mining.

Type: Grant

Filed: July 30, 2003

Date of Patent: July 10, 2007

Assignee: International Business Machines Corporation

Inventors: Sheng Ma, Chang-shing Perng, Haixun Wang, Philip Shi-Lung Yu

Systems and methods for fast reachability queries in large graphs

Publication number: 20060271304

Abstract: A method which identifies different types of substructures within a graph and encodes them using techniques suitable to the characteristics of each of them. The method is embodied by an efficient two-phase algorithm, where the first phase identifies and encodes strongly connected components as well as tree substructures, and the second phase encodes the remaining reachability relationships by compressing dense rectangular submatrices in the transitive closure matrix.

Type: Application

Filed: May 31, 2005

Publication date: November 30, 2006

Applicant: IBM Corporation

Inventors: Hao He, Haixun Wang, Philip Yu

System and method for load shedding in data mining and knowledge discovery from stream data

Publication number: 20060184527

Abstract: Load shedding schemes for mining data streams. A scoring function is used to rank the importance of stream elements, and those elements with high importance are investigated. In the context of not knowing the exact feature values of a data stream, the use of a Markov model is proposed herein for predicting the feature distribution of a data stream. Based on the predicted feature distribution, one can make classification decisions to maximize the expected benefits. In addition, there is proposed herein the employment of a quality of decision (QoD) metric to measure the level of uncertainty in decisions and to guide load shedding. A load shedding scheme such as presented herein assigns available resources to multiple data streams to maximize the quality of classification decisions. Furthermore, such a load shedding scheme is able to learn and adapt to changing data characteristics in the data streams.

Type: Application

Filed: February 16, 2005

Publication date: August 17, 2006

Applicant: IBM Corporation

Inventors: Yun Chi, Haixun Wang, Philip Yu

Systems and methods for maintaining closed frequent itemsets over a data stream sliding window

Publication number: 20060174024

Abstract: Towards mining closed frequent itemsets over a sliding window using limited memory space, a synopsis data structure is used to monitor transactions in the sliding window so that one can output the current closed frequent itemsets at any time. Due to time and memory constraints, the synopsis data structure cannot monitor all possible itemsets, but monitoring only frequent itemsets makes it difficult to detect new itemsets when they become frequent. Herein, there is introduced a compact data structure, the closed enumeration tree (CET), to maintain a dynamically selected set of itemsets over a sliding window. The selected itemsets include a boundary between closed frequent itemsets and the rest of the itemsets. Because the boundary is relatively stable, the cost of mining closed frequent itemsets over a sliding window is dramatically reduced to that of mining transactions that can possibly cause boundary movements in the CET.

Type: Application

Filed: January 31, 2005

Publication date: August 3, 2006

Applicant: IBM Corporation

Inventors: Yun Chi, Haixun Wang, Philip Yu

System and method for sequencing XML documents for tree structure indexing

Publication number: 20060161575

Abstract: Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. Herein, there is addressed the problem of query equivalence with respect to this transformation, and there is introduced a performance-oriented principle for sequencing tree structures. With query equivalence, XML queries can be performed through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. There is identified a class of sequencing methods for this purpose, and there is presented a novel subsequence matching algorithm that observes query equivalence. Also introduced is a performance-oriented principle to guide the sequencing of tree structures.

Type: Application

Filed: January 14, 2005

Publication date: July 20, 2006

Applicant: IBM Corporation

Inventors: Wei Fan, Haixun Wang, Philip Yu

Systems and methods for sequential modeling in less than one sequential scan

Publication number: 20060026110

Abstract: Most recent research on scalable inductive learning over very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.

Type: Application

Filed: July 30, 2004

Publication date: February 2, 2006

Applicant: IBM Corporation

Inventors: Wei Fan, Haixun Wang, Philip Yu

System and method for continuous diagnosis of data streams

Publication number: 20060010093

Abstract: In connection with the mining of time-evolving data streams, there is provided a general framework that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to guess the percentage of drifts in the data stream without any labeled data. Exact error can be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, there is preferably re-sampled a small number of true labels to reconstruct the decision tree from the leaf node level.

Type: Application

Filed: June 30, 2004

Publication date: January 12, 2006

Applicant: IBM Corporation

Inventors: Wei Fan, Haixun Wang, Philip Yu

System and method for mining time-changing data streams

Publication number: 20050278322

Abstract: A general framework for mining concept-drifting data streams using weighted ensemble classifiers. An ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., is trained from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. An empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.

Type: Application

Filed: May 28, 2004

Publication date: December 15, 2005

Applicant: IBM Corporation

Inventors: Wei Fan, Haixun Wang, Philip Yu
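The abstract above weights each classifier in the ensemble by its expected accuracy under the evolving data. The sketch below uses accuracy on the most recent labeled chunk as a stand-in for that expectation. This stand-in, the function names, and the representation of classifiers as plain callables are all assumptions for illustration, not the weighting defined in the application.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple

Classifier = Callable[[Sequence[float]], Hashable]
LabeledChunk = List[Tuple[Sequence[float], Hashable]]


def accuracy_weights(classifiers: List[Classifier], recent_chunk: LabeledChunk) -> List[float]:
    """Weight each classifier by its fraction of correct predictions on the
    most recent labeled chunk (assumed proxy for expected accuracy)."""
    if not recent_chunk:
        raise ValueError("recent_chunk must contain at least one labeled example")
    weights = []
    for clf in classifiers:
        correct = sum(1 for x, y in recent_chunk if clf(x) == y)
        weights.append(correct / len(recent_chunk))
    return weights


def weighted_vote(classifiers: List[Classifier], weights: List[float], x: Sequence[float]) -> Hashable:
    """Return the label with the largest total weight among the votes."""
    scores = defaultdict(float)
    for clf, weight in zip(classifiers, weights):
        scores[clf(x)] += weight
    return max(scores, key=scores.get)
```

Recomputing the weights each time a new chunk arrives lets models trained on outdated concepts lose influence without being retrained, which is the intuition behind weighting by recent accuracy.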

Systems and methods for subspace clustering

Publication number: 20050278324

Abstract: Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis). However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality; for instance, customer purchase records and network event logs are usually modeled as data sequences.

Type: Application

Filed: May 31, 2004

Publication date: December 15, 2005

Applicant: IBM Corporation

Inventors: Wei Fan, Haixun Wang, Philip Yu
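The abstract above contrasts clustering by value similarity with clustering by a coherent pattern of rise and fall in a subspace. One simple way to make that contrast concrete, offered here as an assumption and not as the measure used in the application, is to call two objects pattern-similar on a set of dimensions when their difference is nearly constant across those dimensions. The function names and the threshold `delta` are hypothetical.

```python
from typing import Sequence


def pattern_spread(x: Sequence[float], y: Sequence[float], dims: Sequence[int]) -> float:
    """Spread of the per-dimension differences x[d] - y[d] over dims.
    A spread of 0 means the two objects rise and fall in lockstep."""
    if not dims:
        raise ValueError("dims must name at least one dimension")
    diffs = [x[d] - y[d] for d in dims]
    return max(diffs) - min(diffs)


def is_pattern_similar(x: Sequence[float], y: Sequence[float],
                       dims: Sequence[int], delta: float) -> bool:
    """True if x and y differ by a nearly constant offset on dims."""
    return pattern_spread(x, y, dims) <= delta


# Two objects far apart in value but moving together on dimensions 0..2.
obj_a = [1, 5, 3, 9]
obj_b = [11, 15, 13, 0]
```

In this example `obj_a` and `obj_b` are far apart under any value-based distance, yet they follow the same pattern on the subspace `[0, 1, 2]` and diverge once dimension 3 is included, which is why the choice of subspace matters.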

System and method for adaptive pruning

Publication number: 20050131873

Abstract: Disclosed is a method and structure for searching data in databases using an ensemble of models. First the invention performs training. This training orders models within the ensemble in order of prediction accuracy and joins different numbers of models together to form sub-ensembles. The models are joined together in the sub-ensemble in the order of prediction accuracy. Next in the training process, the invention calculates confidence values of each of the sub-ensembles. The confidence is a measure of how closely results from the sub-ensemble will match results from the ensemble. The size of each of the sub-ensembles is variable depending upon the level of confidence, while, to the contrary, the size of the ensemble is fixed. After the training, the invention can make a prediction. First, the invention selects a sub-ensemble that meets a given level of confidence.

Type: Application

Filed: December 16, 2003

Publication date: June 16, 2005

Inventors: Wei Fan, Haixun Wang, Philip Yu

System and method for scalable cost-sensitive learning

Publication number: 20050125434

Abstract: A method (and structure) for processing an inductive learning model for a dataset of examples, includes dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first subset of the N subsets.

Type: Application

Filed: December 3, 2003

Publication date: June 9, 2005

Applicant: International Business Machines Corporation

Inventors: Wei Fan, Haixun Wang, Philip Yu

Near-neighbor search in pattern distance spaces

Publication number: 20050114331

Abstract: Similarity searching techniques are provided. In one aspect, a method for use in finding near-neighbors in a set of objects comprises the following steps. Subspace pattern similarities that the objects in the set exhibit in multi-dimensional spaces are identified. Subspace correlations are defined between two or more of the objects in the set based on the identified subspace pattern similarities for use in identifying near-neighbor objects. A pattern distance index may be created. A method of performing a near-neighbor search of one or more query objects against a set of objects is also provided.

Type: Application

Filed: November 26, 2003

Publication date: May 26, 2005

Applicant: International Business Machines Corporation

Inventors: Haixun Wang, Philip Yu

Index structure for supporting structural XML queries

Publication number: 20050114314

Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.

Type: Application

Filed: November 26, 2003

Publication date: May 26, 2005

Inventors: Wei Fan, Haixun Wang, Philip Yu
