Patents by Inventor Haixun Wang
Haixun Wang has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20080052255
Abstract: Most recent research of scalable inductive learning on very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
Type: Application
Filed: October 31, 2007
Publication date: February 28, 2008
Applicant: International Business Machines Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
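The abstract above relies on Hoeffding's inequality to justify stopping before the whole dataset has been read. The following is a minimal sketch of that bound only, not of the patented framework: the function names and the choice of the one-sided form of the inequality are assumptions made for illustration.

```python
import math


def hoeffding_epsilon(value_range, n, delta):
    """One-sided Hoeffding bound (illustrative helper, not from the patent).

    For n independent samples of a quantity whose values span a range of
    width value_range, the sample mean exceeds the true mean by more than
    the returned epsilon with probability at most delta.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range, epsilon, delta):
    """Smallest n for which hoeffding_epsilon(value_range, n, delta) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))
```

Under these assumptions, a learner that only needs an estimate within epsilon at confidence 1 - delta can stop after samples_needed(...) examples, which may be far fewer than the full dataset.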
-
Patent number: 7337161
Abstract: Most recent research of scalable inductive learning on very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
Type: Grant
Filed: July 30, 2004
Date of Patent: February 26, 2008
Assignee: International Business Machines Corporation
Inventors: Wei Fan, Haixun Wang, Philip S. Yu
-
Publication number: 20070288635
Abstract: A computer implemented method, apparatus, and computer usable program code for processing multi-way stream correlations. Stream data are received for correlation. A task is formed for continuously partitioning a multi-way stream correlation workload into smaller workload pieces. Each of the smaller workload pieces may be processed by a single host. The stream data are sent to different hosts for correlation processing.
Type: Application
Filed: May 4, 2006
Publication date: December 13, 2007
Applicant: International Business Machines Corporation
Inventors: Xiaohui Gu, Haixun Wang, Philip Yu
-
Publication number: 20070271243
Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
Type: Application
Filed: July 19, 2007
Publication date: November 22, 2007
Inventors: Wei Fan, Haixun Wang, Philip Yu
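The idea of answering a tree query by non-contiguous subsequence matching can be sketched as follows. This is an illustration only: the abstract does not define the structure-encoded sequence, so the (label, path-from-root) pair encoding and the names encode and is_subsequence are assumptions, and no index structure is built here.

```python
def encode(tree, prefix=()):
    """Flatten a (label, children) tree into a preorder list of
    (label, path-from-root) pairs. The pair format is an assumed encoding."""
    label, children = tree
    sequence = [(label, prefix)]
    for child in children:
        sequence.extend(encode(child, prefix + (label,)))
    return sequence


def is_subsequence(query_seq, doc_seq):
    """True if query_seq occurs in doc_seq in order, gaps allowed."""
    remaining = iter(doc_seq)
    # "item in remaining" advances the iterator past the first match,
    # so later query items can only match later document items.
    return all(item in remaining for item in query_seq)
```

A query tree is encoded the same way as a document and then tested with is_subsequence(encode(query), encode(document)). Plain subsequence matching of this kind can accept some sequences that do not correspond to a real structural match, so it should be read as a sketch of the principle rather than a complete query processor.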
-
Publication number: 20070260568
Abstract: A dynamic rule classifier for mining a data stream includes at least one window for viewing data contained in the data stream and a set of rules for mining the data. Rules are added and the set of rules is updated by algorithms when a drift in a concept within the data occurs, causing unacceptable drops in classification accuracy. The dynamic rule classifier is also implemented as a method and a computer program product.
Type: Application
Filed: April 21, 2006
Publication date: November 8, 2007
Applicant: International Business Machines Corporation
Inventors: Chang-shing Perng, Haixun Wang
-
Patent number: 7287023
Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
Type: Grant
Filed: November 26, 2003
Date of Patent: October 23, 2007
Assignee: International Business Machines Corporation
Inventors: Wei Fan, Haixun Wang, Philip Shi-Lung Yu
-
Publication number: 20070230488
Abstract: There is provided a method for determining reachability between any two nodes within a graph. The inventive method utilizes a dual-labeling scheme. Initially, a spanning tree is defined for a group of nodes within a graph. Each node in the spanning tree is assigned a unique interval-based label that describes its dependency from an ancestor node. Non-tree labels are then assigned to each node in the spanning tree that is connected to another node in the spanning tree by a non-tree link. From these labels, reachability of any two nodes in the spanning tree is determined by using only the interval-based labels and the non-tree labels.
Type: Application
Filed: March 31, 2006
Publication date: October 4, 2007
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Philip Yu, Haixun Wang, Hao He
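The interval-based half of such a labeling scheme can be sketched as below. This covers only reachability along spanning-tree edges; the non-tree labels described in the abstract are not modeled, and the function names and the half-open [start, end) interval convention are assumptions chosen for illustration.

```python
def label_tree(children, root):
    """Assign each node a half-open interval [start, end) by depth-first
    traversal, so that a node's interval contains those of its descendants.

    children maps a node to the list of its children in the spanning tree.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        labels[node] = (start, counter)

    visit(root)
    return labels


def tree_reaches(labels, u, v):
    """True if v is u itself or a descendant of u in the spanning tree."""
    start_u, end_u = labels[u]
    start_v, _ = labels[v]
    return start_u <= start_v < end_u
```

With these labels a tree-reachability test is a constant-time interval comparison. The recursive traversal is adequate for a sketch, though a very deep tree would need an iterative version.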
-
Patent number: 7243100
Abstract: Attribute association discovery techniques that support relational-based data mining are disclosed. In one aspect of the invention, a technique for mining attribute associations in a relational data set comprises the following steps/operations. Multiple items are obtained from the relational data set. Then, attribute associations are discovered using: (i) multi-attribute mining templates formed from at least a portion of the multiple items; and (ii) one or more mining preferences specified by a user. The invention provides a novel architecture for the mining search space so as to exploit the inter-relationships among patterns of different templates. The framework is relational-sensitive and supports interactive and online mining.
Type: Grant
Filed: July 30, 2003
Date of Patent: July 10, 2007
Assignee: International Business Machines Corporation
Inventors: Sheng Ma, Chang-shing Perng, Haixun Wang, Philip Shi-Lung Yu
-
Publication number: 20060271304
Abstract: A method which identifies different types of substructures within a graph and encodes them using techniques suitable to the characteristics of each of them. The method is embodied by an efficient two-phase algorithm, where the first phase identifies and encodes strongly connected components as well as tree substructures, and the second phase encodes the remaining reachability relationships by compressing dense rectangular submatrices in the transitive closure matrix.
Type: Application
Filed: May 31, 2005
Publication date: November 30, 2006
Applicant: IBM Corporation
Inventors: Hao He, Haixun Wang, Philip Yu
-
Publication number: 20060184527
Abstract: Load shedding schemes for mining data streams. A scoring function is used to rank the importance of stream elements, and those elements with high importance are investigated. In the context of not knowing the exact feature values of a data stream, the use of a Markov model is proposed herein for predicting the feature distribution of a data stream. Based on the predicted feature distribution, one can make classification decisions to maximize the expected benefits. In addition, there is proposed herein the employment of a quality of decision (QoD) metric to measure the level of uncertainty in decisions and to guide load shedding. A load shedding scheme such as presented herein assigns available resources to multiple data streams to maximize the quality of classification decisions. Furthermore, such a load shedding scheme is able to learn and adapt to changing data characteristics in the data streams.
Type: Application
Filed: February 16, 2005
Publication date: August 17, 2006
Applicant: IBM Corporation
Inventors: Yun Chi, Haixun Wang, Philip Yu
-
Publication number: 20060174024
Abstract: Toward mining closed frequent itemsets over a sliding window using limited memory space, a synopsis data structure is used to monitor transactions in the sliding window so that one can output the current closed frequent itemsets at any time. Due to time and memory constraints, the synopsis data structure cannot monitor all possible itemsets, but monitoring only frequent itemsets makes it difficult to detect new itemsets when they become frequent. Herein, there is introduced a compact data structure, the closed enumeration tree (CET), to maintain a dynamically selected set of itemsets over a sliding window. The selected itemsets include a boundary between closed frequent itemsets and the rest of the itemsets. Because the boundary is relatively stable, the cost of mining closed frequent itemsets over a sliding window is dramatically reduced to that of mining transactions that can possibly cause boundary movements in the CET.
Type: Application
Filed: January 31, 2005
Publication date: August 3, 2006
Applicant: IBM Corporation
Inventors: Yun Chi, Haixun Wang, Philip Yu
-
Publication number: 20060161575
Abstract: Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. Herein, there is addressed the problem of query equivalence with respect to this transformation, and there is introduced a performance-oriented principle for sequencing tree structures. With query equivalence, XML queries can be performed through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. There is identified a class of sequencing methods for this purpose, and there is presented a novel subsequence matching algorithm that observes query equivalence. Also introduced is a performance-oriented principle to guide the sequencing of tree structures.
Type: Application
Filed: January 14, 2005
Publication date: July 20, 2006
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20060026110
Abstract: Most recent research of scalable inductive learning on very large streaming datasets focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
Type: Application
Filed: July 30, 2004
Publication date: February 2, 2006
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20060010093
Abstract: In connection with the mining of time-evolving data streams, there is provided a general framework that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to guess the percentage of drifts in the data stream without any labeled data. Exact error can be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, a small number of true labels is preferably re-sampled to reconstruct the decision tree from the leaf node level.
Type: Application
Filed: June 30, 2004
Publication date: January 12, 2006
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20050278322
Abstract: A general framework for mining concept-drifting data streams using weighted ensemble classifiers. An ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., is trained from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. An empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
Type: Application
Filed: May 28, 2004
Publication date: December 15, 2005
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
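Accuracy-weighted voting over an ensemble can be sketched as follows. The abstract does not give the weighting formula, so using each model's plain accuracy on the most recent labeled chunk as its weight is an assumption, as are the function names and the representation of a model as a callable.

```python
from collections import defaultdict


def accuracy(model, chunk):
    """Fraction of (x, y) pairs in chunk for which model(x) equals y."""
    correct = sum(1 for x, y in chunk if model(x) == y)
    return correct / len(chunk)


def weighted_vote(models, weights, x):
    """Return the label with the largest total weight among the models' votes."""
    scores = defaultdict(float)
    for model, weight in zip(models, weights):
        scores[model(x)] += weight
    return max(scores, key=scores.get)
```

In a streaming setting, the weights would be recomputed with accuracy(model, latest_chunk) each time a new labeled chunk arrives, so that models trained on outdated concepts gradually lose influence.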
-
Publication number: 20050278324
Abstract: Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis), etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences.
Type: Application
Filed: May 31, 2004
Publication date: December 15, 2005
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20050131873
Abstract: Disclosed is a method and structure for searching data in databases using an ensemble of models. First the invention performs training. This training orders models within the ensemble in order of prediction accuracy and joins different numbers of models together to form sub-ensembles. The models are joined together in the sub-ensemble in the order of prediction accuracy. Next in the training process, the invention calculates confidence values of each of the sub-ensembles. The confidence is a measure of how closely results from the sub-ensemble will match results from the ensemble. The size of each of the sub-ensembles is variable depending upon the level of confidence, while, to the contrary, the size of the ensemble is fixed. After the training, the invention can make a prediction. First, the invention selects a sub-ensemble that meets a given level of confidence.
Type: Application
Filed: December 16, 2003
Publication date: June 16, 2005
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20050125434
Abstract: A method (and structure) for processing an inductive learning model for a dataset of examples, includes dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first subset of the N subsets.
Type: Application
Filed: December 3, 2003
Publication date: June 9, 2005
Applicant: International Business Machines Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20050114331
Abstract: Similarity searching techniques are provided. In one aspect, a method for use in finding near-neighbors in a set of objects comprises the following steps. Subspace pattern similarities that the objects in the set exhibit in multi-dimensional spaces are identified. Subspace correlations are defined between two or more of the objects in the set based on the identified subspace pattern similarities for use in identifying near-neighbor objects. A pattern distance index may be created. A method of performing a near-neighbor search of one or more query objects against a set of objects is also provided.
Type: Application
Filed: November 26, 2003
Publication date: May 26, 2005
Applicant: International Business Machines Corporation
Inventors: Haixun Wang, Philip Yu
-
Publication number: 20050114314
Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
Type: Application
Filed: November 26, 2003
Publication date: May 26, 2005
Inventors: Wei Fan, Haixun Wang, Philip Yu