Patents by Inventor Haixun Wang

Haixun Wang has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20080052255
    Abstract: Most recent research of scalable inductive learning on very large streaming dataset focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
    Type: Application
    Filed: October 31, 2007
    Publication date: February 28, 2008
    Applicant: International Business Machines Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Patent number: 7337161
    Abstract: Most recent research of scalable inductive learning on very large streaming dataset focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
    Type: Grant
    Filed: July 30, 2004
    Date of Patent: February 26, 2008
    Assignee: International Business Machines Corporation
    Inventors: Wei Fan, Haixun Wang, Philip S. Yu
  • Publication number: 20070288635
    Abstract: A computer implemented method, apparatus, and computer usable program code for processing multi-way stream correlations. Stream data are received for correlation. A task is formed for continuously partitioning a multi-way stream correlation workload into smaller workload pieces. Each of the smaller workload pieces may be processed by a single host. The stream data are sent to different hosts for correlation processing.
    Type: Application
    Filed: May 4, 2006
    Publication date: December 13, 2007
    Applicant: International Business Machines Corporation
    Inventors: Xiaohui Gu, Haixun Wang, Philip Yu
  • Publication number: 20070271243
    Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
    Type: Application
    Filed: July 19, 2007
    Publication date: November 22, 2007
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20070260568
    Abstract: A dynamic rule classifier for mining a data stream includes at least one window for viewing data contained in the data stream and a set of rules for mining the data. Rules are added and the set of rules are updated by algorithms when an drift in a concept within the data occurs, causing unacceptable drops in classification accuracy. The dynamic rule classifier is also implemented as a method and a computer program product.
    Type: Application
    Filed: April 21, 2006
    Publication date: November 8, 2007
    Applicant: International Business Machines Corporation
    Inventors: Chang-shing Perng, Haixun Wang
  • Patent number: 7287023
    Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
    Type: Grant
    Filed: November 26, 2003
    Date of Patent: October 23, 2007
    Assignee: International Business Machines Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Shi-Lung Yu
  • Publication number: 20070230488
    Abstract: There is provided a method for determining reachability between any two nodes within a graph. The inventive method utilizes a dual-labeling scheme. Initially, a spanning tree is defined for a group of nodes within a graph. Each node in the spanning tree is assigned a unique interval-based label, that describes its dependency from an ancestor node. Non-tree labels are then assigned to each node in the spanning tree that is connected to another node in the spanning tree by a non-tree link. From these labels, reachability of any two nodes in the spanning tree is determined by using only the interval-based labels and the non-tree labels.
    Type: Application
    Filed: March 31, 2006
    Publication date: October 4, 2007
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Philip Yu, Haixun Wang, Hao He
  • Patent number: 7243100
    Abstract: Attribute association discovery techniques that support relational-based data mining are disclosed. In one aspect of the invention, a technique for mining attribute associations in a relational data set comprises the following steps/operations. Multiple items are obtained from the relational data set. Then, attribute associations are discovered using: (i) multi-attribute mining templates formed from at least a portion of the multiple items; and (ii) one or more mining preferences specified by a user. The invention provides a novel architecture for the mining search space so as to exploit the inter-relationships among patterns of different templates. The framework is relational-sensitive and supports interactive and online mining.
    Type: Grant
    Filed: July 30, 2003
    Date of Patent: July 10, 2007
    Assignee: International Business Machines Corporation
    Inventors: Sheng Ma, Chang-shing Perng, Haixun Wang, Philip Shi-Lung Yu
  • Publication number: 20060271304
    Abstract: A method which identifies different types of substructures within a graph and encodes them using techniques suitable to the characteristics of each of them. The method is embodied by an efficient two-phase algorithm, where the first phase identifies and encodes strongly connected components as well as tree substructures, and the second phase encodes the remaining reachability relationships by compressing dense rectangular submatrices in the transitive closure matrix.
    Type: Application
    Filed: May 31, 2005
    Publication date: November 30, 2006
    Applicant: IBM Corporation
    Inventors: Hao He, Haixun Wang, Philip Yu
  • Publication number: 20060184527
    Abstract: Load shedding schemes for mining data streams. A scoring function is used to rank the importance of stream elements, and those elements with high importance are investigated. In the context of not knowing the exact feature values of a data stream, the use of a Markov model is proposed herein for predicting the feature distribution of a data stream. Based on the predicted feature distribution, one can make classification decisions to maximize the expected benefits. In addition, there is proposed herein the employment of a quality of decision (QoD) metric to measure the level of uncertainty in decisions and to guide load shedding. A load shedding scheme such as presented herein assigns available resources to multiple data streams to maximize the quality of classification decisions. Furthermore, such a load shedding scheme is able to learn and adapt to changing data characteristics in the data streams.
    Type: Application
    Filed: February 16, 2005
    Publication date: August 17, 2006
    Applicant: IBM Corporation
    Inventors: Yun Chi, Haixun Wang, Philip Yu
  • Publication number: 20060174024
    Abstract: Towards mining closed frequent itemsets over a sliding window using limited memory space, a synopsis data structure to monitor transactions in the sliding window so that one can output the current closed frequent itemsets at any time. Due to time and memory constraints, the synopsis data structure cannot monitor all possible itemsets, but monitoring only frequent itemsets makes it difficult to detect new itemsets when they become frequent. Herein, there is introduced a compact data structure, the closed enumeration tree (CET), to maintain a dynamically selected set of itemsets over a sliding-window. The selected itemsets include a boundary between closed frequent itemsets and the rest of the itemsets Because the boundary is relatively stable, the cost of mining closed frequent itemsets over a sliding window is dramatically reduced to that of mining transactions that can possibly cause boundary movements in the CET.
    Type: Application
    Filed: January 31, 2005
    Publication date: August 3, 2006
    Applicant: IBM Corporation
    Inventors: Yun Chi, Haixun Wang, Philip Yu
  • Publication number: 20060161575
    Abstract: Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. Herein, there is addresed the problem of query equivalence with respect to this transformation, and thereis introduced a performance-oriented principle for sequencing tree structures. With query equivalence, XML queries can be performed through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. There is identified a class of sequencing methods for this purpose, and there is presented a novel subsequence matching algorithm that observe query equivalence. Also introduced is a performance-oriented principle to guide the sequencing of tree structures.
    Type: Application
    Filed: January 14, 2005
    Publication date: July 20, 2006
    Applicant: IBM Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20060026110
    Abstract: Most recent research of scalable inductive learning on very large streaming dataset focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
    Type: Application
    Filed: July 30, 2004
    Publication date: February 2, 2006
    Applicant: IBM Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20060010093
    Abstract: In connection with the mining of time-evolving data streams, a general framework that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to guess the percentage of drifts in the data stream without any labelled data. Exact error can be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, there preferably re-sampled a small number of true labels to reconstruct the decision tree from the leaf node level.
    Type: Application
    Filed: June 30, 2004
    Publication date: January 12, 2006
    Applicant: IBM Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20050278322
    Abstract: A general framework for mining concept-drifting data streams using weighted ensemble classifiers. An ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., is trained from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. An empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
    Type: Application
    Filed: May 28, 2004
    Publication date: December 15, 2005
    Applicant: IBM Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20050278324
    Abstract: Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis), etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences.
    Type: Application
    Filed: May 31, 2004
    Publication date: December 15, 2005
    Applicant: IBM Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20050131873
    Abstract: Disclosed in a method and structure for searching data in databases using an ensemble of models. First the invention performs training. This training orders models within the ensemble in order of prediction accuracy and joins different numbers of models together to form sub-ensembles. The models are joined together in the sub-ensemble in the order of prediction accuracy. Next in the training process, the invention calculates confidence values of each of the sub-ensembles. The confidence is a measure of how closely results form the sub-ensemble will match results from the ensemble. The size of each of the sub-ensembles is variable depending upon the level of confidence, while, to the contrary, the size of the ensemble is fixed. After the training, the invention can make a prediction. First, the invention selects a sub-ensemble that meets a given level of confidence.
    Type: Application
    Filed: December 16, 2003
    Publication date: June 16, 2005
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20050125434
    Abstract: A method (and structure) for processing an inductive learning model for a dataset of examples, includes dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first subset of the N subsets.
    Type: Application
    Filed: December 3, 2003
    Publication date: June 9, 2005
    Applicant: International Business Machines Corporation
    Inventors: Wei Fan, Haixun Wang, Philip Yu
  • Publication number: 20050114331
    Abstract: Similarity searching techniques are provided. In one aspect, a method for use in finding near-neighbors in a set of objects comprises the following steps. Subspace pattern similarities that the objects in the set exhibit in multi-dimensional spaces are identified. Subspace correlations are defined between two or more of the objects in the set based on the identified subspace pattern similarities for use in identifying near-neighbor objects. A pattern distance index may be created. A method of performing a near-neighbor search of one or more query objects against a set of objects is also provided.
    Type: Application
    Filed: November 26, 2003
    Publication date: May 26, 2005
    Applicant: International Business Machines Corporation
    Inventors: Haixun Wang, Philip Yu
  • Publication number: 20050114314
    Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
    Type: Application
    Filed: November 26, 2003
    Publication date: May 26, 2005
    Inventors: Wei Fan, Haixun Wang, Philip Yu