Patents by Inventor Philip Yu
Philip Yu has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20060101056
Abstract: Techniques are provided for performing structural joins for answering containment queries. Such inventive techniques may be used to perform efficient structural joins of two interval lists which are neither sorted nor pre-indexed. For example, in an illustrative aspect of the invention, a technique for performing structural joins of two element sets of a tree-structured document, wherein one of the two element sets is an ancestor element set and the other of the two element sets is a descendant element set, and further wherein each element is represented as an interval representing a start position and an end position of the element in the document, comprises the following steps/operations. An index is dynamically built for the ancestor element set. Then, one or more structural joins are performed by searching the index with the interval start position of each element in the descendant element set.
Type: Application
Filed: November 5, 2004
Publication date: May 11, 2006
Applicant: International Business Machines Corporation
Inventors: Shyh-Kwei Chen, Kun-Lung Wu, Philip Yu
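The join described in this abstract can be illustrated with a minimal Python sketch. It assumes elements are (start, end) integer pairs from a properly nested document, so an ancestor contains a descendant exactly when the descendant's start position falls strictly inside the ancestor's interval. The sorted list searched with `bisect` is only a stand-in for the dynamically built index; the abstract does not say what index structure the application uses, and the names `build_index` and `structural_join` are hypothetical.

```python
from bisect import bisect_left

def build_index(ancestors):
    """Build a simple index over the ancestor set at join time.

    Assumption: a list sorted by start position stands in for the
    index named in the abstract.
    """
    ordered = sorted(ancestors)
    starts = [start for start, _ in ordered]
    return starts, ordered

def structural_join(ancestors, descendants):
    """Return (ancestor, descendant) pairs where the ancestor contains the descendant."""
    starts, ordered = build_index(ancestors)
    pairs = []
    for descendant in descendants:
        d_start = descendant[0]
        # Only ancestors that begin before the descendant's start can contain it.
        upper = bisect_left(starts, d_start)
        for ancestor in ordered[:upper]:
            # With properly nested intervals, start < d_start < end implies containment.
            if ancestor[1] > d_start:
                pairs.append((ancestor, descendant))
    return pairs
```

Neither input list has to be sorted or indexed in advance, which matches the setting the abstract describes. The linear scan over candidate ancestors keeps the sketch short and is not meant to reflect the efficiency of the claimed technique.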
-
Publication number: 20060101045
Abstract: Interval query indexing techniques for use in accordance with data stream processing systems are disclosed. For example, in an illustrative aspect of the invention, a technique for use in processing a data stream comprises the following steps/operations. First, an attribute range of query intervals associated with the data stream is partitioned into one or more segments. Then, a set of virtual intervals is defined for each of the one or more segments. A query interval index is then built using the set of virtual intervals. The query interval index may be built by decomposing each query interval into one or more of the virtual intervals, and associating a query identifier with the decomposed virtual intervals.
Type: Application
Filed: November 5, 2004
Publication date: May 11, 2006
Applicant: International Business Machines Corporation
Inventors: Shyh-Kwei Chen, Kun-Lung Wu, Philip Yu
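The steps in this abstract (partition the range, define virtual intervals, decompose each query, attach query identifiers) can be sketched in Python as follows. This is an illustration under assumptions, not the scheme claimed in the application: query intervals are half-open integer ranges, the segment length `SEGMENT` is a hypothetical power of two, and the virtual intervals are taken to be the aligned power-of-two sub-intervals of each segment. All function names are invented for the example.

```python
from collections import defaultdict

SEGMENT = 8  # hypothetical segment length; must be a power of two in this sketch

def decompose(lo, hi):
    """Split the integer interval [lo, hi) into aligned power-of-two virtual intervals.

    Each piece is (start, size) with start % size == 0 and size <= SEGMENT,
    so no piece crosses a segment boundary.
    """
    pieces = []
    while lo < hi:
        size = SEGMENT
        # Shrink until the piece is aligned at lo and does not run past hi.
        while lo % size != 0 or lo + size > hi:
            size //= 2
        pieces.append((lo, size))
        lo += size
    return pieces

def build_index(queries):
    """Map each virtual interval to the identifiers of the queries that use it."""
    index = defaultdict(set)
    for query_id, (lo, hi) in queries.items():
        for piece in decompose(lo, hi):
            index[piece].add(query_id)
    return index

def lookup(index, value):
    """Return identifiers of all queries whose interval covers the stream value."""
    hits = set()
    size = 1
    while size <= SEGMENT:
        # Exactly one aligned virtual interval of each size contains the value.
        hits |= index.get((value - value % size, size), set())
        size *= 2
    return hits
```

With this layout, a lookup for one stream value inspects a fixed number of virtual intervals (one per size) regardless of how many queries are registered.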
-
Publication number: 20060036564
Abstract: Techniques for graph indexing are provided. In one aspect, a method for indexing graphs in a database, the graphs comprising graphic data, comprises the following steps. Frequent subgraphs among one or more of the graphs in the database are identified, the frequent subgraphs appearing in at least a threshold number of the graphs in the database. One or more of the frequent subgraphs are used to create an index of the graphs in the database.
Type: Application
Filed: April 30, 2004
Publication date: February 16, 2006
Applicant: International Business Machines Corporation
Inventors: Xifeng Yan, Philip Yu
-
Publication number: 20060026110
Abstract: Most recent research of scalable inductive learning on very large streaming dataset focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. There is discussed herein a general inductive learning framework that scans the dataset exactly once. Then, there is proposed an extension based on Hoeffding's inequality that scans the dataset less than once. The proposed frameworks are applicable to a wide range of inductive learners.
Type: Application
Filed: July 30, 2004
Publication date: February 2, 2006
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
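For readers unfamiliar with Hoeffding's inequality, the short Python sketch below shows the standard bound that makes "less than one scan" plausible: after n independent observations of a quantity with a bounded range, the observed mean is within epsilon of the true mean with probability at least 1 - delta. The abstract does not say how the application applies the inequality, so this shows only the textbook formula; the function names are hypothetical.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the mean of n observations, each confined to an
    interval of width value_range, lies within epsilon of the true mean
    with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which hoeffding_bound(value_range, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))
```

A learner that only needs its estimate to be within epsilon can stop reading examples once `examples_needed` of them have been seen, which may be well before the end of the dataset.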
-
Publication number: 20060015474
Abstract: Distributed privacy preserving data mining techniques are provided. A first entity of a plurality of entities in a distributed computing environment exchanges summary information with a second entity of the plurality of entities via a privacy-preserving data sharing protocol such that the privacy of the summary information is preserved, the summary information associated with an entity relating to data stored at the entity. The first entity may then mine data based on at least the summary information obtained from the second entity via the privacy-preserving data sharing protocol. The first entity may obtain, from the second entity via the privacy-preserving data sharing protocol, information relating to the number of transactions in which a particular itemset occurs and/or information relating to the number of transactions in which a particular rule is satisfied.
Type: Application
Filed: July 16, 2004
Publication date: January 19, 2006
Applicant: International Business Machines Corporation
Inventors: Charu Aggarwal, Philip Yu
-
Publication number: 20060010093
Abstract: In connection with the mining of time-evolving data streams, a general framework is provided that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to guess the percentage of drifts in the data stream without any labeled data. Exact error can be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, there is preferably re-sampled a small number of true labels to reconstruct the decision tree from the leaf node level.
Type: Application
Filed: June 30, 2004
Publication date: January 12, 2006
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20060004754
Abstract: A technique for classifying data from a test data stream is provided. A stream of training data having class labels is received. One or more class-specific clusters of the training data are determined and stored. At least one test instance of the test data stream is classified using the one or more class-specific clusters.
Type: Application
Filed: June 30, 2004
Publication date: January 5, 2006
Applicant: International Business Machines Corporation
Inventors: Charu Aggarwal, Philip Yu
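One way to picture the class-specific clusters in this abstract is the Python sketch below. It rests on assumptions the abstract does not state: each cluster keeps only a label, a count, and a coordinate sum; a training point joins the nearest cluster of its own class if that cluster's centroid is within a hypothetical `radius`, and otherwise starts a new cluster; a test instance takes the label of the nearest centroid. The function names are invented for illustration.

```python
import math

def centroid(cluster):
    """Mean position of the points absorbed by a cluster."""
    return [s / cluster["n"] for s in cluster["sum"]]

def add_training_point(clusters, point, label, radius=1.0):
    """Fold one labeled training point into the class-specific clusters."""
    best, best_dist = None, radius
    for cluster in clusters:
        if cluster["label"] != label:
            continue  # clusters never mix class labels
        dist = math.dist(centroid(cluster), point)
        if dist <= best_dist:
            best, best_dist = cluster, dist
    if best is None:
        # No same-class cluster is close enough, so start a new one.
        clusters.append({"label": label, "n": 1, "sum": list(point)})
    else:
        best["n"] += 1
        best["sum"] = [s + p for s, p in zip(best["sum"], point)]

def classify(clusters, point):
    """Label a test instance with the class of its nearest cluster centroid."""
    nearest = min(clusters, key=lambda c: math.dist(centroid(c), point))
    return nearest["label"]
```

Because each cluster is a small summary rather than a list of points, the training stream can be consumed one instance at a time without being stored.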
-
Publication number: 20050283511
Abstract: Disclosed is a method of automatically identifying anomalous situations during computerized system operations that records actions performed by the computerized system as features in a history file, automatically creates a model for each feature only from normal data in the history file, performs training by calculating anomaly scores of the features, establishes a threshold to evaluate whether features are abnormal, automatically identifies abnormal actions of the computerized system based on the anomaly scores and said threshold, and periodically repeats the training process.
Type: Application
Filed: September 9, 2003
Publication date: December 22, 2005
Inventors: Wei Fan, Philip Yu
-
Publication number: 20050278324
Abstract: Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including e-Commerce target marketing, bioinformatics (large scale scientific data analysis), and automatic computing (web usage analysis), etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle datasets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences.
Type: Application
Filed: May 31, 2004
Publication date: December 15, 2005
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20050278322
Abstract: A general framework for mining concept-drifting data streams using weighted ensemble classifiers. An ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., is trained from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. An empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.
Type: Application
Filed: May 28, 2004
Publication date: December 15, 2005
Applicant: IBM Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
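The chunk-wise weighted ensemble in this abstract can be sketched in Python as follows. Several simplifications are assumptions made for illustration: the base learner is a one-dimensional nearest-centroid classifier rather than C4.5, RIPPER, or naive Bayes; expected accuracy on future test data is approximated by accuracy on the most recent chunk; and all function names are hypothetical.

```python
from collections import defaultdict

def train_centroids(chunk):
    """Train a toy base model on one chunk of (x, label) pairs: one mean per class."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in chunk:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_one(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def accuracy(model, chunk):
    return sum(predict_one(model, x) == label for x, label in chunk) / len(chunk)

def weighted_ensemble(chunks):
    """Train one model per chunk and weight it by accuracy on the latest chunk.

    Assumption: the latest chunk is the best available proxy for upcoming
    test data. The model trained on that chunk is scored on its own
    training data, which is optimistic; a real system would hold data out.
    """
    latest = chunks[-1]
    models = [train_centroids(chunk) for chunk in chunks]
    return [(model, accuracy(model, latest)) for model in models]

def predict(ensemble, x):
    """Weighted vote over all models in the ensemble."""
    votes = defaultdict(float)
    for model, weight in ensemble:
        votes[predict_one(model, x)] += weight
    return max(votes, key=votes.get)
```

When the concept drifts, models trained on outdated chunks score poorly on the latest chunk and their votes shrink accordingly, so the ensemble follows the current concept without discarding old models outright.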
-
Publication number: 20050246262
Abstract: Interoperability is enabled between participants in a network by determining values associated with a value metric defined for at least a portion of the network. Information flow is directed between two or more of the participants based at least in part on semantic models corresponding to the participants and on the values associated with the value metric. The semantic models may define interactions between the participants and define at least a portion of information produced or consumed by the participants. The determination of the values and the direction of the information flow may be performed multiple times in order to modify the one or more value metrics. The direction of information flow may allow participants to be deleted from the network, may allow participants to be added to the network, or may allow behavior of the participants to be modified.
Type: Application
Filed: April 29, 2004
Publication date: November 3, 2005
Inventors: Charu Aggarwal, Murray Campbell, Yuan-Chi Chang, Matthew Hill, Chung-Sheng Li, Milind Naphade, Sriram Padmanabhan, John Smith, Min Wang, Kun-Lung Wu, Philip Yu
-
Publication number: 20050234877
Abstract: The present invention is directed to a system and a method for generating a temporally ranked set of search results in response to a query. Each result in the set of search results can be ranked temporally or based on the reputation associated with authors of each result and the reputation associated with the repository where each result is located. In temporal ranking, a present importance weight and a future importance weight are assigned to each result. The present importance of each result uses creation date, publication date, in-link dates and search frequency, and the future importance uses an aging factor based on the elapsed time from publication for each search result and a rate at which each search result decreases in importance. Temporal ranking can be applied as a modification of existing and common search engine algorithms, including PageRank and HITS.
Type: Application
Filed: April 8, 2004
Publication date: October 20, 2005
Inventor: Philip Yu
-
Publication number: 20050210027
Abstract: Techniques for monitoring abnormalities in a data stream are provided. A plurality of objects are received from the data stream and one or more clusters are created from these objects. At least a portion of the one or more clusters have statistical data of the respective cluster. It is determined from the statistical data whether one or more abnormalities exist in the data stream.
Type: Application
Filed: March 16, 2004
Publication date: September 22, 2005
Applicant: International Business Machines Corporation
Inventors: Charu Aggarwal, Philip Yu
-
Publication number: 20050193110
Abstract: Techniques are provided for improved serving of content in a distributed data network. In one aspect of the invention, a technique for delivering content in a client-server system based on a request from a client comprises the following steps/operations. The request is obtained. A performance characteristic of at least one server or at least one cache of the client-server system is determined. Then, a level of data accuracy to be delivered to the client in response to the request is determined. The data accuracy determination is based on: (i) the determined performance characteristic of the at least one server or the at least one cache; and (ii) at least one preference associated with the client. The performance characteristic may comprise a load of the at least one server or the at least one cache. The level of data accuracy may comprise a level of personalization to be delivered to the client in response to the request.
Type: Application
Filed: February 27, 2004
Publication date: September 1, 2005
Applicant: International Business Machines Corporation
Inventors: Paul Dantzig, Daniel Dias, Arun Iyengar, Philip Yu
-
Publication number: 20050177545
Abstract: Techniques are provided for representing and managing data and associated relationships. In one aspect of the invention, a technique for managing data associated with a given domain comprises the following steps. A specification of data attributes representing one or more types of data to be managed is maintained. Further, a specification of algorithms representing one or more types of operations performable in accordance with the data attributes is maintained. Still further, a specification of relationships representing relationships between the data attributes and the algorithms is maintained. The data attribute specification, the algorithm specification and the relationship specification are maintained in a storage framework having multiple levels, the multiple levels being specified based on the given domain with which the data being managed is associated. The techniques may be provided in support of service level management.
Type: Application
Filed: February 11, 2004
Publication date: August 11, 2005
Applicant: International Business Machines Corporation
Inventors: Melissa Buco, Rong Chang, Laura Luan, Zon-Yin Shae, Christopher Ward, Joel Wolf, Philip Yu
-
Publication number: 20050131873
Abstract: Disclosed is a method and structure for searching data in databases using an ensemble of models. First the invention performs training. This training orders models within the ensemble in order of prediction accuracy and joins different numbers of models together to form sub-ensembles. The models are joined together in the sub-ensemble in the order of prediction accuracy. Next in the training process, the invention calculates confidence values of each of the sub-ensembles. The confidence is a measure of how closely results from the sub-ensemble will match results from the ensemble. The size of each of the sub-ensembles is variable depending upon the level of confidence, while, to the contrary, the size of the ensemble is fixed. After the training, the invention can make a prediction. First, the invention selects a sub-ensemble that meets a given level of confidence.
Type: Application
Filed: December 16, 2003
Publication date: June 16, 2005
Inventors: Wei Fan, Haixun Wang, Philip Yu
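The training and prediction steps in this abstract can be sketched in Python as follows. The sketch makes assumptions the abstract leaves open: models are plain callables, results are combined by majority vote, and confidence is measured as the fraction of validation examples on which a sub-ensemble agrees with the full ensemble. The names `train`, `predict`, and `vote` are hypothetical.

```python
from collections import Counter

def vote(models, x):
    """Majority vote; ties go to the label produced by the earliest model."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

def train(models, validation):
    """Order models by accuracy and compute a confidence for each sub-ensemble.

    Sub-ensemble k holds the k most accurate models. Its confidence is the
    share of validation examples where it agrees with the full ensemble.
    """
    def model_accuracy(model):
        return sum(model(x) == y for x, y in validation) / len(validation)

    ordered = sorted(models, key=model_accuracy, reverse=True)
    full_results = [vote(ordered, x) for x, _ in validation]
    confidences = []
    for k in range(1, len(ordered) + 1):
        sub_ensemble = ordered[:k]
        agreements = sum(
            vote(sub_ensemble, x) == full
            for (x, _), full in zip(validation, full_results)
        )
        confidences.append(agreements / len(validation))
    return ordered, confidences

def predict(ordered, confidences, x, min_confidence):
    """Predict with the smallest sub-ensemble that meets the confidence level.

    Returns (prediction, sub-ensemble size), or None if min_confidence
    exceeds 1.0, since the full ensemble always has confidence 1.0.
    """
    for k, confidence in enumerate(confidences, start=1):
        if confidence >= min_confidence:
            return vote(ordered[:k], x), k
    return None
```

The practical benefit is that a prediction may need only a few of the most accurate models instead of the whole fixed-size ensemble.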
-
Publication number: 20050125434
Abstract: A method (and structure) for processing an inductive learning model for a dataset of examples, includes dividing the dataset into N subsets of data and developing an estimated learning model for the dataset by developing a learning model for a first subset of the N subsets.
Type: Application
Filed: December 3, 2003
Publication date: June 9, 2005
Applicant: International Business Machines Corporation
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20050114331
Abstract: Similarity searching techniques are provided. In one aspect, a method for use in finding near-neighbors in a set of objects comprises the following steps. Subspace pattern similarities that the objects in the set exhibit in multi-dimensional spaces are identified. Subspace correlations are defined between two or more of the objects in the set based on the identified subspace pattern similarities for use in identifying near-neighbor objects. A pattern distance index may be created. A method of performing a near-neighbor search of one or more query objects against a set of objects is also provided.
Type: Application
Filed: November 26, 2003
Publication date: May 26, 2005
Applicant: International Business Machines Corporation
Inventors: Haixun Wang, Philip Yu
-
Publication number: 20050114314
Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
Type: Application
Filed: November 26, 2003
Publication date: May 26, 2005
Inventors: Wei Fan, Haixun Wang, Philip Yu
-
Publication number: 20050114298
Abstract: The present invention provides an index structure for managing weighted-sequences in large databases. A weighted-sequence is defined as a two-dimensional structure in which each element in the sequence is associated with a weight. A series of network events, for instance, is a weighted-sequence because each event is associated with a timestamp. Querying a large sequence database by events' occurrence patterns is a first step towards understanding the temporal causal relationships among the events. The index structure proposed herein enables the efficient retrieval from the database of all subsequences (contiguous and non-contiguous) that match a given query sequence both by events and by weights. The index structure also takes into consideration the nonuniform frequency distribution of events in the sequence data.
Type: Application
Filed: November 26, 2003
Publication date: May 26, 2005
Inventors: Wei Fan, Chang-Shing Perng, Haixun Wang, Philip Yu