Patents by Inventor Daniel F. Gruhl

Daniel F. Gruhl has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Holistic disambiguation for entity name spotting

Patent number: 8856119

Abstract: A method resolves ambiguous spotted entity names in a data corpus by determining an activation level value for each of a plurality of nodes corresponding to a single ambiguous entity name. The activation levels for each of the nodes may be modified by inputting outside domain knowledge corresponding to the nodes to increase the activation value of the nodes, spotting entity names corresponding to the nodes to increase the activation value of the nodes, searching the data corpus to spot newly posted entity names to increase the activation value of the nodes, and searching the data corpus to reduce or deactivate the activation value of the nodes by eliminating false positives. The ambiguous entity name is assigned to the node determined to have the highest activation level and is then outputted to a user.

Type: Grant

Filed: February 27, 2009

Date of Patent: October 7, 2014

Assignee: International Business Machines Corporation

Inventors: Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl, Jan H. Pieper
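The activation-level idea in the abstract above can be illustrated with a minimal sketch. It is not the patented method: the function name `disambiguate`, the candidate node labels, and the numeric deltas are all hypothetical, and the sketch assumes evidence arrives as simple (node, delta) pairs where positive deltas stand for domain knowledge or corroborating spots and negative deltas stand for eliminated false positives.

```python
def disambiguate(candidates, evidence):
    """Return the candidate node with the highest activation level.

    candidates: node names the ambiguous entity name could refer to.
    evidence: (node, delta) pairs; the deltas are illustrative values only.
    """
    activation = {node: 0.0 for node in candidates}
    for node, delta in evidence:
        if node in activation:
            # A node is never pushed below zero (treated as deactivated).
            activation[node] = max(0.0, activation[node] + delta)
    best = max(activation, key=activation.get)
    return best, activation


candidates = ["Paris (city)", "Paris (person)"]
evidence = [
    ("Paris (city)", 1.0),     # hypothetical outside domain knowledge
    ("Paris (person)", 0.5),   # a spotted related entity name
    ("Paris (city)", 0.75),    # a newly posted corroborating spot
    ("Paris (person)", -0.5),  # a false positive being eliminated
]
best, levels = disambiguate(candidates, evidence)
print(best, levels)
```

The ambiguous name is assigned to whichever node ends with the highest level; ties here fall to the first candidate listed, which is a choice of this sketch rather than anything stated in the abstract.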
Data deduplication for streaming sequential data storage applications

Patent number: 8407193

Abstract: Data deduplication compression in a streaming storage application is provided. The disclosed deduplication process provides a deduplication archive that enables storage of the archive to, and extraction from, a streaming storage medium. One implementation involves compressing fully sequential data stored in a data repository to a sequential streaming storage, by: splitting fully sequential data into data blocks; hashing content of each data block and comparing each hash to an in-memory lookup table for a match, the in-memory lookup table storing all hashes that have been encountered during the compression of the fully sequential data; for each data block without a hash match, adding the data block as a new data block for compression of fully sequential data; and encoding duplicate data blocks using the in-memory lookup table into data segments.

Type: Grant

Filed: January 27, 2010

Date of Patent: March 26, 2013

Assignee: International Business Machines Corporation

Inventors: Daniel F. Gruhl, Jan H. Pieper, Mark A. Smith
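The block-splitting and hash-lookup steps described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented archive format: fixed-size blocks, SHA-256 as the hash, and the names `dedup_compress` and `dedup_extract` are all choices made for the example.

```python
import hashlib


def dedup_compress(data: bytes, block_size: int = 4):
    """Split data into blocks and store each distinct block only once."""
    seen = {}      # in-memory lookup table: block hash -> index of stored block
    blocks = []    # unique blocks in first-seen order
    segments = []  # one stored-block index per input block
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in seen:
            # No hash match: add the block as a new block.
            seen[digest] = len(blocks)
            blocks.append(block)
        # Duplicate or new, the input block is encoded as an index.
        segments.append(seen[digest])
    return blocks, segments


def dedup_extract(blocks, segments) -> bytes:
    """Rebuild the original data by reading the segments sequentially."""
    return b"".join(blocks[i] for i in segments)


data = b"abcdabcdxyz!abcd"
blocks, segments = dedup_compress(data)
print(blocks, segments)
```

Because extraction only walks the segment list front to back, the sketch stays compatible with a sequential medium; a real implementation would also have to handle hash collisions and table size.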
Method and apparatus for data compression

Patent number: 8380688

Abstract: A method, system, and article for compressing an input stream of uncompressed data. The input stream is divided into one or more data segments. A hash is applied to a first data segment, and an offset and length are associated with this first segment. This hash, together with the offset and length data for the first segment, is stored in a hash table. Thereafter, a subsequent segment within the input stream is evaluated and compared with all other hash entries in the hash table, and a reference is written to a prior hash for an identified duplicate segment. The reference includes a new offset location for the subsequent segment. Similarly, a new hash is applied to an identified non-duplicate segment, with the new hash and its corresponding offset stored in the hash table. A compressed output stream of data is created from the hash table retained on storage media.

Type: Grant

Filed: November 6, 2009

Date of Patent: February 19, 2013

Assignee: International Business Machines Corporation

Inventors: Daniel F. Gruhl, Jan H. Pieper, Mark A. Smith
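As a rough illustration of the hash table with offset and length data described in the abstract above, the sketch below writes a back-reference whenever a segment's hash is already in the table. The entry layout (`"lit"` and `"ref"` tuples), the fixed segment size, and the function names are assumptions made for this example; the abstract does not specify them.

```python
import hashlib


def compress(stream: bytes, segment_size: int = 4):
    """Emit literal segments, or (offset, length) references to earlier ones."""
    table = {}  # segment hash -> (offset, length) of its first occurrence
    out = []
    for offset in range(0, len(stream), segment_size):
        segment = stream[offset:offset + segment_size]
        key = hashlib.sha256(segment).digest()
        if key in table:
            # Duplicate segment: reference the prior occurrence.
            out.append(("ref",) + table[key])
        else:
            # Non-duplicate segment: record its hash, offset, and length.
            table[key] = (offset, len(segment))
            out.append(("lit", segment))
    return out


def decompress(entries) -> bytes:
    """Replay the entries; references copy bytes already reconstructed."""
    buf = bytearray()
    for entry in entries:
        if entry[0] == "lit":
            buf += entry[1]
        else:
            _, offset, length = entry
            buf += buf[offset:offset + length]
    return bytes(buf)


stream = b"aaaabbbbaaaacc"
entries = compress(stream)
print(entries)
```

Offsets refer to positions in the original input, which match positions in the reconstructed output because decompression rebuilds the stream byte for byte.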
System for monitoring global online opinions via semantic extraction

Patent number: 8352412

Abstract: A system for transforming domain specific unstructured data into structured data including an intake platform controlled by feedback from a control platform. The intake platform includes an intake acquisition module for acquiring data building baseline data related to a domain and problem of interest, an intake pre-processing module, an intake language module, an intake application descriptors module, and an intake adjudication module. The control platform includes a control data acquisition module, a control data consistency collator, a control auditor, a control event definition and policy repository, an error resolver, and an output that outputs results of the workflow into structured data enabled to be used in data analytics.

Type: Grant

Filed: February 27, 2009

Date of Patent: January 8, 2013

Assignee: International Business Machines Corporation

Inventors: Alfredo Alba, Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl, Jan H. Pieper
DATA INGEST OPTIMIZATION

Publication number: 20120330972

Abstract: Methods and systems for optimizing the retrieval of data from multiple sources are described. A slot map including slots for the storage of data elements can be obtained. The data elements associated with the slots can be prioritized by weighting values with costs of retrieving the data elements from respective data sources. Each value can be associated with a different data element and can indicate a respective degree of importance of the associated data element. Further, the systems and methods can direct the retrieval of data elements from the respective data sources in an order in accordance with the priority of the data elements to optimize the quality of data obtainable within a critical time constraint. In addition, the retrieved data elements can be stored in corresponding slots on a storage medium.

Type: Application

Filed: September 5, 2012

Publication date: December 27, 2012

Applicant: International Business Machines Corporation

Inventors: Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl
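One way to picture the prioritization in the abstract above is a greedy plan that ranks data elements by importance per unit of retrieval cost and stops adding elements once a time budget is used up. This is only a sketch under assumptions: the ratio-based ranking, the function name `plan_retrieval`, and the sample element names and numbers are hypothetical, and the abstract does not say how values and costs are combined.

```python
def plan_retrieval(elements, time_budget):
    """Order retrievals by importance/cost and keep those that fit the budget.

    elements: list of (name, importance, cost_seconds); costs must be > 0.
    """
    ranked = sorted(elements, key=lambda e: e[1] / e[2], reverse=True)
    plan, elapsed = [], 0.0
    for name, importance, cost in ranked:
        if elapsed + cost <= time_budget:
            plan.append(name)
            elapsed += cost
    return plan


# Hypothetical slot map entries: (slot name, importance, retrieval cost).
elements = [
    ("allergies", 9.0, 3.0),
    ("vitals", 8.0, 1.0),
    ("full_history", 6.0, 6.0),
    ("insurance", 2.0, 0.5),
]
plan = plan_retrieval(elements, time_budget=5.0)
print(plan)
```

Retrieved elements would then be written into their corresponding slots in the order given by `plan`.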
VALIDATION OF INGESTED DATA

Publication number: 20120330901

Abstract: Methods and systems for validating ingested data are disclosed. In accordance with the methods and systems, data elements can be received for storage in slots of an individual descriptor in a storage medium. In addition, at least one validation test can be selected based on a weighting of the data elements that indicates a respective degree of importance of the data elements. The selected validation test or tests can be applied to the data elements stored in the slots to generate respective validation results. Further, a validation score indicating a sufficiency of the stored data elements can be generated based on the validation results.

Type: Application

Filed: September 5, 2012

Publication date: December 27, 2012

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl, Kilian M. Pohl
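The weighting-driven test selection and validation score in the abstract above might look something like the following sketch. The threshold `min_weight`, the weighted pass ratio used as the score, and the sample slots and tests are assumptions for illustration only.

```python
def validation_score(slots, weights, tests, min_weight=0.5):
    """Run tests only for sufficiently important slots; return a 0..1 score."""
    total = passed = 0.0
    for name, weight in weights.items():
        if weight < min_weight or name not in tests:
            continue  # test selection is driven by the element's weight
        total += weight
        if tests[name](slots.get(name)):
            passed += weight
    # With no applicable tests, treat the descriptor as sufficient.
    return passed / total if total else 1.0


# Hypothetical descriptor slots, importance weights, and validation tests.
slots = {"age": 42, "email": "not-an-email", "nickname": ""}
weights = {"age": 1.0, "email": 1.0, "nickname": 0.25}
tests = {
    "age": lambda v: isinstance(v, int) and 0 <= v < 150,
    "email": lambda v: isinstance(v, str) and "@" in v,
    "nickname": lambda v: bool(v),
}
score = validation_score(slots, weights, tests)
print(score)
```

Here the low-weight `nickname` slot is never tested, so its empty value does not lower the score.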
DATA INGEST OPTIMIZATION

Publication number: 20120197902

Abstract: Methods and systems for optimizing the retrieval of data from multiple sources are described. A slot map including slots for the storage of data elements can be obtained. The data elements associated with the slots can be prioritized by weighting values with costs of retrieving the data elements from respective data sources. Each value can be associated with a different data element and can indicate a respective degree of importance of the associated data element. Further, the systems and methods can direct the retrieval of data elements from the respective data sources in an order in accordance with the priority of the data elements to optimize the quality of data obtainable within a critical time constraint. In addition, the retrieved data elements can be stored in corresponding slots on a storage medium.

Type: Application

Filed: January 28, 2011

Publication date: August 2, 2012

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl
VALIDATION OF INGESTED DATA

Publication number: 20120197848

Abstract: Methods and systems for validating ingested data are disclosed. In accordance with the methods and systems, data elements can be received for storage in slots of an individual descriptor in a storage medium. In addition, at least one validation test can be selected based on a weighting of the data elements that indicates a respective degree of importance of the data elements. The selected validation test or tests can be applied to the data elements stored in the slots to generate respective validation results. Further, a validation score indicating a sufficiency of the stored data elements can be generated based on the validation results.

Type: Application

Filed: January 28, 2011

Publication date: August 2, 2012

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl, Kilian M. Pohl
DATA DEDUPLICATION FOR STREAMING SEQUENTIAL DATA STORAGE APPLICATIONS

Publication number: 20110185149

Abstract: Data deduplication compression in a streaming storage application is provided. The disclosed deduplication process provides a deduplication archive that enables storage of the archive to, and extraction from, a streaming storage medium. One implementation involves compressing fully sequential data stored in a data repository to a sequential streaming storage, by: splitting fully sequential data into data blocks; hashing content of each data block and comparing each hash to an in-memory lookup table for a match, the in-memory lookup table storing all hashes that have been encountered during the compression of the fully sequential data; for each data block without a hash match, adding the data block as a new data block for compression of fully sequential data; and encoding duplicate data blocks using the in-memory lookup table into data segments.

Type: Application

Filed: January 27, 2010

Publication date: July 28, 2011

Applicant: International Business Machines Corporation

Inventors: Daniel F. Gruhl, Jan H. Pieper, Mark A. Smith
System and method for adaptive content processing and classification in a high-availability environment

Patent number: 7966270

Abstract: The embodiments of the invention provide systems, methods, etc. for adaptive content processing and classification in a high-availability environment. More specifically, a system is provided having a plurality of processing engines and at least one server that classifies data objects on the computer system. The classification includes analyzing the data objects for the presence of a type of content. This can include assigning a score corresponding to the amount of the type of content in each of the data objects. Moreover, the server can remove a data object from the computer system based on the results of the analyzing. The results of the analyzing are stored and the computer system is updated with feedback information. This can include allowing a user to review the results of the analyzing and aggregating reviews of the user into the feedback information.

Type: Grant

Filed: February 23, 2007

Date of Patent: June 21, 2011

Assignee: International Business Machines Corporation

Inventors: Varun Bhagwan, Daniel F. Gruhl, Kevin Haas, Jeffrey A. Kusnitz, Daniel N. Meredith
Method and Apparatus for Data Compression

Publication number: 20110113016

Abstract: A method, system, and article for compressing an input stream of uncompressed data. The input stream is divided into one or more data segments. A hash is applied to a first data segment, and an offset and length are associated with this first segment. This hash, together with the offset and length data for the first segment, is stored in a hash table. Thereafter, a subsequent segment within the input stream is evaluated and compared with all other hash entries in the hash table, and a reference is written to a prior hash for an identified duplicate segment. The reference includes a new offset location for the subsequent segment. Similarly, a new hash is applied to an identified non-duplicate segment, with the new hash and its corresponding offset stored in the hash table. A compressed output stream of data is created from the hash table retained on storage media.

Type: Application

Filed: November 6, 2009

Publication date: May 12, 2011

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Daniel F. Gruhl, Jan H. Pieper, Mark A. Smith
Anonymization of Unstructured Data

Publication number: 20110113049

Abstract: A method for anonymization of unstructured data comprises determining structured references in the unstructured data; populating a table with the structured references; anonymizing the structured references in the table using ontological analysis; and rewriting the structured references in the unstructured data with the anonymized structured references from the table to produce anonymized data. A system for anonymizing unstructured data comprises an entity spotting module configured to determine structured references in the unstructured data and populate a table with the determined structured references; an anonymization module configured to anonymize the structured references in the table using ontological analysis; and a replacement module configured to rewrite the structured references in the unstructured data with the anonymized structured references from the table to produce anonymized data.

Type: Application

Filed: November 9, 2009

Publication date: May 12, 2011

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Matthew A. Davis, Daniel F. Gruhl
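The three steps in the abstract above (spot references, anonymize them in a table, rewrite the text) can be sketched as below. The "ontological analysis" is reduced here to a lookup from a surface form to a category label, which is a strong simplification; the `ontology` dictionary, the placeholder format, and the function name are all hypothetical.

```python
import re


def anonymize(text, ontology):
    """Replace known references with category placeholders.

    ontology: maps a surface form to a more general category label.
    """
    if not ontology:
        return text, {}
    # Longer terms first so overlapping terms prefer the longer match.
    terms = sorted(ontology, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))

    # Step 1: spot references and populate the table.
    table = {}
    for match in pattern.finditer(text):
        table.setdefault(match.group(0), None)

    # Step 2: anonymize each table entry by generalizing to its category.
    counters = {}
    for term in table:
        category = ontology[term]
        counters[category] = counters.get(category, 0) + 1
        table[term] = f"[{category}-{counters[category]}]"

    # Step 3: rewrite the text using the anonymized table entries.
    return pattern.sub(lambda m: table[m.group(0)], text), table


text = "Alice Smith saw Dr. Jones in Boston. Alice Smith returned to Boston."
ontology = {"Alice Smith": "PERSON", "Dr. Jones": "PERSON", "Boston": "CITY"}
anonymized, table = anonymize(text, ontology)
print(anonymized)
```

Keeping the table separate from the rewrite step means the same reference always maps to the same placeholder, so relationships within the text survive anonymization.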
System and method for bulk processing of semi-structured result streams from multiple resources

Patent number: 7877484

Abstract: A system and associated method for bulk processing of semi-structured results streams from many different resources ingest bytes, parse as many bytes as practical, and return to process additional bytes. The system processes network packets as they arrive from a computing resource, creating intermediate results. The intermediate results are held in a stack until sufficient information is accumulated. The system then merges the intermediate results to form a single document model. As network packets at one connection are consumed by the system, the system can select another connection at which packets are waiting for processing. The processing of a result at a connection can be interrupted while the system processes the results at another connection. In this manner, the system is able to utilize one thread to process many incoming results in parallel.

Type: Grant

Filed: April 23, 2004

Date of Patent: January 25, 2011

Assignee: International Business Machines Corporation

Inventors: Roberto Javier Bayardo, Daniel F. Gruhl
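The single-threaded, interruptible processing described in the abstract above can be approximated by keeping per-connection state and parsing whatever is complete each time a packet arrives. The sketch below assumes newline-terminated records and simulates packet arrival with a list; the class and function names are hypothetical, and a real system would read from sockets.

```python
class ConnectionState:
    """Holds unparsed bytes and intermediate results for one connection."""

    def __init__(self):
        self.pending = b""  # bytes received but not yet a complete record
        self.results = []   # intermediate results parsed so far


def feed(state, packet: bytes):
    """Parse as many complete newline-terminated records as practical."""
    state.pending += packet
    *complete, state.pending = state.pending.split(b"\n")
    state.results.extend(record.decode() for record in complete)


def process(packets):
    """packets: iterable of (connection_id, bytes) in arrival order."""
    states = {}
    for conn_id, packet in packets:
        # Work on one connection can pause while another is serviced.
        feed(states.setdefault(conn_id, ConnectionState()), packet)
    # Merge each connection's intermediate results into one document.
    return {conn_id: state.results for conn_id, state in states.items()}


packets = [("A", b"x=1\ny"), ("B", b"foo"), ("A", b"=2\n"), ("B", b"bar\n")]
documents = process(packets)
print(documents)
```

The record `y=2` arrives split across two packets with traffic from connection B in between, yet one thread reassembles it because the partial bytes wait in `pending`.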
HOLISTIC DISAMBIGUATION FOR ENTITY NAME SPOTTING

Publication number: 20100223292

Abstract: A method resolves ambiguous spotted entity names in a data corpus by determining an activation level value for each of a plurality of nodes corresponding to a single ambiguous entity name. The activation levels for each of the nodes may be modified by inputting outside domain knowledge corresponding to the nodes to increase the activation value of the nodes, spotting entity names corresponding to the nodes to increase the activation value of the nodes, searching the data corpus to spot newly posted entity names to increase the activation value of the nodes, and searching the data corpus to reduce or deactivate the activation value of the nodes by eliminating false positives. The ambiguous entity name is assigned to the node determined to have the highest activation level and is then outputted to a user.

Type: Application

Filed: February 27, 2009

Publication date: September 2, 2010

Applicant: International Business Machines Corporation

Inventors: Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl, Jan H. Pieper
SYSTEM FOR MONITORING GLOBAL ONLINE OPINIONS VIA SEMANTIC EXTRACTION

Publication number: 20100223226

Abstract: A system for transforming domain specific unstructured data into structured data including an intake platform controlled by feedback from a control platform. The intake platform includes an intake acquisition module for acquiring data building baseline data related to a domain and problem of interest, an intake pre-processing module, an intake language module, an intake application descriptors module, and an intake adjudication module. The control platform includes a control data acquisition module, a control data consistency collator, a control auditor, a control event definition and policy repository, an error resolver, and an output that outputs results of the workflow into structured data enabled to be used in data analytics.

Type: Application

Filed: February 27, 2009

Publication date: September 2, 2010

Applicant: International Business Machines Corporation

Inventors: Alfredo Alba, Varun Bhagwan, Tyrone W. A. Grandison, Daniel F. Gruhl, Jan H. Pieper
Fast-approximate TFIDF

Patent number: 7730061

Abstract: Our approach seeks to reduce the complexity of this type of calculation through approximation and pre-computation. It is designed to work efficiently with modern relational database constructs for content management. The approach is designed to enable the kinds of highly interactive data-driven visualizations that are the hallmark of third generation business intelligence.

Type: Grant

Filed: September 12, 2008

Date of Patent: June 1, 2010

Assignee: International Business Machines Corporation

Inventors: Daniel F. Gruhl, Christine M. Robson
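The abstract above does not describe the approximation itself, so the sketch below shows only the textbook TF-IDF calculation with the inverse document frequencies computed once up front, as a loose illustration of pre-computation. It should not be read as the patented technique; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def precompute_idf(docs):
    """Compute inverse document frequency once for the whole corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf(doc, idf):
    """Score one document using the pre-computed IDF table."""
    tf = Counter(doc)
    return {term: (count / len(doc)) * idf.get(term, 0.0)
            for term, count in tf.items()}


docs = [["data", "hash"], ["data", "stream"]]
idf = precompute_idf(docs)
scores = tfidf(docs[0], idf)
print(scores)
```

A term that appears in every document, such as `data` here, gets an IDF of zero and so contributes nothing to the score.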
FAST-APPROXIMATE TFIDF

Publication number: 20100070495

Abstract: Our approach seeks to reduce the complexity of this type of calculation through approximation and pre-computation. It is designed to work efficiently with modern relational database constructs for content management. The approach is designed to enable the kinds of highly interactive data-driven visualizations that are the hallmark of third generation business intelligence.

Type: Application

Filed: September 12, 2008

Publication date: March 18, 2010

Applicant: International Business Machines Corporation

Inventors: Daniel F. Gruhl, Christine M. Robson
Content monitoring in a high volume on-line community application

Patent number: 7523138

Abstract: Disclosed are embodiments of a system and method for managing an on-line community. Electronic postings are pre-screened based on one or more metrics to determine a risk value indicative of the likelihood that an individual posting contains objectionable content. These metrics are based on the profile of a poster, including various parameters of the poster and/or the poster's record of objectionable content postings. These metrics can also be based on the social network profile of a poster, including the average of various parameters of other users in the poster's social network and/or a compiled record of objectionable content postings of other users in the poster's social network. If the risk value is relatively low, the posting can be displayed to the on-line community immediately. If the risk value is relatively high, display of the posting can be delayed until further content analysis is completed.

Type: Grant

Filed: January 11, 2007

Date of Patent: April 21, 2009

Assignee: International Business Machines Corporation

Inventors: Daniel F. Gruhl, Kevin Haas
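The pre-screening flow in the abstract above can be sketched with a risk value that blends the poster's own record with the average record of the poster's social network. The linear blend, the `network_weight` and `threshold` values, and the function names are assumptions chosen for the example; the abstract does not give a formula.

```python
def risk_value(poster_rate, network_rates, network_weight=0.5):
    """Blend the poster's objectionable-post rate with the network average."""
    if network_rates:
        network_avg = sum(network_rates) / len(network_rates)
    else:
        network_avg = 0.0
    return (1 - network_weight) * poster_rate + network_weight * network_avg


def route(posting_risk, threshold=0.25):
    """Display low-risk postings at once; hold the rest for content analysis."""
    return "display" if posting_risk < threshold else "hold_for_analysis"


low = risk_value(0.0, [0.0, 0.5])   # clean poster, mixed network
high = risk_value(0.5, [0.5, 0.5])  # poor record and poor network
print(route(low), route(high))
```

The point of pre-screening is that the more expensive content analysis runs only on the held postings, which keeps latency low for the rest.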
SYSTEM AND METHOD FOR ADAPTIVE CONTENT PROCESSING AND CLASSIFICATION IN A HIGH-AVAILABILITY ENVIRONMENT

Publication number: 20080208893

Abstract: The embodiments of the invention provide systems, methods, etc. for adaptive content processing and classification in a high-availability environment. More specifically, a system is provided having a plurality of processing engines and at least one server that classifies data objects on the computer system. The classification includes analyzing the data objects for the presence of a type of content. This can include assigning a score corresponding to the amount of the type of content in each of the data objects. Moreover, the server can remove a data object from the computer system based on the results of the analyzing. The results of the analyzing are stored and the computer system is updated with feedback information. This can include allowing a user to review the results of the analyzing and aggregating reviews of the user into the feedback information.

Type: Application

Filed: February 23, 2007

Publication date: August 28, 2008

Inventors: Varun Bhagwan, Daniel F. Gruhl, Kevin Haas, Jeffrey A. Kusnitz, Daniel N. Meredith
CONTENT MONITORING IN A HIGH VOLUME ON-LINE COMMUNITY APPLICATION

Publication number: 20080177834

Abstract: Disclosed are embodiments of a system and method for managing an on-line community. Electronic postings are pre-screened based on one or more metrics to determine a risk value indicative of the likelihood that an individual posting contains objectionable content. These metrics are based on the profile of a poster, including various parameters of the poster and/or the poster's record of objectionable content postings. These metrics can also be based on the social network profile of a poster, including the average of various parameters of other users in the poster's social network and/or a compiled record of objectionable content postings of other users in the poster's social network. If the risk value is relatively low, the posting can be displayed to the on-line community immediately. If the risk value is relatively high, display of the posting can be delayed until further content analysis is completed.

Type: Application

Filed: March 26, 2008

Publication date: July 24, 2008

Applicant: International Business Machines Corporation

Inventors: Daniel F. Gruhl, Kevin Haas