Patents by Inventor Yannick Saillet

Yannick Saillet has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

PROCESSING DATA SETS IN A BIG DATA REPOSITORY

Publication number: 20180349184

Abstract: The invention provides for a method for processing a plurality of data sets (105; 106; 108; 110-113; DB1; DB2) in a data repository (104) for storing at least unstructured data, the method comprising: providing (302) a set of agents (150-168), each agent being operable to trigger the processing of one or more of the data sets, the execution of each of said agents being automatically triggered in case one or more conditions assigned to said agent are met, at least one of the conditions relating to the existence, structure, content and/or annotations of the data set whose processing can be triggered by said agent; executing (304) a first one of the agents; updating (306) the annotations (115) of the first data set by the first agent; and executing (308) a second one of the agents, said execution being triggered by the updated annotations of the first data set meeting the conditions of the second agent, thereby triggering a further updating of the annotations of the first data set.

Type: Application

Filed: August 14, 2018

Publication date: December 6, 2018

Inventors: Albert Maier, Yannick Saillet, Harald C. Smith, Daniel C. Wolfson
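
Illustrative sketch (not part of the patent record): the agent chaining described in the abstract above can be pictured as a small condition/action loop over a data set's annotations. The `Agent` class, the `run_agents` helper, and the two example agents are hypothetical names chosen for this sketch, and each agent is allowed to fire only once as a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Annotations = Dict[str, str]


@dataclass
class Agent:
    # Hypothetical agent: a name, a trigger condition, and an annotation update.
    name: str
    condition: Callable[[Annotations], bool]
    action: Callable[[Annotations], None]


def run_agents(annotations: Annotations, agents: List[Agent], max_rounds: int = 10) -> List[str]:
    """Fire agents whose conditions hold until nothing changes; return firing order."""
    fired: List[str] = []
    for _ in range(max_rounds):
        progressed = False
        for agent in agents:
            # Simplification: each agent fires at most once.
            if agent.name not in fired and agent.condition(annotations):
                agent.action(annotations)
                fired.append(agent.name)
                progressed = True
        if not progressed:
            break
    return fired


# The first agent annotates the data set; that annotation then satisfies
# the condition of the second agent, which adds a further annotation.
detect_format = Agent(
    "detect_format",
    lambda a: "format" not in a,
    lambda a: a.update(format="csv"),
)
profile_columns = Agent(
    "profile_columns",
    lambda a: a.get("format") == "csv" and "profiled" not in a,
    lambda a: a.update(profiled="yes"),
)

annotations: Annotations = {}
order = run_agents(annotations, [profile_columns, detect_format])
```

Although `profile_columns` is listed first, it can only run after `detect_format` has written the `format` annotation, which mirrors the annotation-driven triggering in the abstract.
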
Method for classifying an unmanaged dataset

Patent number: 10055430

Abstract: A computer-implemented method for classifying at least one source dataset of a computer system. The method may include providing a plurality of associated reference tables organized and associated in accordance with a reference storage model in the computer system. The method may also include calculating, by a data classifier application of the computer system, a first similarity score between the source dataset and a first reference table of the reference tables based on common attributes in the source dataset and a join of the first reference table with at least one further reference table of the reference tables having a relationship with the first reference table. The method may further include classifying, by the data classifier application, the source dataset by determining, using at least the calculated first similarity score, whether the source dataset is organized as the first reference table in accordance with the reference storage model.

Type: Grant

Filed: October 14, 2015

Date of Patent: August 21, 2018

Assignee: International Business Machines Corporation

Inventors: Martin Oberhofer, Adapala S. Reddy, Yannick Saillet, Jens Seifert
Method and system for accessing a set of data tables in a source database

Patent number: 9996558

Abstract: Embodiments relate to accessing a set of data tables in a source database. A set of table categories is provided for tables in the source database and a set of metrics is provided. For each table of the set of the data tables: the set of metrics is evaluated, the evaluated set of metrics is analyzed, and the table is categorized into one of the set of table categories using the result of the analysis. Information indicative of the table category of each table of the set of tables is output, and in response, a request to select data tables of the set of data tables is received according to a part of the table categories for data processing. A subset of data tables of the set of data tables is selected using the table categories for performing the data processing on the subset of data tables.

Type: Grant

Filed: September 3, 2014

Date of Patent: June 12, 2018

Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Sebastian Nelke, Martin Oberhofer, Yannick Saillet, Jens Seifert
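
Illustrative sketch (not part of the patent record): the metric-based categorization and category-based selection in the abstract above might look roughly like the following. The metrics (`row_count`, `inbound_references`), the category names, and the 1000-row cutoff are all assumptions made for this example; the abstract does not name specific metrics or categories.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class TableMetrics:
    # Hypothetical metrics evaluated for each table.
    name: str
    row_count: int
    inbound_references: int


def categorize(m: TableMetrics) -> str:
    """Map evaluated metrics to one of a fixed set of (assumed) categories."""
    if m.row_count == 0:
        return "empty"
    if m.inbound_references > 0 and m.row_count < 1000:
        return "reference"
    return "transactional"


def select_tables(tables: List[TableMetrics], wanted: Set[str]) -> Tuple[Dict[str, str], List[str]]:
    """Categorize every table, then keep those whose category was requested."""
    categories = {t.name: categorize(t) for t in tables}
    selected = [name for name, cat in categories.items() if cat in wanted]
    return categories, selected


tables = [
    TableMetrics("COUNTRY", 250, 4),
    TableMetrics("ORDERS", 2_000_000, 1),
    TableMetrics("TMP_LOAD", 0, 0),
]
categories, selected = select_tables(tables, {"reference", "transactional"})
```
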
COMPUTING THE NEED FOR STANDARDIZATION OF A SET OF VALUES

Publication number: 20180137148

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry out or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Application

Filed: November 11, 2016

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet
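
Illustrative sketch (not part of the patent record): the decision flow in the abstract above (use a metadata indication if present, otherwise compute a score, then compare it to a criterion) could be sketched as below. The `standardize` metadata key, the scoring formula (share of distinct values that collapse after case and whitespace normalization), and the 0.2 threshold are assumptions for illustration only.

```python
from typing import Iterable, Optional


def standardization_score(values: Iterable[Optional[str]], metadata: Optional[dict] = None) -> float:
    """Return a score in [0, 1]; higher means standardization looks more useful."""
    metadata = metadata or {}
    # Step 1: an explicit indication in the attribute metadata wins.
    if "standardize" in metadata:
        return 1.0 if metadata["standardize"] else 0.0
    # Step 2: otherwise run a (hypothetical) scoring algorithm on the values.
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    raw = set(non_empty)
    normalized = {" ".join(v.lower().split()) for v in raw}
    # Fraction of distinct raw values that disappear after normalization.
    return 1.0 - len(normalized) / len(raw)


def needs_standardization(values, metadata=None, threshold: float = 0.2) -> bool:
    """Compare the score to a predefined criterion (here a simple threshold)."""
    return standardization_score(values, metadata) >= threshold


cities = ["New York", "new york", "NEW  YORK", "Boston"]
score = standardization_score(cities)
```
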
EFFICIENTLY FINDING POTENTIAL DUPLICATE VALUES IN DATA

Publication number: 20180137189

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Application

Filed: November 11, 2016

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet
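
Illustrative sketch (not part of the patent record): the bigram bitmap idea in the abstract above can be approximated with Python integers used as bit sets. Each distinct bigram gets one bit position, each value becomes a bitmask, and a bitwise AND counts shared bigrams. The Jaccard-style ratio and the 0.5 threshold are assumptions; the abstract only says the score reflects the number of common bigrams.

```python
from itertools import combinations
from typing import List, Set, Tuple


def bigrams(value: str) -> Set[str]:
    """Return the set of lowercase character bigrams of a value."""
    s = value.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def build_bitmaps(values: List[str]) -> List[int]:
    """Assign one bit per distinct bigram and encode each value as a bitmask."""
    vocab = sorted(set().union(*(bigrams(v) for v in values)))
    index = {bg: i for i, bg in enumerate(vocab)}
    masks = []
    for v in values:
        mask = 0
        for bg in bigrams(v):
            mask |= 1 << index[bg]
        masks.append(mask)
    return masks


def similar_pairs(values: List[str], threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Return pairs whose shared-bigram ratio (assumed Jaccard) meets the threshold."""
    masks = build_bitmaps(values)
    pairs = []
    for i, j in combinations(range(len(values)), 2):
        common = bin(masks[i] & masks[j]).count("1")  # bigrams in both values
        total = bin(masks[i] | masks[j]).count("1")   # bigrams in either value
        if total and common / total >= threshold:
            pairs.append((values[i], values[j]))
    return pairs


candidates = similar_pairs(["Jonathan", "Jonathon", "Maria"])
```

This sketch compares all pairs, so it only shows the bitwise scoring step; it does not attempt the grouping strategy that makes the approach efficient on large data.
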
COMPUTING THE NEED FOR STANDARDIZATION OF A SET OF VALUES

Publication number: 20180137151

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry out or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Application

Filed: December 5, 2017

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet
EFFICIENTLY FINDING POTENTIAL DUPLICATE VALUES IN DATA

Publication number: 20180137193

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Application

Filed: December 7, 2017

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet
MULTIPLE RECORD LINKAGE ALGORITHM SELECTOR

Publication number: 20180121535

Abstract: The present disclosure relates to a method for centrally processing data records using a record linkage algorithm. The method comprises providing a centralized master repository for storing data records in a predefined data structure having a set of attributes. At least one clustering metric is provided. Clusters of records may be determined using a clustering function that is based on the at least one clustering metric. For each particular cluster, a set of configuration data for the record linkage algorithm may be defined based on a value of the clustering metric within that particular cluster. The individual data records may be assigned to one or more clusters of the clusters using the clustering metric values and the record linkage algorithm may be applied to a set of two or more individual data records assigned to at least one common cluster using the set of configuration data for the common cluster.

Type: Application

Filed: January 2, 2018

Publication date: May 3, 2018

Inventors: Martin Oberhofer, Yannick Saillet, Scott Schumacher, Jens P. Seifert
MULTIPLE RECORD LINKAGE ALGORITHM SELECTOR

Publication number: 20180113928

Abstract: The present disclosure relates to a method for centrally processing data records using a record linkage algorithm. The method comprises providing a centralized master repository for storing data records in a predefined data structure having a set of attributes. At least one clustering metric is provided. Clusters of records may be determined using a clustering function that is based on the at least one clustering metric. For each particular cluster, a set of configuration data for the record linkage algorithm may be defined based on a value of the clustering metric within that particular cluster. The individual data records may be assigned to one or more clusters of the clusters using the clustering metric values and the record linkage algorithm may be applied to a set of two or more individual data records assigned to at least one common cluster using the set of configuration data for the common cluster.

Type: Application

Filed: October 21, 2016

Publication date: April 26, 2018

Inventors: Martin Oberhofer, Yannick Saillet, Scott Schumacher, Jens P. Seifert
RULE GENERATION IN A DATA GOVERNANCE FRAMEWORK

Publication number: 20180101538

Abstract: The invention relates to a computer-implemented method for supplementing a data governance framework with one or more new data governance technical rules. The method comprises providing a plurality of expressions and a first mapping. The expressions assign natural language patterns to technical language patterns. The first mapping maps first terms to data sources. A rule generator receives a new natural language (NL) rule comprising one or more natural-language patterns and one or more first terms. The rule generator resolves the new NL rule into one or more new technical rules interpretable by a respective rule engine and stores the one or more technical rules in a rule repository.

Type: Application

Filed: December 11, 2017

Publication date: April 12, 2018

Inventors: Mike W. Grasselt, Yannick Saillet, Marvin Schaefer
MODEL-DRIVEN PROFILING JOB GENERATOR FOR DATA SOURCES

Publication number: 20180096038

Abstract: Embodiments of the present invention disclose generating data profiling jobs for source data in a data processing system, the source data being described by at least one source functional data model. A target functional data model is provided, for describing target data that can be generated from the source data. One or more source functional data models are identified that correspond to the target functional data model. At least one functional source-to-target model mapping is associated to at least one source-target pair based on the target functional data model and identified source functional data models. A physical source-to-target model mapping for at least one source-target pair based on the logical source-to-target model mapping is calculated. For all physical source attributes, the needed data profiling jobs are generated based on the target attribute for analyzing the physical source attributes.

Type: Application

Filed: December 6, 2017

Publication date: April 5, 2018

Inventors: Sebastian Nelke, Martin Oberhofer, Yannick Saillet, Jens P. Seifert
Method and System for Deduplicating Data

Publication number: 20180089233

Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.

Type: Application

Filed: September 26, 2016

Publication date: March 29, 2018

Inventors: Namit Kabra, Yannick Saillet
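
Illustrative sketch (not part of the patent record): the repair-then-filter flow in the abstract above can be sketched with `itertools.product`. Invalid values are replaced by every valid candidate seen for that attribute, and only combinations that pass a consistency check are kept. The `INVALID` markers, the `deduplicate` helper, and the zip-to-city consistency rule are assumptions made for this example.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

INVALID = {None, "", "N/A"}  # assumed markers for invalid attribute values


def deduplicate(
    records: List[Dict[str, str]],
    attributes: Sequence[str],
    consistent: Callable[[Dict[str, str]], bool],
) -> List[Tuple[str, ...]]:
    """Expand invalid values into valid candidates, then keep consistent records."""
    # Valid candidate values per attribute, taken from the record set itself.
    candidates = {
        a: sorted({r[a] for r in records if r[a] not in INVALID}) for a in attributes
    }
    modified = set()
    for r in records:
        options = [[r[a]] if r[a] not in INVALID else candidates[a] for a in attributes]
        # Every combination of candidate replacements becomes a modified record.
        for combo in product(*options):
            modified.add(combo)
    # The set removes exact duplicates; the consistency condition removes bad repairs.
    return sorted(c for c in modified if consistent(dict(zip(attributes, c))))


records = [
    {"name": "Ann", "city": "New York", "zip": "10001"},
    {"name": "Ann", "city": "", "zip": "10001"},
    {"name": "Bob", "city": "Boston", "zip": "02108"},
]
attributes = ("name", "city", "zip")
# Hypothetical consistency rule: a zip code must map to the city seen with it.
zip_to_city = {r["zip"]: r["city"] for r in records if r["city"] not in INVALID}
result = deduplicate(records, attributes, lambda r: zip_to_city.get(r["zip"]) == r["city"])
```

The second record is repaired to match the first and then collapses into it, so two records remain.
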
Method and System for Deduplicating Data

Publication number: 20180089235

Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.

Type: Application

Filed: November 28, 2017

Publication date: March 29, 2018

Inventors: Namit Kabra, Yannick Saillet
COLUMN WEIGHT CALCULATION FOR DATA DEDUPLICATION

Publication number: 20180067973

Abstract: A method to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.

Type: Application

Filed: November 8, 2017

Publication date: March 8, 2018

Inventors: Namit Kabra, Yannick Saillet
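
Illustrative sketch (not part of the patent record): the profile-driven weighting in the abstract above could be approximated as follows. The weight formula (completeness times distinctness) is an assumption for illustration, and `difflib.SequenceMatcher` stands in for whatever value comparison an actual system would use.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence


def column_weights(records: List[Dict[str, str]], columns: Sequence[str]) -> Dict[str, float]:
    """Derive a weight per column from a simple data profile (assumed formula)."""
    weights = {}
    n = len(records)
    for col in columns:
        non_null = [r[col] for r in records if r.get(col) not in (None, "")]
        if not non_null:
            weights[col] = 0.0
            continue
        completeness = len(non_null) / n                    # share of filled values
        distinctness = len(set(non_null)) / len(non_null)   # share of distinct values
        weights[col] = completeness * distinctness
    return weights


def match_score(a: Dict[str, str], b: Dict[str, str], weights: Dict[str, float]) -> float:
    """Weighted average of per-column string similarity between two records."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for col, w in weights.items():
        sim = SequenceMatcher(None, str(a.get(col, "")), str(b.get(col, ""))).ratio()
        score += w * sim
    return score / total


people = [
    {"name": "Ann Lee", "country": "US"},
    {"name": "Anne Lee", "country": "US"},
    {"name": "Bob Ray", "country": "US"},
]
weights = column_weights(people, ["name", "country"])
```

Because every record shares the same country, that column gets a low weight and contributes little to the match score, while the highly distinct name column dominates.
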
MODEL-DRIVEN PROFILING JOB GENERATOR FOR DATA SOURCES

Publication number: 20180039680

Abstract: Embodiments of the present invention disclose generating data profiling jobs for source data in a data processing system, the source data being described by at least one source functional data model. A target functional data model is provided, for describing target data that can be generated from the source data. One or more source functional data models are identified that correspond to the target functional data model. At least one functional source-to-target model mapping is associated to at least one source-target pair based on the target functional data model and identified source functional data models. A physical source-to-target model mapping for at least one source-target pair based on the logical source-to-target model mapping is calculated. For all physical source attributes, the needed data profiling jobs are generated based on the target attribute for analyzing the physical source attributes.

Type: Application

Filed: August 4, 2016

Publication date: February 8, 2018

Inventors: Sebastian Nelke, Martin Oberhofer, Yannick Saillet, Jens P. Seifert
Application cache profiler

Patent number: 9870419

Abstract: In an embodiment of the invention, a method for data profiling incorporating an enterprise service bus (ESB) coupling the target and source systems following an extraction, transformation, and loading (ETL) process for a target system and a source system is provided. The method includes receiving baseline data profiling results obtained during ETL from a source application to a target application, caching the updates, determining current data profiling results within the ESB for cached updates, and triggering an action if a threshold disparity is detected upon the current data profiling results and the baseline data profiling results.

Type: Grant

Filed: February 28, 2012

Date of Patent: January 16, 2018

Assignee: International Business Machines Corporation

Inventors: Sebastian Nelke, Martin Oberhofer, Yannick Saillet, Jens Seifert
Application cache profiler

Patent number: 9870418

Abstract: In an embodiment of the invention, a method for data profiling incorporating an enterprise service bus (ESB) coupling the target and source systems following an extraction, transformation, and loading (ETL) process for a target system and a source system is provided. The method includes receiving baseline data profiling results obtained during ETL from a source application to a target application, caching the updates, determining current data profiling results within the ESB for cached updates, and triggering an action if a threshold disparity is detected upon the current data profiling results and the baseline data profiling results.

Type: Grant

Filed: December 31, 2010

Date of Patent: January 16, 2018

Assignee: International Business Machines Corporation

Inventors: Sebastian Nelke, Martin Oberhofer, Yannick Saillet, Jens Seifert
COLUMN WEIGHT CALCULATION FOR DATA DEDUPLICATION

Publication number: 20170351717

Abstract: A computer system with the capability to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.

Type: Application

Filed: June 2, 2016

Publication date: December 7, 2017

Inventors: Namit Kabra, Yannick Saillet
RULE GENERATION IN A DATA GOVERNANCE FRAMEWORK

Publication number: 20170329788

Abstract: The invention relates to a computer-implemented method for supplementing a data governance framework with one or more new data governance technical rules. The method comprises providing a plurality of expressions and a first mapping. The expressions assign natural language patterns to technical language patterns. The first mapping maps first terms to data sources. A rule generator receives a new natural language (NL) rule comprising one or more natural-language patterns and one or more first terms. The rule generator resolves the new NL rule into one or more new technical rules interpretable by a respective rule engine and stores the one or more technical rules in a rule repository.

Type: Application

Filed: May 10, 2016

Publication date: November 16, 2017

Inventors: Mike W. Grasselt, Yannick Saillet, Marvin Schaefer
Generation of analysis reports using trusted and public distributed file systems

Patent number: 9779266

Abstract: The invention provides for a data processing system comprising an application server comprising at least one processor. Execution of the instructions causes the processor to: receive an analysis request, the analysis request comprising multiple data analysis commands for generating an analysis report descriptive of a structured data file; divide the commands into private analysis commands and public analysis commands; send the private analysis commands to a trusted distributed file system; send a portion of the public analysis commands to a public distributed file system; send a remainder of the public analysis commands to the trusted distributed file system; and generate the analysis report using public analysis results from the public distributed file system and trusted analysis results from the trusted distributed file system.

Type: Grant

Filed: February 26, 2015

Date of Patent: October 3, 2017

Assignee: International Business Machines Corporation

Inventors: Sebastian Nelke, Martin A. Oberhofer, Yannick Saillet, Jens Seifert
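
Illustrative sketch (not part of the patent record): the command routing in the abstract above can be pictured as a simple split. Private commands always go to the trusted system, a portion of the public commands goes to the public system, and the remaining public commands fall back to the trusted system. Treating a command as private when it touches a sensitive column, and limiting the public portion by a fixed capacity, are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

Command = Dict[str, str]


def route_commands(
    commands: List[Command], private_columns: Set[str], public_capacity: int
) -> Tuple[List[Command], List[Command]]:
    """Return (trusted, public) command lists for the two file systems."""
    trusted: List[Command] = []
    public: List[Command] = []
    for cmd in commands:
        # Assumed privacy rule: a command is private if its column is sensitive.
        is_private = cmd["column"] in private_columns
        if not is_private and len(public) < public_capacity:
            public.append(cmd)
        else:
            # Private commands, plus public ones beyond the public portion.
            trusted.append(cmd)
    return trusted, public


commands = [
    {"op": "mean", "column": "salary"},
    {"op": "count", "column": "country"},
    {"op": "min", "column": "age"},
    {"op": "max", "column": "age"},
]
trusted, public = route_commands(commands, private_columns={"salary"}, public_capacity=2)
```
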
