Patents by Inventor Yannick Saillet

Yannick Saillet has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Efficiently finding potential duplicate values in data

Patent number: 10719536

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Grant

Filed: December 7, 2017

Date of Patent: July 21, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
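
The abstract of patent 10719536 above can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the Dice-style score, and the 0.6 threshold are assumptions made for illustration, and the sketch reports candidate pairs rather than forming full groups. Each value becomes an integer bitmap with one bit per distinct bigram, and shared bigrams are counted with a bitwise AND.

```python
from itertools import combinations


def bigrams(value: str) -> set[str]:
    """Return the set of two-character substrings of a normalized value."""
    text = value.lower().strip()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_bitmaps(values: list[str]) -> list[int]:
    """Encode each value as an integer whose set bits mark its bigrams."""
    per_value = [bigrams(v) for v in values]
    # Assign one bit position to every distinct bigram seen in the data.
    positions = {bg: i for i, bg in enumerate(sorted(set().union(*per_value)))}
    return [sum(1 << positions[bg] for bg in bgs) for bgs in per_value]


def similarity(a: int, b: int) -> float:
    """Assumed Dice-style score: shared bigrams relative to all bigrams."""
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 0.0


def candidate_pairs(values: list[str], threshold: float = 0.6) -> list[tuple[str, str]]:
    """Return pairs of values whose bigram similarity reaches the threshold."""
    bitmaps = build_bitmaps(values)
    return [
        (values[i], values[j])
        for i, j in combinations(range(len(values)), 2)
        if similarity(bitmaps[i], bitmaps[j]) >= threshold
    ]


pairs = candidate_pairs(["Jonathan Smith", "Jonathan Smyth", "Maria Lopez"])
```

Here `pairs` holds only the two "Jonathan" spellings, which share 10 of their 12 distinct bigrams each.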
Cognitive data anonymization

Patent number: 10719627

Abstract: A computer implemented method for data anonymization comprises receiving a request for data that needs anonymization. The request comprises at least one field descriptor of data to be retrieved and a usage scenario of a user for the requested data. Then, based on the usage scenario, an anonymization algorithm to be applied to the data that is referred to by the field descriptor is determined. Subsequently, the determined anonymization algorithm is applied to the data that is referred to by the field descriptor. A test is performed as to whether the degree of anonymization fulfills a requirement that is related to the usage scenario. If the requirement is fulfilled, access to the anonymized data is provided.

Type: Grant

Filed: April 23, 2019

Date of Patent: July 21, 2020

Assignee: International Business Machines Corporation

Inventors: Albert Maier, Martin Oberhofer, Yannick Saillet
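
The flow in the abstract of patent 10719627 above (pick an algorithm from the usage scenario, apply it, test the result, then grant access) might be sketched as follows. The scenario names, the two anonymization functions, and the group-size check in the style of k-anonymity are all assumptions; the abstract does not say how the degree of anonymization is measured.

```python
from collections import Counter
from typing import Callable, Optional


def generalize_zip(zip_code: str) -> str:
    """Keep the first three characters of a postal code (assumed algorithm)."""
    return zip_code[:3] + "**"


def suppress(_: str) -> str:
    """Replace the value entirely (assumed algorithm)."""
    return "*"


# Hypothetical policy table: usage scenario -> (algorithm, minimum group size).
POLICIES: dict[str, tuple[Callable[[str], str], int]] = {
    "regional_statistics": (generalize_zip, 2),
    "public_release": (suppress, 3),
}


def request_data(values: list[str], usage_scenario: str) -> Optional[list[str]]:
    """Return anonymized values, or None if the requirement is not met."""
    algorithm, required_k = POLICIES[usage_scenario]
    anonymized = [algorithm(v) for v in values]
    # Test step: every anonymized value must occur at least required_k times.
    smallest_group = min(Counter(anonymized).values(), default=0)
    return anonymized if smallest_group >= required_k else None
```

With these assumed policies, three postal codes that share a prefix pass the test for `regional_statistics`, while two codes with different prefixes are refused.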
METHOD FOR CREATING RUN-TIME EXECUTABLES FOR DATA ANALYSIS FUNCTIONS

Publication number: 20200225941

Abstract: The present disclosure relates to a method for creating run-time executables for data analysis functions. The method comprises, in response to receiving a data analysis request from a user, selecting from a repository of data analysis functions a set of data analysis functions for execution in a hosting environment or on premises of the user. Usage conditions of the set of data analysis functions by the user may be determined. An additional code for applying the determined usage conditions may be created. The selected data analysis functions and the additional code may be compiled, resulting in an executable code. The executable code may be certified. The certified executable code may be deployed or provided for download to a run-time environment for certified executable codes.

Type: Application

Filed: January 15, 2019

Publication date: July 16, 2020

Inventors: Martin Oberhofer, Mike W. Grasselt, Yannick Saillet, Jens P. Seifert
METHOD FOR CREATING RUN-TIME EXECUTABLES FOR DATA ANALYSIS FUNCTIONS

Publication number: 20200225942

Abstract: The present disclosure relates to a method for creating run-time executables for data analysis functions. The method comprises, in response to receiving a data analysis request from a user, selecting from a repository of data analysis functions a set of data analysis functions for execution in a hosting environment or on premises of the user. Usage conditions of the set of data analysis functions by the user may be determined. An additional code for applying the determined usage conditions may be created. The selected data analysis functions and the additional code may be compiled, resulting in an executable code. The executable code may be certified. The certified executable code may be deployed or provided for download to a run-time environment for certified executable codes.

Type: Application

Filed: July 2, 2019

Publication date: July 16, 2020

Inventors: Martin Oberhofer, Mike W. Grasselt, Yannick Saillet, Jens P. Seifert
EFFICIENTLY FINDING POTENTIAL DUPLICATE VALUES IN DATA

Publication number: 20200183954

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Application

Filed: February 14, 2020

Publication date: June 11, 2020

Inventors: Namit Kabra, Yannick Saillet
Processing a data set

Patent number: 10671627

Abstract: Embodiments relate to processing a data set stored in a computer system. In one aspect, a method of processing a data set stored in a computer system includes providing one or more parameters for quantifying data quality of the data set. A processor generates, for each parameter of the one or more parameters, a reference pattern indicating a dysfunctional behavior of the values of the parameter. The data set is processed to obtain values of the one or more parameters. A parameter of the one or more parameters is identified whose obtained values match a corresponding reference pattern of the generated reference patterns. The identified parameter is assigned a resource weight value indicating the amount of processing resources required to fix the dysfunctional behavior of the identified parameter.

Type: Grant

Filed: November 29, 2018

Date of Patent: June 2, 2020

Assignee: International Business Machines Corporation

Inventors: Sebastian Nelke, Martin Oberhofer, Yannick Saillet, Jens Seifert
CLASSIFYING AN UNMANAGED DATASET

Publication number: 20200151155

Abstract: A computer implemented method for classifying at least one source dataset of a computer system. The method may include providing a plurality of associated reference tables organized and associated in accordance with a reference storage model in the computer system. The method may also include calculating, by a data classifier application of the computer system, a first similarity score between the source dataset and a first reference table of the reference tables based on common attributes in the source dataset and a join of the first reference table with at least one further reference table of the reference tables having a relationship with the first reference table. The method may further include classifying, by the data classifier application, the source dataset by determining, using at least the calculated first similarity score, whether the source dataset is organized as the first reference table in accordance with the reference storage model.

Type: Application

Filed: January 10, 2020

Publication date: May 14, 2020

Inventors: Martin Oberhofer, Adapala S. Reddy, Yannick Saillet, Jens Seifert
DATA SAMPLING IN A STORAGE SYSTEM

Publication number: 20200142870

Abstract: A computer-implemented method, computer program product and system for data sampling in a storage system. The storage system includes a dataset comprising records and a buffer. The dataset is scanned record-by-record to determine whether the current record belongs to a random sample. If so, then the current record may be added to a first set of records. Otherwise, at least one storage score may be calculated or determined for the current record using attribute values of the current record. Next, it may be determined whether the buffer includes available size for storing the current record. In case the buffer comprises the available size, the current record may be stored in the buffer. Otherwise, at least part of the buffer may be freed up. A subsample of the dataset may be provided as a result of merging the first set of records and at least part of the buffered records.

Type: Application

Filed: January 6, 2020

Publication date: May 7, 2020

Inventors: Albert Maier, Yannick Saillet, Damir Spisic
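
The scan described in the data sampling abstract above can be sketched as a single pass over the records. This is an illustrative reading, not the claimed method: the sampling rate, the caller-supplied `score` function, and the eviction rule (drop the lowest-scoring entry when the buffer is full) are assumptions.

```python
import heapq
import random
from typing import Any, Callable


def sample_with_buffer(
    records: list[Any],
    score: Callable[[Any], float],
    rate: float = 0.1,
    buffer_size: int = 3,
    seed: int = 0,
) -> list[Any]:
    """Merge a random sample with the highest-scoring buffered records."""
    rng = random.Random(seed)
    random_sample = []  # the "first set of records" from the abstract
    buffer: list[tuple[float, int, Any]] = []  # min-heap ordered by score
    for index, record in enumerate(records):
        if rng.random() < rate:
            random_sample.append(record)
            continue
        # The index breaks score ties so records themselves are never compared.
        entry = (score(record), index, record)
        if len(buffer) < buffer_size:
            heapq.heappush(buffer, entry)
        else:
            # Buffer full: free space by dropping the lowest-scoring entry,
            # which may be the record that was just scanned.
            heapq.heappushpop(buffer, entry)
    return random_sample + [record for _, _, record in buffer]
```

With `rate=0.0` nothing enters the random sample, so the result is simply the `buffer_size` records with the highest scores; with `rate=1.0` every record is sampled and the buffer stays empty.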
Processing data sets in a big data repository

Patent number: 10635486

Abstract: The invention provides for a method for processing a plurality of data sets (105; 106; 108; 110-113; DB1; DB2) in a data repository (104) for storing at least unstructured data, the method comprising: providing (302) a set of agents (150-168), each agent being operable to trigger the processing of one or more of the data sets, the execution of each of said agents being automatically triggered in case one or more conditions assigned to said agent are met, at least one of the conditions relating to the existence, structure, content and/or annotations of the data set whose processing can be triggered by said agent; executing (304) a first one of the agents; updating (306) the annotations (115) of the first data set by the first agent; and executing (308) a second one of the agents, said execution being triggered by the updated annotations of the first data set meeting the conditions of the second agent, thereby triggering a further updating of the annotations of the first data set.

Type: Grant

Filed: August 14, 2018

Date of Patent: April 28, 2020

Assignee: International Business Machines Corporation

Inventors: Albert Maier, Yannick Saillet, Harald C. Smith, Daniel C. Wolfson
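
The agent chaining in the abstract of patent 10635486 above, where one agent's annotation update triggers the next agent, might look like the sketch below. The `Agent` class, the annotation keys, and the rule that each agent fires at most once per data set are assumptions added to keep the example small and guaranteed to terminate.

```python
from dataclasses import dataclass
from typing import Callable

Annotations = dict[str, str]


@dataclass
class Agent:
    name: str
    condition: Callable[[Annotations], bool]  # when the agent may run
    action: Callable[[Annotations], None]     # updates the annotations


def run_agents(agents: list[Agent], annotations: Annotations) -> list[str]:
    """Fire agents until no condition is met; return the firing order."""
    fired: list[str] = []
    progress = True
    while progress:
        progress = False
        for agent in agents:
            if agent.name not in fired and agent.condition(annotations):
                agent.action(annotations)
                fired.append(agent.name)
                progress = True
    return fired


# Hypothetical agents: the profiler can only run once a format is annotated.
agents = [
    Agent("profiler",
          lambda a: a.get("format") == "csv",
          lambda a: a.update(profiled="yes")),
    Agent("detector",
          lambda a: "format" not in a,
          lambda a: a.update(format="csv")),
]
annotations: Annotations = {}
order = run_agents(agents, annotations)
```

Although the profiler is listed first, it fires second, because its condition is only met after the detector has annotated the data set with a format.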
Efficiently finding potential duplicate values in data

Patent number: 10635693

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Grant

Filed: November 11, 2016

Date of Patent: April 28, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
Multiple record linkage algorithm selector

Patent number: 10621492

Abstract: The present disclosure relates to a method for centrally processing data records using a record linkage algorithm. The method comprises providing a centralized master repository for storing data records in a predefined data structure having a set of attributes. At least one clustering metric is provided. Clusters of records may be determined using a clustering function that is based on the at least one clustering metric. For each particular cluster, a set of configuration data for the record linkage algorithm may be defined based on a value of the clustering metric within that particular cluster. The individual data records may be assigned to one or more clusters of the clusters using the clustering metric values and the record linkage algorithm may be applied to a set of two or more individual data records assigned to at least one common cluster using the set of configuration data for the common cluster.

Type: Grant

Filed: October 21, 2016

Date of Patent: April 14, 2020

Assignee: International Business Machines Corporation

Inventors: Martin Oberhofer, Yannick Saillet, Scott Schumacher, Jens P. Seifert
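
One way to picture the record linkage abstract above is to give each cluster its own matching configuration. In this sketch the clustering metric (record completeness), the two cluster names, the per-cluster thresholds, and the use of `difflib.SequenceMatcher` as the linkage comparison are all assumptions; the abstract names none of them.

```python
from difflib import SequenceMatcher
from itertools import combinations

Record = dict[str, str]

# Hypothetical per-cluster configuration for the linkage step.
CONFIG = {"complete": {"threshold": 0.9}, "sparse": {"threshold": 0.7}}


def completeness(record: Record) -> float:
    """Assumed clustering metric: share of non-empty fields besides 'id'."""
    fields = [k for k in record if k != "id"]
    return sum(1 for k in fields if record[k]) / len(fields)


def cluster_of(record: Record) -> str:
    return "complete" if completeness(record) >= 0.75 else "sparse"


def similarity(a: Record, b: Record) -> float:
    """Average string similarity over fields filled in both records."""
    shared = [k for k in a if k != "id" and a[k] and b.get(k)]
    if not shared:
        return 0.0
    return sum(SequenceMatcher(None, a[k], b[k]).ratio() for k in shared) / len(shared)


def link(records: list[Record]) -> list[tuple[str, str]]:
    """Compare records only within a cluster, using that cluster's threshold."""
    clusters: dict[str, list[Record]] = {}
    for record in records:
        clusters.setdefault(cluster_of(record), []).append(record)
    matches = []
    for name, members in clusters.items():
        threshold = CONFIG[name]["threshold"]
        for a, b in combinations(members, 2):
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches
```

The design point is that sparse records are compared with a looser threshold than complete ones, so a single global setting does not have to fit both kinds of data.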
Multiple record linkage algorithm selector

Patent number: 10621493

Abstract: The present disclosure relates to a method for centrally processing data records using a record linkage algorithm. The method comprises providing a centralized master repository for storing data records in a predefined data structure having a set of attributes. At least one clustering metric is provided. Clusters of records may be determined using a clustering function that is based on the at least one clustering metric. For each particular cluster, a set of configuration data for the record linkage algorithm may be defined based on a value of the clustering metric within that particular cluster. The individual data records may be assigned to one or more clusters of the clusters using the clustering metric values and the record linkage algorithm may be applied to a set of two or more individual data records assigned to at least one common cluster using the set of configuration data for the common cluster.

Type: Grant

Filed: January 2, 2018

Date of Patent: April 14, 2020

Assignee: International Business Machines Corporation

Inventors: Martin Oberhofer, Yannick Saillet, Scott Schumacher, Jens P. Seifert
Classifying an unmanaged dataset

Patent number: 10592481

Abstract: A computer implemented method for classifying at least one source dataset of a computer system. The method may include providing a plurality of associated reference tables organized and associated in accordance with a reference storage model in the computer system. The method may also include calculating, by a data classifier application of the computer system, a first similarity score between the source dataset and a first reference table of the reference tables based on common attributes in the source dataset and a join of the first reference table with at least one further reference table of the reference tables having a relationship with the first reference table. The method may further include classifying, by the data classifier application, the source dataset by determining, using at least the calculated first similarity score, whether the source dataset is organized as the first reference table in accordance with the reference storage model.

Type: Grant

Filed: April 6, 2017

Date of Patent: March 17, 2020

Assignee: International Business Machines Corporation

Inventors: Martin Oberhofer, Adapala S. Reddy, Yannick Saillet, Jens Seifert
Computing the need for standardization of a set of values

Patent number: 10585865

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry out or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Grant

Filed: December 5, 2017

Date of Patent: March 10, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
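
The decision logic in the standardization abstract above (use a metadata indication if present, otherwise compute a score and compare it to a criterion) can be sketched as follows. The `standardize` rule, the `"standardize"` metadata key, the score formula (relative drop in distinct values after standardization), and the 0.25 criterion are assumptions for illustration only.

```python
from typing import Optional


def standardize(value: str) -> str:
    """Assumed standardization rule: collapse whitespace and upper-case."""
    return " ".join(value.split()).upper()


def standardization_score(values: list[str], metadata: Optional[dict] = None) -> float:
    """Score in [0, 1]; higher means standardization is more worthwhile."""
    metadata = metadata or {}
    # An explicit indication in the attribute metadata short-circuits scoring.
    if "standardize" in metadata:
        return 1.0 if metadata["standardize"] else 0.0
    distinct_before = len(set(values))
    if distinct_before == 0:
        return 0.0
    distinct_after = len({standardize(v) for v in values})
    return 1 - distinct_after / distinct_before


def needs_standardization(
    values: list[str], metadata: Optional[dict] = None, criterion: float = 0.25
) -> bool:
    """Compare the score with a predefined criterion."""
    return standardization_score(values, metadata) >= criterion
```

For `["IBM", "ibm", " Ibm ", "Oracle"]` the four distinct values collapse to two, giving a score of 0.5, which exceeds the assumed criterion.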
Computing the need for standardization of a set of values

Patent number: 10585864

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry out or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Grant

Filed: November 11, 2016

Date of Patent: March 10, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
Method and system for deduplicating data

Patent number: 10540336

Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.

Type: Grant

Filed: September 26, 2016

Date of Patent: January 21, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
Data sampling in a storage system

Patent number: 10534763

Abstract: A computer-implemented method, computer program product and system for data sampling in a storage system. The storage system includes a dataset comprising records and a buffer. The dataset is scanned record-by-record to determine whether the current record belongs to a random sample. If so, then the current record may be added to a first set of records. Otherwise, at least one storage score may be calculated or determined for the current record using attribute values of the current record. Next, it may be determined whether the buffer includes available size for storing the current record. In case the buffer comprises the available size, the current record may be stored in the buffer. Otherwise, at least part of the buffer may be freed up. A subsample of the dataset may be provided as a result of merging the first set of records and at least part of the buffered records.

Type: Grant

Filed: May 10, 2019

Date of Patent: January 14, 2020

Assignee: International Business Machines Corporation

Inventors: Albert Maier, Yannick Saillet, Damir Spisic
Data sampling in a storage system

Patent number: 10534762

Abstract: A computer-implemented method, computer program product and system for data sampling in a storage system. The storage system includes a dataset comprising records and a buffer. The dataset is scanned record-by-record to determine whether the current record belongs to a random sample. If so, then the current record may be added to a first set of records. Otherwise, at least one storage score may be calculated or determined for the current record using attribute values of the current record. Next, it may be determined whether the buffer includes available size for storing the current record. In case the buffer comprises the available size, the current record may be stored in the buffer. Otherwise, at least part of the buffer may be freed up. A subsample of the dataset may be provided as a result of merging the first set of records and at least part of the buffered records.

Type: Grant

Filed: May 10, 2019

Date of Patent: January 14, 2020

Assignee: International Business Machines Corporation

Inventors: Albert Maier, Yannick Saillet, Damir Spisic
Method and system for deduplicating data

Patent number: 10528534

Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.

Type: Grant

Filed: November 28, 2017

Date of Patent: January 7, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
COMPUTING THE NEED FOR STANDARDIZATION OF A SET OF VALUES

Publication number: 20190377715

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry out or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Application

Filed: August 22, 2019

Publication date: December 12, 2019

Inventors: Namit Kabra, Yannick Saillet
