Patents by Inventor Namit Kabra

Namit Kabra has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Efficiently finding potential duplicate values in data

Patent number: 10635693

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Grant

Filed: November 11, 2016

Date of Patent: April 28, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
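The bigram bitmap idea in the abstract above can be illustrated with a minimal sketch. It is not the patented method: the function names, the Jaccard-style ratio, the 0.5 threshold and the pairwise loop are all assumptions made for illustration. The abstract only states that the score reflects the number of common bigrams and that grouping uses bitwise operations.

```python
from itertools import combinations

def bigrams(value):
    """Return the set of two-character substrings of a lowercased value."""
    s = value.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def build_bitmaps(values):
    """Assign one bit position per distinct bigram; return one integer bitmap per value."""
    index = {}
    bitmaps = []
    for v in values:
        bits = 0
        for bg in bigrams(v):
            pos = index.setdefault(bg, len(index))
            bits |= 1 << pos
        bitmaps.append(bits)
    return bitmaps

def similarity(a, b):
    """Common bigrams (bitwise AND) over all bigrams (bitwise OR). Assumed scoring."""
    total = bin(a | b).count("1")
    return bin(a & b).count("1") / total if total else 0.0

def group_duplicates(values, threshold=0.5):
    """Union values whose pairwise score reaches the (assumed) threshold."""
    bitmaps = build_bitmaps(values)
    parent = list(range(len(values)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if similarity(bitmaps[i], bitmaps[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(find(i), []).append(v)
    return [g for g in groups.values() if len(g) > 1]
```

With the values "Jonathan", "Jonathon" and "Maria", the first two share five of eight distinct bigrams and end up in one group, while "Maria" shares none and stays ungrouped. The all-pairs comparison here is quadratic and only serves to show the bitwise scoring.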
Computing the need for standardization of a set of values

Patent number: 10585865

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Grant

Filed: December 5, 2017

Date of Patent: March 10, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
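The decision flow in the abstract above (use a metadata indication if one exists, otherwise run a scoring algorithm, then compare against a criterion) might look like the following sketch. The `standardize` metadata key, the format-shape heuristic and the 0.2 threshold are hypothetical; the abstract does not specify the scoring algorithm.

```python
import re

def shape(value):
    """Collapse a value to a format pattern: letters become 'A', digits become '9'."""
    s = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", s)

def standardization_score(values, metadata=None):
    """Return a score in [0, 1]; higher means standardization is more likely to help."""
    metadata = metadata or {}
    # Step 1: an explicit indication in the attribute metadata wins (hypothetical key).
    if "standardize" in metadata:
        return 1.0 if metadata["standardize"] else 0.0
    # Step 2: otherwise run a scoring algorithm. Assumed heuristic: the share of
    # values whose format differs from the most common format.
    if not values:
        return 0.0
    shapes = [shape(v) for v in values]
    most_common = max(set(shapes), key=shapes.count)
    return 1.0 - shapes.count(most_common) / len(shapes)

def needs_standardization(values, metadata=None, threshold=0.2):
    """Step 3: compare the score to a predefined criterion (assumed threshold)."""
    return standardization_score(values, metadata) > threshold
```

For example, four phone numbers written in three different formats score 0.5 and exceed the threshold, whereas the same values with `{"standardize": False}` in the metadata score 0.0 without the heuristic being run.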
Computing the need for standardization of a set of values

Patent number: 10585864

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Grant

Filed: November 11, 2016

Date of Patent: March 10, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet

GENERATING WEIGHTS FOR FINDING DUPLICATE RECORDS

Publication number: 20200026782

Abstract: Data-deduplicating includes comparing a first record of a data-store with a second record of the data-store but instead of using a static weight for a field, the present data-deduplicating dynamically assigns a first weight for the first score to generate a first weighted score, wherein the first weight is based on one or both of the first value or the second value; and assigns a second weight for the second score to generate a second weighted score. A composite score is calculated based on the first weighted score and the second weighted score; and it is determined whether or not the first record and the second record are duplicate records, based on the composite score.

Type: Application

Filed: July 17, 2018

Publication date: January 23, 2020

Inventors: Namit Kabra, Manish A. Bhide
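As a rough illustration of value-dependent weighting as described in the abstract above, the sketch below lowers a field's weight when the compared values are common in the data store. The frequency-based weight formula, the use of `difflib.SequenceMatcher` as the field score, and the 0.8 threshold are assumptions; the abstract only says that the weight is based on one or both of the compared values.

```python
from collections import Counter
from difflib import SequenceMatcher

def field_score(a, b):
    """Similarity of two field values in [0, 1] (stand-in scoring function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def composite_score(r1, r2, records, fields):
    """Weighted average of field scores, with weights derived from the values themselves."""
    weighted = total_weight = 0.0
    for f in fields:
        counts = Counter(r[f] for r in records)
        # Assumed rule: the more frequent the compared values, the less
        # a match on this field says about the records being duplicates.
        weight = 1.0 / max(counts[r1[f]], counts[r2[f]], 1)
        weighted += weight * field_score(r1[f], r2[f])
        total_weight += weight
    return weighted / total_weight

def is_duplicate(r1, r2, records, fields, threshold=0.8):
    """Decide on the composite score (threshold is an assumption)."""
    return composite_score(r1, r2, records, fields) >= threshold
```

In a data store where every record has the same city, a matching city contributes little to the composite score, so two records with different names are not flagged even though one of their two fields agrees exactly.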
Method and system for deduplicating data

Patent number: 10540336

Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.

Type: Grant

Filed: September 26, 2016

Date of Patent: January 21, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
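The repair-then-deduplicate flow in the abstract above can be sketched as follows. The validator functions, the consistency rule and the exact-match deduplication are hypothetical stand-ins; the abstract does not say how validity, candidates or consistency are determined.

```python
from itertools import product

def repair_and_dedupe(records, validators, is_consistent):
    """Replace invalid values by valid candidates, keep consistent variants, drop exact duplicates.

    validators: attribute name -> function returning True for a valid value (assumed).
    is_consistent: function returning True for an acceptable record (assumed).
    """
    # Valid candidates per attribute, taken from the record set itself.
    candidates = {
        attr: sorted({r[attr] for r in records if ok(r[attr])})
        for attr, ok in validators.items()
    }
    seen = set()
    result = []
    for record in records:
        invalid = [attr for attr, ok in validators.items() if not ok(record[attr])]
        # One variant per combination of candidate values; a fully valid
        # record yields exactly one variant, itself.
        for combo in product(*(candidates[attr] for attr in invalid)):
            variant = {**record, **dict(zip(invalid, combo))}
            if not is_consistent(variant):
                continue
            key = tuple(sorted(variant.items()))
            if key not in seen:  # exact-match deduplication as a stand-in
                seen.add(key)
                result.append(variant)
    return result
```

A record whose city is "??" but whose postal code is valid would be expanded into one variant per known city; only the variant whose postal code and city agree survives the consistency check, and it then collapses into the matching clean record.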
Method and system for deduplicating data

Patent number: 10528534

Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.

Type: Grant

Filed: November 28, 2017

Date of Patent: January 7, 2020

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet

COMPUTING THE NEED FOR STANDARDIZATION OF A SET OF VALUES

Publication number: 20190377714

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Application

Filed: August 22, 2019

Publication date: December 12, 2019

Inventors: Namit Kabra, Yannick Saillet

COMPUTING THE NEED FOR STANDARDIZATION OF A SET OF VALUES

Publication number: 20190377715

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Application

Filed: August 22, 2019

Publication date: December 12, 2019

Inventors: Namit Kabra, Yannick Saillet

Data de-duplication

Patent number: 10467203

Abstract: A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.

Type: Grant

Filed: May 20, 2015

Date of Patent: November 5, 2019

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
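To make the pivoting step in the abstract above concrete, here is a small sketch in which two phone columns share a common domain. The column names, the Jaccard similarity and the 0.8 threshold are assumptions, and the sketch reports the pairs of primary keys to merge rather than performing the merge.

```python
from itertools import combinations

def pivot(rows, key, columns):
    """Turn each non-empty cell of the given columns into its own (key, value) row."""
    return [(row[key], row[c]) for row in rows for c in columns if row.get(c)]

def group_values(pivoted):
    """Collect the pivoted values of each primary key into a set."""
    by_key = {}
    for k, v in pivoted:
        by_key.setdefault(k, set()).add(v)
    return by_key

def jaccard(a, b):
    """Shared values over all values (assumed similarity score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_duplicates(rows, key, columns, threshold=0.8):
    """Return pairs of different primary keys whose pivoted values are similar enough."""
    by_key = group_values(pivot(rows, key, columns))
    return [
        (k1, k2)
        for k1, k2 in combinations(sorted(by_key), 2)
        if jaccard(by_key[k1], by_key[k2]) >= threshold
    ]
```

Two records that store the same two phone numbers in swapped columns look different in a column-by-column comparison, but after pivoting their value sets are identical and the pair is reported.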
Column weight calculation for data deduplication

Patent number: 10452627

Abstract: A computer system with the capability to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.

Type: Grant

Filed: June 2, 2016

Date of Patent: October 22, 2019

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet
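One way to picture profile-driven column weights, as described in the abstract above, is the sketch below. The two profile statistics (completeness and distinctness), the way they are multiplied into a weight, and the exact-equality comparison are assumptions; the abstract only says that weights are determined at least in part from a data profile.

```python
def profile(records, attrs):
    """Collect simple descriptive statistics per attribute (assumed profile contents)."""
    n = len(records)
    result = {}
    for a in attrs:
        values = [r[a] for r in records if r.get(a) not in (None, "")]
        result[a] = {
            "completeness": len(values) / n if n else 0.0,
            "distinctness": len(set(values)) / len(values) if values else 0.0,
        }
    return result

def column_weights(prof):
    """Weight attributes that are well populated and highly distinct more heavily."""
    raw = {a: p["completeness"] * p["distinctness"] for a, p in prof.items()}
    total = sum(raw.values()) or 1.0
    return {a: w / total for a, w in raw.items()}

def match_likelihood(r1, r2, weights):
    """Sum the weights of the attributes on which both records hold the same non-empty value."""
    return sum(
        w
        for a, w in weights.items()
        if r1.get(a) not in (None, "") and r1.get(a) == r2.get(a)
    )
```

With four records that all share one country but have four different e-mail addresses, the e-mail column receives a weight of 0.8 and the country column 0.2, so agreeing on the country alone yields a low match likelihood.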
Data de-duplication

Patent number: 10387389

Abstract: A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.

Type: Grant

Filed: September 30, 2014

Date of Patent: August 20, 2019

Assignee: International Business Machines Corporation

Inventors: Namit Kabra, Yannick Saillet

DATA STANDARDIZATION RULES GENERATION

Publication number: 20190179888

Abstract: A method for generating data standardization rules includes receiving a training data set containing tokenized and tagged data values. A set of machine mining models is built using different learning algorithms for identifying tags and tag patterns using the training set. For each data value in a further data set: a tokenization is applied on the data value, resulting in a set of tokens. For each token of the set of tokens one or more tag candidates are determined using a lookup dictionary of tags and tokens and/or at least part of the set of machine mining models, resulting for each token of the set of tokens in a list of possible tags. Unique combinations of the sets of tags of the further data set having highest aggregated confidence values are provided for use as standardization rules.

Type: Application

Filed: December 12, 2017

Publication date: June 13, 2019

Inventors: Yannick Saillet, Martin Oberhofer, Namit Kabra
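The tokenize, tag and aggregate steps in the abstract above are sketched below using only a lookup dictionary; the machine mining models are left out. The dictionary contents, the tag names, the confidence values and the choice to multiply token confidences are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

# Hypothetical lookup dictionary: token -> list of (tag, confidence).
LOOKUP = {
    "st": [("STREET_TYPE", 0.9), ("TITLE", 0.1)],
    "ave": [("STREET_TYPE", 1.0)],
    "dr": [("TITLE", 0.6), ("STREET_TYPE", 0.4)],
}

def tag_candidates(token):
    """Return the possible tags of one token with their confidences."""
    if token.isdigit():
        return [("NUMBER", 1.0)]
    return LOOKUP.get(token.lower(), [("WORD", 0.5)])

def best_pattern(value):
    """Pick the tag combination with the highest combined confidence for one value."""
    tokens = value.replace(",", " ").split()
    best = max(
        product(*(tag_candidates(t) for t in tokens)),
        key=lambda combo: math.prod(conf for _, conf in combo),
    )
    tags = tuple(tag for tag, _ in best)
    return tags, math.prod(conf for _, conf in best)

def candidate_rules(values, top=3):
    """Aggregate confidence per unique tag pattern and return the strongest patterns."""
    aggregated = defaultdict(float)
    for v in values:
        tags, conf = best_pattern(v)
        aggregated[tags] += conf
    return sorted(aggregated.items(), key=lambda kv: -kv[1])[:top]
```

Given "12 Main St", "7 Oak Ave" and "Dr Smith", the pattern (NUMBER, WORD, STREET_TYPE) accumulates confidence from two values and ranks ahead of (TITLE, WORD), making it the first candidate for a standardization rule.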
EFFICIENTLY FINDING POTENTIAL DUPLICATE VALUES IN DATA

Publication number: 20180137189

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Application

Filed: November 11, 2016

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet

COMPUTING THE NEED FOR STANDARDIZATION OF A SET OF VALUES

Publication number: 20180137148

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Application

Filed: November 11, 2016

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet

COMPUTING THE NEED FOR STANDARDIZATION OF A SET OF VALUES

Publication number: 20180137151

Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.

Type: Application

Filed: December 5, 2017

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet

EFFICIENTLY FINDING POTENTIAL DUPLICATE VALUES IN DATA

Publication number: 20180137193

Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.

Type: Application

Filed: December 7, 2017

Publication date: May 17, 2018

Inventors: Namit Kabra, Yannick Saillet

Method and System for Deduplicating Data

Publication number: 20180089233

Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.

Type: Application

Filed: September 26, 2016

Publication date: March 29, 2018

Inventors: Namit Kabra, Yannick Saillet

Method and System for Deduplicating Data

Publication number: 20180089235

Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.

Type: Application

Filed: November 28, 2017

Publication date: March 29, 2018

Inventors: Namit Kabra, Yannick Saillet

COLUMN WEIGHT CALCULATION FOR DATA DEDUPLICATION

Publication number: 20180067973

Abstract: A method to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.

Type: Application

Filed: November 8, 2017

Publication date: March 8, 2018

Inventors: Namit Kabra, Yannick Saillet

COLUMN WEIGHT CALCULATION FOR DATA DEDUPLICATION

Publication number: 20170351717

Abstract: A computer system with the capability to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.

Type: Application

Filed: June 2, 2016

Publication date: December 7, 2017

Inventors: Namit Kabra, Yannick Saillet
