Patents by Inventor Namit Kabra
Namit Kabra has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 10635693
Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.
Type: Grant
Filed: November 11, 2016
Date of Patent: April 28, 2020
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
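The bigram-bitmap idea in the abstract above can be illustrated with a minimal Python sketch. It is not the patented method: the Dice-style score, the 0.6 threshold, the union-find grouping and all function names are assumptions made for illustration. The abstract states only that the score reflects the number of common bigrams and that grouping uses bitwise operations.

```python
from itertools import combinations


def bigrams(value):
    """Return the set of character bigrams of a lower-cased value."""
    text = value.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_bitmap(values):
    """Encode each value as an int whose set bits mark the bigrams it contains."""
    vocab = {}  # bigram -> bit position, assigned on first sight
    rows = []
    for value in values:
        bits = 0
        for bg in bigrams(value):
            bits |= 1 << vocab.setdefault(bg, len(vocab))
        rows.append(bits)
    return rows


def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def similarity(a, b):
    """Assumed Dice-style score: shared bigrams relative to both bigram sets."""
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 0.0


def group_candidates(values, threshold=0.6):
    """Group values whose pairwise score reaches the (assumed) threshold."""
    rows = build_bitmap(values)
    parent = list(range(len(values)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if similarity(rows[i], rows[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(find(i), []).append(value)
    return [g for g in groups.values() if len(g) > 1]


print(group_candidates(["Jonathan", "Jonathon", "Maria"]))
```

The bitwise AND counts the bigrams two values share without comparing strings directly, which is where a bitmap representation would pay off on large attribute columns.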
-
Patent number: 10585865
Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
Type: Grant
Filed: December 5, 2017
Date of Patent: March 10, 2020
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
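The decision flow in the abstract above (use a metadata indication if one exists, otherwise run a scoring algorithm, then compare against a criterion) might look roughly like the following Python sketch. The metadata key `standardize`, the example rule, the distinct-value heuristic and the 0.2 criterion are all hypothetical; the abstract does not specify them.

```python
def strip_dots_and_upper(value):
    """Hypothetical standardization rule: drop periods and upper-case."""
    return value.replace(".", "").upper()


def standardization_score(values, metadata, rule):
    """Return a score in [0, 1]; higher suggests the rule is worth applying."""
    flag = metadata.get("standardize")  # hypothetical metadata indication
    if flag is True:
        return 1.0
    if flag is False:
        return 0.0
    # No indication found: fall back to an assumed heuristic that measures
    # how much the rule shrinks the number of distinct values.
    distinct_before = len(set(values))
    if distinct_before == 0:
        return 0.0
    distinct_after = len({rule(v) for v in values})
    return 1.0 - distinct_after / distinct_before


def should_standardize(values, metadata, rule, criterion=0.2):
    """Compare the score to a predefined (here assumed) criterion."""
    return standardization_score(values, metadata, rule) >= criterion


states = ["NY", "ny", "N.Y.", "CA"]
print(standardization_score(states, {}, strip_dots_and_upper))
print(should_standardize(states, {"standardize": False}, strip_dots_and_upper))
```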
-
Patent number: 10585864
Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
Type: Grant
Filed: November 11, 2016
Date of Patent: March 10, 2020
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20200026782
Abstract: Data-deduplicating includes comparing a first record of a data-store with a second record of the data-store but instead of using a static weight for a field, the present data-deduplicating dynamically assigns a first weight for the first score to generate a first weighted score, wherein the first weight is based on one or both of the first value or the second value; and assigns a second weight for the second score to generate a second weighted score. A composite score is calculated based on the first weighted score and the second weighted score; and it is determined whether or not the first record and the second record are duplicate records, based on the composite score.
Type: Application
Filed: July 17, 2018
Publication date: January 23, 2020
Inventors: Namit Kabra, Manish A. Bhide
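As a rough illustration of value-dependent weighting, the Python sketch below weights each field score by how rare the compared values are in the data store, then combines the weighted scores into a composite. The rarity-based weight, the `difflib` similarity, the 0.85 threshold and all names are assumptions; the abstract says only that the weight depends on one or both of the compared values.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def field_score(a, b):
    """Similarity of two field values in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dynamic_weight(a, b, counts, total):
    """Assumed weighting: agreement on rare values counts for more."""
    freq = max(counts[a], counts[b], 1) / total
    return 1.0 - math.log(freq)  # always >= 1.0 because freq <= 1


def composite_score(r1, r2, records):
    """Weighted average of per-field scores; r1 and r2 come from records."""
    weighted = 0.0
    weight_sum = 0.0
    for field in r1:
        counts = Counter(r[field] for r in records)
        w = dynamic_weight(r1[field], r2[field], counts, len(records))
        weighted += w * field_score(r1[field], r2[field])
        weight_sum += w
    return weighted / weight_sum


def is_duplicate(r1, r2, records, threshold=0.85):
    """Decide on the composite score against an assumed threshold."""
    return composite_score(r1, r2, records) >= threshold


store = [
    {"name": "Ann Lee", "city": "Pune"},
    {"name": "Ann Lee", "city": "Pune"},
    {"name": "Bob Roy", "city": "Oslo"},
]
print(is_duplicate(store[0], store[1], store))
print(is_duplicate(store[0], store[2], store))
```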
-
Patent number: 10540336
Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.
Type: Grant
Filed: September 26, 2016
Date of Patent: January 21, 2020
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
-
Patent number: 10528534
Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.
Type: Grant
Filed: November 28, 2017
Date of Patent: January 7, 2020
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20190377714
Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
Type: Application
Filed: August 22, 2019
Publication date: December 12, 2019
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20190377715
Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
Type: Application
Filed: August 22, 2019
Publication date: December 12, 2019
Inventors: Namit Kabra, Yannick Saillet
-
Patent number: 10467203
Abstract: A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.
Type: Grant
Filed: May 20, 2015
Date of Patent: November 5, 2019
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
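To make the pivoting step in the abstract above concrete, here is a small Python sketch under stated assumptions: the column names, the `difflib` similarity and the 0.9 threshold are invented for illustration, and the sketch stops at reporting merge candidates rather than merging records.

```python
from difflib import SequenceMatcher


def pivot(rows, key, columns, domain):
    """Emit one (key, domain value) row per non-empty common-domain column."""
    pivoted = []
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value:
                pivoted.append({key: row[key], domain: value})
    return pivoted


def find_merge_candidates(pivoted, key, domain, threshold=0.9):
    """Pairs of different primary keys whose pivoted values are similar enough."""
    pairs = set()
    for i, a in enumerate(pivoted):
        for b in pivoted[i + 1:]:
            if a[key] == b[key]:
                continue  # only compare records with different primary keys
            score = SequenceMatcher(None, a[domain], b[domain]).ratio()
            if score >= threshold:
                pairs.add((min(a[key], b[key]), max(a[key], b[key])))
    return sorted(pairs)


customers = [
    {"id": 1, "home_phone": "555-0100", "work_phone": "555-0199"},
    {"id": 2, "home_phone": "555-0199", "work_phone": None},
    {"id": 3, "home_phone": "555-0142", "work_phone": None},
]
phones = pivot(customers, "id", ["home_phone", "work_phone"], "phone")
print(find_merge_candidates(phones, "id", "phone"))
```

In this example a column-by-column comparison would miss the link between records 1 and 2, because the shared number sits in `work_phone` for one and `home_phone` for the other. Pivoting both columns into a single `phone` domain puts the two values side by side.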
-
Patent number: 10452627
Abstract: A computer system with the capability to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.
Type: Grant
Filed: June 2, 2016
Date of Patent: October 22, 2019
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
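One way to picture profile-derived attribute weights is the Python sketch below. The profile statistics (completeness and distinctness), the way they are multiplied into a weight, and the use of exact equality as a stand-in for a similarity measure are all assumptions; the abstract does not say which profile information is used or how.

```python
def profile(records):
    """Per-attribute share of non-empty values and share of distinct values."""
    n = len(records)
    stats = {}
    for attr in records[0]:
        values = [r[attr] for r in records if r[attr] not in (None, "")]
        stats[attr] = {
            "completeness": len(values) / n,
            "distinctness": len(set(values)) / len(values) if values else 0.0,
        }
    return stats


def attribute_weights(stats):
    """Assumed rule: complete, highly distinct attributes weigh more."""
    raw = {a: s["completeness"] * s["distinctness"] for a, s in stats.items()}
    total = sum(raw.values()) or 1.0
    return {a: w / total for a, w in raw.items()}


def match_likelihood(r1, r2, weights):
    """Sum the weights of attributes on which two records agree exactly."""
    return sum(
        w for a, w in weights.items()
        if r1[a] not in (None, "") and r1[a] == r2[a]
    )


people = [
    {"email": "a@x.org", "country": "IN"},
    {"email": "b@x.org", "country": "IN"},
    {"email": "c@x.org", "country": "IN"},
    {"email": "a@x.org", "country": "DE"},
]
weights = attribute_weights(profile(people))
print(weights)
print(match_likelihood(people[0], people[3], weights))
```

Here agreement on `email` moves the likelihood more than agreement on `country`, because the profile shows `email` to be the more discriminating attribute.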
-
Patent number: 10387389
Abstract: A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.
Type: Grant
Filed: September 30, 2014
Date of Patent: August 20, 2019
Assignee: International Business Machines Corporation
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20190179888
Abstract: A method for generating data standardization rules includes receiving a training data set containing tokenized and tagged data values. A set of machine mining models is built using different learning algorithms for identifying tags and tag patterns using the training set. For each data value in a further data set: a tokenization is applied on the data value, resulting in a set of tokens. For each token of the set of tokens one or more tag candidates are determined using a lookup dictionary of tags and tokens and/or at least part of the set of machine mining models, resulting for each token of the set of tokens in a list of possible tags. Unique combinations of the sets of tags of the further data set having highest aggregated confidence values are provided for use as standardization rules.
Type: Application
Filed: December 12, 2017
Publication date: June 13, 2019
Inventors: Yannick Saillet, Martin Oberhofer, Namit Kabra
-
Publication number: 20180137189
Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.
Type: Application
Filed: November 11, 2016
Publication date: May 17, 2018
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20180137148
Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
Type: Application
Filed: November 11, 2016
Publication date: May 17, 2018
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20180137151
Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
Type: Application
Filed: December 5, 2017
Publication date: May 17, 2018
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20180137193
Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.
Type: Application
Filed: December 7, 2017
Publication date: May 17, 2018
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20180089233
Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.
Type: Application
Filed: September 26, 2016
Publication date: March 29, 2018
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20180089235
Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.
Type: Application
Filed: November 28, 2017
Publication date: March 29, 2018
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20180067973
Abstract: A method to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.
Type: Application
Filed: November 8, 2017
Publication date: March 8, 2018
Inventors: Namit Kabra, Yannick Saillet
-
Publication number: 20170351717
Abstract: A computer system with the capability to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.
Type: Application
Filed: June 2, 2016
Publication date: December 7, 2017
Inventors: Namit Kabra, Yannick Saillet