Patents by Inventor Namit Kabra

Namit Kabra has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 10635693
    Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.
    Type: Grant
    Filed: November 11, 2016
    Date of Patent: April 28, 2020
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Patent number: 10585865
    Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
    Type: Grant
    Filed: December 5, 2017
    Date of Patent: March 10, 2020
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Patent number: 10585864
    Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
    Type: Grant
    Filed: November 11, 2016
    Date of Patent: March 10, 2020
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20200026782
    Abstract: Data-deduplicating includes comparing a first record of a data-store with a second record of the data-store but instead of using a static weight for a field, the present data-deduplicating dynamically assigns a first weight for the first score to generate a first weighted score, wherein the first weight is based on one or both of the first value or the second value; and assigns a second weight for the second score to generate a second weighted score. A composite score is calculated based on the first weighted score and the second weighted score; and it is determined whether or not the first record and the second record are duplicate records, based on the composite score.
    Type: Application
    Filed: July 17, 2018
    Publication date: January 23, 2020
    Inventors: Namit Kabra, Manish A. Bhide
  • Patent number: 10540336
    Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.
    Type: Grant
    Filed: September 26, 2016
    Date of Patent: January 21, 2020
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Patent number: 10528534
    Abstract: A mechanism is provided for deduplicating a set of records of data. The mechanism identifies a subset of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record. The mechanism deduplicates the selected subset of records of the modified set of records responsive to determining the subset of records comprises more than one record.
    Type: Grant
    Filed: November 28, 2017
    Date of Patent: January 7, 2020
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20190377714
    Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
    Type: Application
    Filed: August 22, 2019
    Publication date: December 12, 2019
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20190377715
    Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
    Type: Application
    Filed: August 22, 2019
    Publication date: December 12, 2019
    Inventors: Namit Kabra, Yannick Saillet
  • Patent number: 10467203
    Abstract: A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.
    Type: Grant
    Filed: May 20, 2015
    Date of Patent: November 5, 2019
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Patent number: 10452627
    Abstract: A computer system with the capability to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.
    Type: Grant
    Filed: June 2, 2016
    Date of Patent: October 22, 2019
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Patent number: 10387389
    Abstract: A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.
    Type: Grant
    Filed: September 30, 2014
    Date of Patent: August 20, 2019
    Assignee: International Business Machines Corporation
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20190179888
    Abstract: A method for generating data standardization rules includes receiving a training data set containing tokenized and tagged data values. A set of machine mining models is built using different learning algorithms for identifying tags and tag patterns using the training set. For each data value in a further data set: a tokenization is applied on the data value, resulting in a set of tokens. For each token of the set of tokens one or more tag candidates are determined using a lookup dictionary of tags and tokens and/or at least part of the set of machine mining models, resulting for each token of the set of tokens in a list of possible tags. Unique combinations of the sets of tags of the further data set having highest aggregated confidence values are provided for use as standardization rules.
    Type: Application
    Filed: December 12, 2017
    Publication date: June 13, 2019
    Inventors: Yannick Saillet, Martin Oberhofer, Namit Kabra
  • Publication number: 20180137189
    Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.
    Type: Application
    Filed: November 11, 2016
    Publication date: May 17, 2018
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20180137148
    Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
    Type: Application
    Filed: November 11, 2016
    Publication date: May 17, 2018
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20180137151
    Abstract: A method, system and computer program product for determining a data standardization score for an attribute of a dataset. A data standardization score is calculated, which reflects whether data quality of attribute values would increase if a standardization rule is applied to the attribute values. Based on attribute metadata, it may be determined whether an indication to carry or not to carry out standardization is available for at least part of the attribute values of the dataset. In response to finding the indication, a respective value may be set for the data standardization score. In response to not finding the indication, a data standardization score algorithm may be run on the at least part of the attribute values of the dataset. The data standardization score value may be compared to a predefined criterion to determine whether data standardization is to be applied on the attribute.
    Type: Application
    Filed: December 5, 2017
    Publication date: May 17, 2018
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20180137193
    Abstract: A method, system and computer program product for finding groups of potential duplicates in attribute values. Each attribute value of the attribute values is converted to a respective set of bigrams. All bigrams present in the attribute values may be determined. Bigrams present in the attribute values may be represented as bits. This may result in a bitmap representing the presence of the bigrams in the attribute values. The attribute values may be grouped using bitwise operations on the bitmap, where each group includes attribute values that are determined based on pairwise bigram-based similarity scores. The pairwise bigram-based similarity score reflects the number of common bigrams between two attribute values.
    Type: Application
    Filed: December 7, 2017
    Publication date: May 17, 2018
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20180089233
    Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.
    Type: Application
    Filed: September 26, 2016
    Publication date: March 29, 2018
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20180089235
    Abstract: A mechanism is provided for deduplicating a set of records of data. Each record of the set of records has a set of attributes. The mechanism identifies a subset of records of the set of records each having one or more invalid attribute values. For each invalid attribute value of a given attribute of the subset of the set of records the mechanism determines one or more associated valid candidates of attribute values of the given attribute using the set of records. For each record of the subset of records of the set of records the mechanism replaces the one or more invalid attribute values by one or more combinations of the determined valid candidates of attribute values, resulting in a modified set of records. The mechanism selects a subset of records of the modified set of records that satisfy a consistency condition on the attribute values of each record.
    Type: Application
    Filed: November 28, 2017
    Publication date: March 29, 2018
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20180067973
    Abstract: A method to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.
    Type: Application
    Filed: November 8, 2017
    Publication date: March 8, 2018
    Inventors: Namit Kabra, Yannick Saillet
  • Publication number: 20170351717
    Abstract: A computer system with the capability to identify potentially duplicative records in a data set is provided. A computer may collect a data profile for the data set that provides descriptive information with regard to attributes of the data set. Based, at least in part, on the data profile, weights are determined for the attributes. As values of a data record are compared to values of the same respective attributes in other records, the overall likelihood of a match or duplicate, as indicated by the degree of similarity between values, is modified based on the determined weights associated with the respective attributes.
    Type: Application
    Filed: June 2, 2016
    Publication date: December 7, 2017
    Inventors: Namit Kabra, Yannick Saillet