Patents by Inventor Rajesh M. Desai

Rajesh M. Desai has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20250117405
    Abstract: A computer-implemented process for generating cluster templates used for creating extraction models includes the following operations. A plurality of training files associated with a selected class are received. An automated visual analysis is performed on each of the plurality of training files. An automated contextual analysis is performed on each of the plurality of training files. A first clustering of the plurality of training files into a first plurality of clusters using results from the automated visual analysis is performed. A second clustering of one of the plurality of clusters into a second plurality of clusters is performed using results from the automated contextual analysis. Cluster templates for the first and second plurality of clusters are generated.
    Type: Application
    Filed: October 6, 2023
    Publication date: April 10, 2025
    Inventors: Shalin Avlani, Rajesh M. Desai, Mayank Vipin Shah, Xiaoying Gao
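The two-stage clustering described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed method: the visual and contextual analyses are replaced by stand-ins (a layout signature string and a keyword set), every name (`TrainingFile`, `cluster_templates`, `layout_signature`, `keywords`) is hypothetical, and the sketch sub-clusters every visual cluster, whereas the abstract only requires sub-clustering one of them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingFile:
    name: str
    layout_signature: str   # stand-in for the visual analysis result
    keywords: frozenset     # stand-in for the contextual analysis result


def cluster_templates(files):
    """Group files by visual signature, then split each group by keywords."""
    # First clustering: uses only the visual result.
    by_visual = defaultdict(list)
    for f in files:
        by_visual[f.layout_signature].append(f)

    templates = {}
    for signature, members in by_visual.items():
        # Second clustering: uses only the contextual result,
        # applied inside a single visual cluster.
        by_context = defaultdict(list)
        for f in members:
            by_context[f.keywords].append(f.name)
        # One template entry per visual cluster, listing its sub-clusters.
        templates[signature] = {
            tuple(sorted(k)): sorted(names) for k, names in by_context.items()
        }
    return templates
```

Exact-match grouping is used here only to keep the sketch short; a real system would presumably cluster on feature similarity rather than equality.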
  • Patent number: 12259920
    Abstract: Disclosed embodiments provide techniques for monitoring and evaluating the effectiveness of key value pairs (KVPs) used in a document processing system. In embodiments, KVPs are obtained from multiple extractors of a document processing system. A score is computed for the KVPs by computing an effectiveness metric for each KVP from the multiple KVPs. In response to the computed score being below a predetermined threshold, a model retraining process is performed to generate a new set of KVP extractors, and provide the new set of KVPs to the document processing system.
    Type: Grant
    Filed: September 7, 2023
    Date of Patent: March 25, 2025
    Assignee: International Business Machines Corporation
    Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Yang Zhong Li, Rajesh M. Desai, Xue Lan Zhang
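The score-and-threshold step in the abstract above can be sketched as follows. The abstract does not define the effectiveness metric, so the one used here (the fraction of extractions a reviewer accepted) is an assumption, as are the names `kvp_effectiveness` and `needs_retraining`, the field names, and the default threshold of 0.8.

```python
def kvp_effectiveness(kvp):
    """Hypothetical metric: fraction of extractions that were accepted."""
    total = kvp["accepted"] + kvp["rejected"]
    return kvp["accepted"] / total if total else 0.0


def needs_retraining(kvps, threshold=0.8):
    """Return (score, flag): the mean per-KVP effectiveness and whether
    it falls below the threshold, which would trigger retraining."""
    if not kvps:
        return 0.0, True
    score = sum(kvp_effectiveness(k) for k in kvps) / len(kvps)
    return score, score < threshold
```

A caller would pass the KVPs gathered from all extractors and start the retraining process only when the returned flag is true.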
  • Publication number: 20250086222
    Abstract: Disclosed embodiments provide techniques for monitoring and evaluating the effectiveness of key value pairs (KVPs) used in a document processing system. In embodiments, KVPs are obtained from multiple extractors of a document processing system. A score is computed for the KVPs by computing an effectiveness metric for each KVP from the multiple KVPs. In response to the computed score being below a predetermined threshold, a model retraining process is performed to generate a new set of KVP extractors, and provide the new set of KVPs to the document processing system.
    Type: Application
    Filed: September 7, 2023
    Publication date: March 13, 2025
    Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Yang Zhong Li, Rajesh M. Desai, Xue Lan Zhang
  • Publication number: 20240420296
    Abstract: Mechanisms are provided for automated document image annotation and data extraction. A received document image is processed to identify a document type of the received document image, and retrieve a corresponding document template having key point location data and annotation location data for documents of the document type. First key points of the received document image are matched with second key points of the corresponding document template and a mapping is generated to map locations of the document template to locations of the received document image. A perspective transformation is applied, based on the mapping, to first annotation locations specified in the document template data structure to generate second annotation locations corresponding to locations in the received document image. Data extraction is performed on data associated with the second annotation locations based on the annotations corresponding to the second annotation locations.
    Type: Application
    Filed: June 14, 2023
    Publication date: December 19, 2024
    Inventors: Kai Zhang, Xiaoying Gao, Rajesh M. Desai, Sudhakar Basireddy
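The step in the abstract above that carries annotation locations from the template into the received image can be sketched as applying a 3x3 perspective (homography) matrix to each annotation corner. This sketch assumes the matrix has already been estimated from the matched key points; the function names and the dictionary layout of `template_boxes` are hypothetical.

```python
def apply_homography(h, point):
    """Map (x, y) through a 3x3 homography given as nested lists."""
    x, y = point
    xp = h[0][0] * x + h[0][1] * y + h[0][2]
    yp = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to 2D.
    return (xp / w, yp / w)


def map_annotations(h, template_boxes):
    """Transform each template box (a list of corner points) into
    coordinates of the received document image."""
    return {
        field: [apply_homography(h, corner) for corner in corners]
        for field, corners in template_boxes.items()
    }
```

Data extraction would then read the image regions bounded by the transformed corners for each annotated field.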
  • Patent number: 12056948
    Abstract: In an approach, a processor identifies a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table. A processor classifies the plurality of text separators into a number of target clusters comprised in a target group based on property information related to the plurality of text separators, the number of target clusters corresponding to a number of separator types. A processor provides indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.
    Type: Grant
    Filed: July 19, 2021
    Date of Patent: August 6, 2024
    Assignee: International Business Machines Corporation
    Inventors: Ang Yi, Nazrul Islam, Rajesh M. Desai, Jing Zhang, Dong Rui Li, Xue Mei Deng, Ye Chen, Hai Cheng Wang
  • Publication number: 20240193978
    Abstract: Computer implemented methods, systems, and computer program products include program code executing on a processor(s) that merges a document comprising multiple pages into a single document image. The program code processes the single document image to identify structural elements and textual content. The program code compares the structural elements of the single document image to other structural elements of a group of document templates stored in a database to identify a subset of the group of document templates with a threshold number of similarities to the single document image. The program code generates, from the single document image, a graph structure representing the document, where the graph structure comprises visual information and connections related to the structural elements and concepts comprising the textual content. The program code uses the graph structure to identify a document template that is a closest match to the document.
    Type: Application
    Filed: December 13, 2022
    Publication date: June 13, 2024
    Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Rajesh M. Desai, Yang Zhong Li, Ye Chen
  • Publication number: 20240046677
    Abstract: A computer-implemented method for text block segmentation includes determining a first text block segmentation pattern utilized to generate a segmented text block based, at least in part, on a comparison of semantic information associated with the segmented text block and a plurality of predefined types of text block segmentation patterns indicated by a graph; calculating a first degree of confidence in a size of the segmented text block based, at least in part, on comparing semantic entities associated with the segmented text block with semantic entities indicated by leaf nodes stemming from a first non-leaf node included in the graph and representative of the first type of text block segmentation pattern; and determining that the size of the segmented text block is non-optimal based on the calculated degree of confidence in the size of the segmented text block being below a predetermined threshold.
    Type: Application
    Filed: July 26, 2022
    Publication date: February 8, 2024
    Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Rajesh M. Desai, Yang Zhong Li, Xue Xu
  • Patent number: 11842524
    Abstract: A mechanism is provided for implementing an optical character recognition (OCR) error correction mechanism for correcting OCR errors. Responsive to receiving a document in which OCR has been performed, the mechanism assesses the document to identify a set of OCR errors generated by an OCR engine that performed the OCR using a set of visual embeddings. Responsive to identifying the set of OCR errors, the mechanism analyzes each character of a plurality of sentences within the document to generate a high-dimensional embedding for the characters of the plurality of sentences within the document. The mechanism then linguistically corrects each OCR error in the set of OCR errors. The mechanism utilizes ground truth information and the set of visual embeddings to verify that the character stream is linguistically correct. Responsive to verifying that the character stream is linguistically correct, the mechanism outputs an OCR error corrected document to a user.
    Type: Grant
    Filed: April 30, 2021
    Date of Patent: December 12, 2023
    Assignee: International Business Machines Corporation
    Inventors: Rajesh M. Desai, Ayush Utkarsh, Nazrul Islam, Praveen Vyas
  • Patent number: 11829424
    Abstract: Discovering second-order documents and latent custodians in an e-discovery system is provided. A list of first-order documents and document custodians within a base state of the e-discovery system are identified based on a plurality of terms corresponding to a meet and confer practice for a legal matter instance. The plurality of terms is masked within the first-order documents. The first-order documents having the plurality of terms masked are divided into groups. A list of second-order documents is generated from a group of documents. A list of second-order document custodians is generated based on corresponding custodian relationships to second-order documents. Finally, each second-order document custodian in the list of second-order document custodians that has a corresponding rank exceeding a defined rank threshold level is identified as an official document custodian in the e-discovery system.
    Type: Grant
    Filed: February 20, 2020
    Date of Patent: November 28, 2023
    Assignee: International Business Machines Corporation
    Inventors: Roger C. Raphael, Rajesh M. Desai, Nazrul Islam, Magesh Jayapandian, Jojo Joseph
  • Patent number: 11741258
    Abstract: Dynamic data dissemination is provided. A resolved data subject identifier corresponding to a data subject is selected from a set of resolved data subject identifiers existing in rows of a data asset. In response to determining that the resolved data subject identifier does not correspond to a right to forget list, it is determined that the resolved data subject identifier corresponds to a data subject request list. The rows are transformed to anonymize existing pseudo and personal identifiers in cells of the rows that are tied to columns associated with data classes for which specific consent dimensions have been indicated as revoked by the data subject.
    Type: Grant
    Filed: April 16, 2021
    Date of Patent: August 29, 2023
    Assignee: International Business Machines Corporation
    Inventors: Roger C. Raphael, Rajesh M. Desai, Scott Schumacher, Angineh Aghakiant
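The row transformation at the end of the abstract above can be sketched as masking the cells whose columns belong to data classes with revoked consent. This sketch covers only that final step and leaves out the checks against the right to forget list and the data subject request list. The `subject_id` field, the `column_classes` mapping, and the fixed `"REDACTED"` mask are all assumptions made for illustration.

```python
def anonymize_rows(rows, subject_id, column_classes, revoked_classes):
    """Mask cells for one data subject in columns whose data class has
    had consent revoked. Returns new rows; the input is not modified."""
    out = []
    for row in rows:
        new_row = dict(row)
        if row.get("subject_id") == subject_id:
            for column, data_class in column_classes.items():
                if data_class in revoked_classes and column in new_row:
                    new_row[column] = "REDACTED"
        out.append(new_row)
    return out
```

Rows belonging to other data subjects pass through unchanged, as do columns whose data class is still covered by consent.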
  • Patent number: 11647004
    Abstract: Preserving distributions of data values of a data asset in a data anonymization operation is provided. Anonymizing data values is performed by transforming sensitive data in a set of columns over rows of the data asset while preserving distribution of the data values in the set of transformed columns to a defined degree using a set of autoencoders and a loss function. The autoencoders are base trained from preexisting data in a data assets catalog and actively trained during data dissemination. Parametric coefficients of the loss function are configured and the threshold is generated using policies from an enforcement decision for the data asset and data consumer. The loss function value of a selected row is compared to the threshold. Transformed data values of the selected row are transcribed to an output row when the loss function value is greater than the threshold and disseminated to the data consumer.
    Type: Grant
    Filed: March 24, 2021
    Date of Patent: May 9, 2023
    Assignee: International Business Machines Corporation
    Inventors: Arjun Natarajan, Ashish Kundu, Roger C. Raphael, Aniya Aggarwal, Rajesh M. Desai, Joshua F. Payne, Mu Qiao
  • Publication number: 20230012784
    Abstract: In an approach, a processor identifies a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table. A processor classifies the plurality of text separators into a number of target clusters comprised in a target group based on property information related to the plurality of text separators, the number of target clusters corresponding to a number of separator types. A processor provides indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.
    Type: Application
    Filed: July 19, 2021
    Publication date: January 19, 2023
    Inventors: Ang Yi, Nazrul Islam, Rajesh M. Desai, Jing Zhang, Dong Rui Li, Xue Mei Deng, Ye Chen, Hai Cheng Wang
  • Publication number: 20220350998
    Abstract: A mechanism is provided for implementing an optical character recognition (OCR) error correction mechanism for correcting OCR errors. Responsive to receiving a document in which OCR has been performed, the mechanism assesses the document to identify a set of OCR errors generated by an OCR engine that performed the OCR using a set of visual embeddings. Responsive to identifying the set of OCR errors, the mechanism analyzes each character of a plurality of sentences within the document to generate a high-dimensional embedding for the characters of the plurality of sentences within the document. The mechanism then linguistically corrects each OCR error in the set of OCR errors. The mechanism utilizes ground truth information and the set of visual embeddings to verify that the character stream is linguistically correct. Responsive to verifying that the character stream is linguistically correct, the mechanism outputs an OCR error corrected document to a user.
    Type: Application
    Filed: April 30, 2021
    Publication date: November 3, 2022
    Inventors: Rajesh M. Desai, Ayush Utkarsh, Nazrul Islam, Praveen Vyas
  • Publication number: 20220335156
    Abstract: Dynamic data dissemination is provided. A resolved data subject identifier corresponding to a data subject is selected from a set of resolved data subject identifiers existing in rows of a data asset. In response to determining that the resolved data subject identifier does not correspond to a right to forget list, it is determined that the resolved data subject identifier corresponds to a data subject request list. The rows are transformed to anonymize existing pseudo and personal identifiers in cells of the rows that are tied to columns associated with data classes for which specific consent dimensions have been indicated as revoked by the data subject.
    Type: Application
    Filed: April 16, 2021
    Publication date: October 20, 2022
    Inventors: Roger C. Raphael, Rajesh M. Desai, Scott Schumacher, Angineh Aghakiant
  • Publication number: 20220311749
    Abstract: Preserving distributions of data values of a data asset in a data anonymization operation is provided. Anonymizing data values is performed by transforming sensitive data in a set of columns over rows of the data asset while preserving distribution of the data values in the set of transformed columns to a defined degree using a set of autoencoders and a loss function. The autoencoders are base trained from preexisting data in a data assets catalog and actively trained during data dissemination. Parametric coefficients of the loss function are configured and the threshold is generated using policies from an enforcement decision for the data asset and data consumer. The loss function value of a selected row is compared to the threshold. Transformed data values of the selected row are transcribed to an output row when the loss function value is greater than the threshold and disseminated to the data consumer.
    Type: Application
    Filed: March 24, 2021
    Publication date: September 29, 2022
    Inventors: Arjun Natarajan, Ashish Kundu, Roger C. Raphael, Aniya Aggarwal, Rajesh M. Desai, Joshua F. Payne, Mu Qiao
  • Patent number: 11372831
    Abstract: Processing a database query for sets of data includes assigning a unique identifier from an integer space to each entity within data and creating one or more sets of entities each pertaining to a corresponding entity within the data. A representation is then generated on disk for each set of entities, wherein each representation encompasses and is suited for a range of the unique identifiers of entities within a corresponding set and indicates a presence of an entity within that corresponding set. Finally, a query is processed based on the representation for each set of entities to retrieve data satisfying the query, wherein the representation provides a constant time for association and dissociation operations that are append-only operations with deferred merge and automatic filtering of deleted and duplicate entities at query time.
    Type: Grant
    Filed: July 29, 2019
    Date of Patent: June 28, 2022
    Assignee: International Business Machines Corporation
    Inventors: Rajesh M. Desai, Magesh Jayapandian, Iun V. Leong, Justo L. Perez, Roger C. Raphael, Gabriel Valencia
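The append-only association and dissociation with deferred merge described in the abstract above can be sketched as an operation log replayed at query time. This is an in-memory illustration only: the abstract describes an on-disk representation suited to a range of integer identifiers, which this sketch does not attempt, and the class and method names are hypothetical.

```python
class AppendOnlyEntitySet:
    """Set of integer entity ids with O(1) appends and deferred merge."""

    def __init__(self):
        self._log = []        # append-only (op, entity_id) records
        self._merged = set()  # state as of the last merge

    def associate(self, entity_id):
        self._log.append(("+", entity_id))

    def dissociate(self, entity_id):
        self._log.append(("-", entity_id))

    def _replay(self, state):
        # Later records win, so deletes and duplicates resolve naturally.
        for op, entity_id in self._log:
            if op == "+":
                state.add(entity_id)
            else:
                state.discard(entity_id)
        return state

    def merge(self):
        """Deferred merge: fold the log into the merged state."""
        self._replay(self._merged)
        self._log.clear()

    def members(self):
        """Query-time view that filters deleted and duplicate entities."""
        return sorted(self._replay(set(self._merged)))
```

Writes never touch the merged state, so their cost stays constant; the work of reconciling duplicates and deletions is paid at query or merge time instead.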
  • Patent number: 11362997
    Abstract: A method, apparatus, system, and computer program product evaluate an information asset with a corpus of policies in conjunction with the context of access including a specific user. A large corresponding set of rules in the policy corpus is identified by the computer system. A continuous process of rule evaluation occurs against information asset metadata, wherein a series of processing steps, including identification of a set of common subexpressions between the predicates of all active rules, pre-evaluation, compaction, and storage, is performed by the computer system on the policy and rule corpus. Metadata for the information asset is applied by the computer system to the set of common subexpressions to form partially evaluated rules for the policy. The partially evaluated rules, once compacted, are stored by the computer system in association with the information asset.
    Type: Grant
    Filed: October 16, 2019
    Date of Patent: June 14, 2022
    Assignee: International Business Machines Corporation
    Inventors: Roger C. Raphael, Rajesh M. Desai, Iun Veng Leong, Brian Joseph Owings
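The partial evaluation described in the abstract above can be sketched as follows, assuming each rule is a conjunction of named subexpressions and that asset metadata resolves some of them ahead of time. The rule format, the subexpression names, and the function `partially_evaluate` are hypothetical; the abstract does not specify how rules or predicates are represented.

```python
def partially_evaluate(rules, known):
    """Pre-evaluate rules against asset metadata.

    rules: {rule_name: [subexpression ids]}, each rule an AND of its ids.
    known: {subexpression id: bool}, resolved once from asset metadata.
    Returns, per rule: False if it can never fire, True if it is already
    satisfied, or the list of ids that still depend on access context.
    """
    compacted = {}
    for name, preds in rules.items():
        if any(known.get(p) is False for p in preds):
            compacted[name] = False
            continue
        remaining = [p for p in preds if p not in known]
        compacted[name] = remaining if remaining else True
    return compacted
```

Because shared subexpressions are resolved once in `known`, each rule that reuses them is compacted without re-evaluating the metadata, and only the leftover predicates need checking when a specific user requests access.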
  • Patent number: 11347891
    Abstract: Disclosed is a computer-implemented method to identify and anonymize personal information, the method comprising analyzing a first corpus with a personal information sniffer, wherein the first corpus includes unstructured text, wherein the personal information sniffer is configured to detect a set of types of personal information, and wherein the personal information sniffer produces a first set of results. The method comprises analyzing the first corpus with a set of annotators, wherein each annotator is configured to identify all instances of a type of personal information in the corpus, and wherein the set of annotators produces a second set of results. The method comprises comparing the first set of results and the second set of results, determining that the first set of results does not match the second set of results, and updating, based on the determining, the personal information sniffer.
    Type: Grant
    Filed: June 19, 2019
    Date of Patent: May 31, 2022
    Assignee: International Business Machines Corporation
    Inventors: Roger C. Raphael, Rajesh M. Desai, Iun Veng Leong, Ramakanta Samal, Ansel Blume
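The compare-and-update loop in the abstract above can be sketched as follows. Everything here is an assumption made for illustration: the sniffer is modeled as a set of enabled personal information types, the annotators as two simple regular expressions, and "updating the sniffer" as enabling the types it missed. The abstract does not say how the sniffer or the annotators work internally.

```python
import re

# Hypothetical annotators: one pattern per personal information type.
ANNOTATORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def annotate(corpus):
    """Second set of results: every type any annotator finds."""
    return {t for t, pattern in ANNOTATORS.items() if pattern.search(corpus)}


def sniff(corpus, enabled_types):
    """First set of results: only the types the sniffer is configured for."""
    return {t for t in enabled_types if ANNOTATORS[t].search(corpus)}


def reconcile(corpus, enabled_types):
    """Compare both result sets and widen the sniffer on a mismatch."""
    first = sniff(corpus, enabled_types)
    second = annotate(corpus)
    if first != second:
        enabled_types = set(enabled_types) | (second - first)
    return enabled_types
```

After one pass, a sniffer that was blind to a type present in the corpus picks it up on the next run.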
  • Patent number: 11308235
    Abstract: A method, system and computer program product for detecting sensitive personal information in a storage device. A block delta list containing a list of changed blocks in the storage device is processed. After identifying the changed blocks from the block delta list, a search is performed on those identified changed blocks for sensitive personal information using a character scanning technique. After identifying a changed block deemed to contain sensitive personal information, the changed block is translated from the block level to the file level using a hierarchical reverse mapping technique. By only analyzing the changed blocks to determine if they contain sensitive personal information, a lesser quantity of blocks needs to be processed in order to detect sensitive personal information in the storage device in near real-time. In this manner, sensitive personal information is detected in the storage device using fewer computing resources in a shorter amount of time.
    Type: Grant
    Filed: March 6, 2020
    Date of Patent: April 19, 2022
    Assignee: International Business Machines Corporation
    Inventors: Rajesh M. Desai, Mu Qiao, Roger C. Raphael, Ramani Routray
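The changed-block scan described in the abstract above can be sketched as follows. The block store, the delta list, and the block-to-file map are modeled as plain Python containers, the "character scanning technique" is replaced by a single regular expression for one identifier format, and the flat `block_to_file` dictionary stands in for the hierarchical reverse mapping. All of these are assumptions, as is the function name.

```python
import re

# Stand-in scanner: one pattern for a US Social Security number format.
SSN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def scan_changed_blocks(blocks, delta_list, block_to_file):
    """Scan only the blocks named in the delta list and return the
    sorted file paths that contain a match."""
    flagged = set()
    for block_no in delta_list:
        data = blocks.get(block_no, b"")
        if SSN.search(data):
            # Translate from block level to file level.
            path = block_to_file.get(block_no)
            if path is not None:
                flagged.add(path)
    return sorted(flagged)
```

Unchanged blocks are never read, which is where the savings described in the abstract come from.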
  • Patent number: 11250527
    Abstract: Embodiments generally relate to providing litigation management for multiple remote content systems using asynchronous bi-directional replication pipelines. In some embodiments, a method includes retrieving, at one or more inbound replicators of one or more respective bi-directional pipelines, metadata associated with documents stored in one or more content repositories. The method further includes resolving, at a governance control hub, conflicts associated with legal holds on one or more of the documents based on the metadata. The method further includes sending conflict resolution results from one or more outbound applicators of the bi-directional pipelines to the content repositories, where the content repositories enforce legal holds on the documents.
    Type: Grant
    Filed: June 18, 2019
    Date of Patent: February 15, 2022
    Assignee: International Business Machines Corporation
    Inventors: Roger C. Raphael, Ronald L. Rathgeber, Rajesh M. Desai, Gabriel Valencia, Justo Perez, William Russell Belknap, Sudhakar Basireddy