Patents by Inventor Rajesh M. Desai

Rajesh M. Desai has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

MULTI-LAYER APPROACH TO IMPROVING GENERATION OF FIELD EXTRACTION MODELS

Publication number: 20250117405

Abstract: A computer-implemented process for generating cluster templates used for creating extraction models includes the following operations. A plurality of training files associated with a selected class are received. An automated visual analysis is performed on each of the plurality of training files. An automated contextual analysis is performed on each of the plurality of training files. A first clustering of the plurality of training files into a first plurality of clusters using results from the automated visual analysis is performed. A second clustering of one of the plurality of clusters into a second plurality of clusters is performed using results from the automated contextual analysis. Cluster templates for the first and second plurality of clusters are generated.

Type: Application

Filed: October 6, 2023

Publication date: April 10, 2025

Inventors: Shalin Avlani, Rajesh M. Desai, Mayank Vipin Shah, Xiaoying Gao

Dynamic optimization of key value pair extractors for document data extraction

Patent number: 12259920

Abstract: Disclosed embodiments provide techniques for monitoring and evaluating the effectiveness of key value pairs (KVPs) used in a document processing system. In embodiments, KVPs are obtained from multiple extractors of a document processing system. A score is computed for the KVPs by computing an effectiveness metric for each KVP from the multiple KVPs. In response to the computed score being below a predetermined threshold, a model retraining process is performed to generate a new set of KVP extractors, and provide the new set of KVPs to the document processing system.
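The monitoring loop in this abstract can be sketched roughly as follows. This is only an illustration: the abstract does not say how the effectiveness metric is defined or how per-KVP metrics are combined, so the acceptance-rate metric, the averaging, the 0.8 threshold, and every name below are assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class KVP:
    key: str
    value: str
    times_extracted: int  # hypothetical: how often an extractor produced this KVP
    times_accepted: int   # hypothetical: how often a reviewer kept it unchanged


def effectiveness(kvp: KVP) -> float:
    """Assumed per-KVP metric: share of extractions accepted as-is."""
    if kvp.times_extracted == 0:
        return 0.0
    return kvp.times_accepted / kvp.times_extracted


def needs_retraining(kvps: list[KVP], threshold: float = 0.8) -> bool:
    """Combine the per-KVP metrics into one score and compare it to the threshold."""
    if not kvps:
        return False
    score = mean(effectiveness(k) for k in kvps)
    return score < threshold


kvps = [
    KVP("invoice_no", "INV-001", times_extracted=10, times_accepted=9),
    KVP("total", "42.00", times_extracted=10, times_accepted=5),
]
# The two metrics are 0.9 and 0.5, so the mean falls below the assumed 0.8 threshold.
print(needs_retraining(kvps))
```

When `needs_retraining` returns true, the system described in the abstract would run its retraining process to produce new extractors; that step is not modeled here.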

Type: Grant

Filed: September 7, 2023

Date of Patent: March 25, 2025

Assignee: International Business Machines Corporation

Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Yang Zhong Li, Rajesh M. Desai, Xue Lan Zhang

DYNAMIC OPTIMIZATION OF KEY VALUE PAIR EXTRACTORS FOR DOCUMENT DATA EXTRACTION

Publication number: 20250086222

Abstract: Disclosed embodiments provide techniques for monitoring and evaluating the effectiveness of key value pairs (KVPs) used in a document processing system. In embodiments, KVPs are obtained from multiple extractors of a document processing system. A score is computed for the KVPs by computing an effectiveness metric for each KVP from the multiple KVPs. In response to the computed score being below a predetermined threshold, a model retraining process is performed to generate a new set of KVP extractors, and provide the new set of KVPs to the document processing system.

Type: Application

Filed: September 7, 2023

Publication date: March 13, 2025

Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Yang Zhong Li, Rajesh M. Desai, Xue Lan Zhang

Annotation Based Document Processing with Imperfect Document Images

Publication number: 20240420296

Abstract: Mechanisms are provided for automated document image annotation and data extraction. A received document image is processed to identify a document type of the received document image, and retrieve a corresponding document template having key point location data and annotation location data for documents of the document type. First key points of the received document image are matched with second key points of the corresponding document template and a mapping is generated to map locations of the document template to locations of the received document image. A perspective transformation is applied, based on the mapping, to first annotation locations specified in the document template data structure to generate second annotation locations corresponding to locations in the received document image. Data extraction is performed on data associated with the second annotation locations based on the annotations corresponding to the second annotation locations.
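The mapping step in this abstract (template locations to image locations through a perspective transformation) can be illustrated with a small pure-Python sketch. It assumes exactly four key point pairs that have already been matched; a production system would likely use many matches and a robust estimator, which the abstract does not detail. All function names and the sample coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_perspective(template_pts, image_pts):
    """Estimate a 3x3 perspective matrix from four matched key point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(template_pts, image_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)  # eight unknowns; the ninth entry is fixed to 1
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def map_point(h, point):
    """Apply the perspective matrix to one template location."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


# Hypothetical matched key points: template corners and where they appear in the image.
template_pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
image_pts = [(10, 20), (12, 20), (12, 22), (10, 22)]
h = fit_perspective(template_pts, image_pts)

# An annotation location stored in the template, mapped into the received image.
annotation_in_image = map_point(h, (0.5, 0.5))
print(annotation_in_image)
```

Data extraction would then read the region around `annotation_in_image` in the received image rather than the location stored in the template.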

Type: Application

Filed: June 14, 2023

Publication date: December 19, 2024

Inventors: Kai Zhang, Xiaoying Gao, Rajesh M. Desai, Sudhakar Basireddy

Line item detection in borderless tabular structured data

Patent number: 12056948

Abstract: In an approach, a processor identifies a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table. A processor classifies the plurality of text separators into a number of target clusters comprised in a target group based on property information related to the plurality of text separators, the number of target clusters corresponding to a number of separator types. A processor provides indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.
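One way to picture the separator classification in this abstract is the sketch below. It treats each text line as a vertical extent, derives the non-text gaps between consecutive lines, and splits the gaps into two clusters by height with a simple one-dimensional two-means loop. The choice of height as the only property, the fixed count of two separator types, and all names are assumptions made for illustration.

```python
def find_separators(lines):
    """Return the (top, bottom) gap between each pair of consecutive text lines.

    `lines` is a list of (top, bottom) y-extents sorted from top to bottom.
    """
    return [
        (prev_bottom, next_top)
        for (_, prev_bottom), (next_top, _) in zip(lines, lines[1:])
    ]


def classify_by_height(separators, iterations=10):
    """Label each separator 0 (small gap) or 1 (large gap) using 1-D two-means."""
    heights = [bottom - top for top, bottom in separators]
    if not heights:
        return []
    small_center, large_center = min(heights), max(heights)
    labels = [0] * len(heights)
    for _ in range(iterations):
        labels = [
            0 if abs(h - small_center) <= abs(h - large_center) else 1
            for h in heights
        ]
        small = [h for h, label in zip(heights, labels) if label == 0]
        large = [h for h, label in zip(heights, labels) if label == 1]
        if small:
            small_center = sum(small) / len(small)
        if large:
            large_center = sum(large) / len(large)
    return labels


# Hypothetical table: wrapped lines sit 2 units apart, line items 18 units apart.
lines = [(0, 10), (12, 22), (40, 50), (52, 62), (80, 90)]
separators = find_separators(lines)
labels = classify_by_height(separators)
print(list(zip(separators, labels)))
```

In this toy layout, label 1 would mark boundaries between line items and label 0 would mark wrapped text inside one item.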

Type: Grant

Filed: July 19, 2021

Date of Patent: August 6, 2024

Assignee: International Business Machines Corporation

Inventors: Ang Yi, Nazrul Islam, Rajesh M. Desai, Jing Zhang, Dong Rui Li, Xue Mei Deng, Ye Chen, Hai Cheng Wang

DOCUMENT IMAGE TEMPLATE MATCHING

Publication number: 20240193978

Abstract: Computer implemented methods, systems, and computer program products include program code executing on a processor(s) that merges a document comprising multiple pages into a single document image. The program code processes the single document image to identify structural elements and textual content. The program code compares the structural elements of the single document image to other structural elements of a group of document templates stored in a database to identify a subset of the group of documents templates with a threshold number of similarities to the single document image. The program code generates, from the single document image, a graph structure representing the document, where the graph structure comprises visual information and connections related to the structural elements and concepts comprising the textual content. The program code uses the structure to identify a document template that is a closest match to the document.

Type: Application

Filed: December 13, 2022

Publication date: June 13, 2024

Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Rajesh M. Desai, Yang Zhong Li, Ye Chen

TEXT BLOCK SEGMENTATION

Publication number: 20240046677

Abstract: A computer-implemented method for text block segmentation includes determining a first text block segmentation pattern utilized to generate a segmented text block based, at least in part, on a comparison of semantic information associated with the segmented text block and a plurality of predefined types of text block segmentation patterns indicated by a graph; calculating a first degree of confidence in a size of the segmented text block based, at least in part, on comparing semantic entities associated with the segmented text block with semantic entities indicated by leaf nodes stemming from a first non-leaf node included in the graph and representative of the first type of text block segmentation pattern; and determining that the size of the segmented text block is non-optimal based on the calculated degree of confidence in the size of the segmented text block being below a predetermined threshold.

Type: Application

Filed: July 26, 2022

Publication date: February 8, 2024

Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Rajesh M. Desai, Yang Zhong Li, Xue Xu

Multi-modal learning based intelligent enhancement of post optical character recognition error correction

Patent number: 11842524

Abstract: A mechanism is provided for implementing an optical character recognition (OCR) error correction mechanism for correcting OCR errors. Responsive to receiving a document in which OCR has been performed, the mechanism assesses the document to identify a set of OCR errors generated by an OCR engine that performed the OCR using a set of visual embeddings. Responsive to identifying the set of OCR errors, the mechanism analyzes each character of a plurality of sentences within the document to generate a high-dimensional embedding for the characters of the plurality of sentences within the document. The mechanism then linguistically corrects each OCR error in the set of OCR errors. The mechanism utilizes ground truth information and the set of visual embeddings to verify that a character stream is linguistically correct. Responsive to verifying that the character stream is linguistically correct, the mechanism outputs an OCR error corrected document to a user.

Type: Grant

Filed: April 30, 2021

Date of Patent: December 12, 2023

Assignee: International Business Machines Corporation

Inventors: Rajesh M. Desai, Ayush Utkarsh, Nazrul Islam, Praveen Vyas

Discovering latent custodians and documents in an E-discovery system

Patent number: 11829424

Abstract: Discovering second-order documents and latent custodians in an e-discovery system is provided. A list of first-order documents and document custodians within a base state of the e-discovery system are identified based on a plurality of terms corresponding to a meet and confer practice for a legal matter instance. The plurality of terms is masked within the first-order documents. The first-order documents having the plurality of terms masked are divided into groups. A list of second-order documents is generated from a group of documents. A list of second-order document custodians is generated based on corresponding custodian relationships to second-order documents. Finally, each second-order document custodian in the list of second-order document custodians that has a corresponding rank exceeding a defined rank threshold level is identified as an official document custodian in the e-discovery system.

Type: Grant

Filed: February 20, 2020

Date of Patent: November 28, 2023

Assignee: International Business Machines Corporation

Inventors: Roger C. Raphael, Rajesh M. Desai, Nazrul Islam, Magesh Jayapandian, Jojo Joseph

Dynamic data dissemination under declarative data subject constraints

Patent number: 11741258

Abstract: Dynamic data dissemination is provided. A resolved data subject identifier corresponding to a data subject is selected from a set of resolved data subject identifiers existing in rows of a data asset. In response to determining that the resolved data subject identifier does not correspond to a right to forget list, it is determined that the resolved data subject identifier corresponds to a data subject request list. The rows are transformed to anonymize existing pseudo and personal identifiers in cells of the rows that are tied to columns associated with data classes for which specific consent dimensions have been indicated as revoked by the data subject.
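A minimal sketch of the row transformation described in this abstract might look like the following. The column-to-data-class mapping, the placeholder token, the table layout, and the decision to withhold rows for subjects on the right-to-forget list are all assumptions; the abstract only states that cells in columns tied to revoked consent are anonymized.

```python
REDACTED = "<redacted>"  # hypothetical placeholder for an anonymized cell

# Hypothetical mapping from column name to data class.
COLUMN_CLASS = {"email": "contact", "phone": "contact", "city": "location"}


def disseminate(rows, forget_list, revoked_by_subject):
    """Return copies of `rows` with revoked-consent cells anonymized.

    rows: list of dicts, each carrying a resolved "subject_id".
    forget_list: subject ids with a right-to-forget request (assumed: row withheld).
    revoked_by_subject: subject id -> set of data classes with revoked consent.
    """
    output = []
    for row in rows:
        subject = row["subject_id"]
        if subject in forget_list:
            continue  # assumption: such rows are not disseminated at all
        revoked = revoked_by_subject.get(subject, set())
        new_row = dict(row)
        for column, data_class in COLUMN_CLASS.items():
            if data_class in revoked and column in new_row:
                new_row[column] = REDACTED
        output.append(new_row)
    return output


rows = [
    {"subject_id": "s1", "email": "a@example.com", "city": "Austin"},
    {"subject_id": "s2", "email": "b@example.com", "city": "Boston"},
    {"subject_id": "s3", "email": "c@example.com", "city": "Cairo"},
]
result = disseminate(
    rows,
    forget_list={"s3"},
    revoked_by_subject={"s1": {"contact"}},
)
print(result)
```

Here the subject `s1` has revoked consent for the assumed `contact` class, so only the `email` cell of that row is replaced, while `s2` passes through unchanged.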

Type: Grant

Filed: April 16, 2021

Date of Patent: August 29, 2023

Assignee: International Business Machines Corporation

Inventors: Roger C. Raphael, Rajesh M. Desai, Scott Schumacher, Angineh Aghakiant

Learning to transform sensitive data with variable distribution preservation

Patent number: 11647004

Abstract: Preserving distributions of data values of a data asset in a data anonymization operation is provided. Anonymizing data values is performed by transforming sensitive data in a set of columns over rows of the data asset while preserving distribution of the data values in the set of transformed columns to a defined degree using a set of autoencoders and loss function. The autoencoders are base trained from preexisting data in a data assets catalog and actively trained during data dissemination. Parametric coefficients of the loss function are configured and the threshold is generated using policies from an enforcement decision for the data asset and data consumer. The loss function value of a selected row is compared to the threshold. Transformed data values of the selected row are transcribed to an output row when the loss function value is greater than the threshold and disseminated to the data consumer.

Type: Grant

Filed: March 24, 2021

Date of Patent: May 9, 2023

Assignee: International Business Machines Corporation

Inventors: Arjun Natarajan, Ashish Kundu, Roger C. Raphael, Aniya Aggarwal, Rajesh M. Desai, Joshua F. Payne, Mu Qiao

LINE ITEM DETECTION IN BORDERLESS TABULAR STRUCTURED DATA

Publication number: 20230012784

Abstract: In an approach, a processor identifies a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table. A processor classifies the plurality of text separators into a number of target clusters comprised in a target group based on property information related to the plurality of text separators, the number of target clusters corresponding to a number of separator types. A processor provides indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.

Type: Application

Filed: July 19, 2021

Publication date: January 19, 2023

Inventors: Ang Yi, Nazrul Islam, Rajesh M. Desai, Jing Zhang, Dong Rui Li, Xue Mei Deng, Ye Chen, Hai Cheng Wang

Multi-Modal Learning Based Intelligent Enhancement of Post Optical Character Recognition Error Correction

Publication number: 20220350998

Abstract: A mechanism is provided for implementing an optical character recognition (OCR) error correction mechanism for correcting OCR errors. Responsive to receiving a document in which OCR has been performed, the mechanism assesses the document to identify a set of OCR errors generated by an OCR engine that performed the OCR using a set of visual embeddings. Responsive to identifying the set of OCR errors, the mechanism analyzes each character of a plurality of sentences within the document to generate a high-dimensional embedding for the characters of the plurality of sentences within the document. The mechanism then linguistically corrects each OCR error in the set of OCR errors. The mechanism utilizes ground truth information and the set of visual embeddings to verify that a character stream is linguistically correct. Responsive to verifying that the character stream is linguistically correct, the mechanism outputs an OCR error corrected document to a user.

Type: Application

Filed: April 30, 2021

Publication date: November 3, 2022

Inventors: Rajesh M. Desai, Ayush Utkarsh, Nazrul Islam, Praveen Vyas

Dynamic Data Dissemination Under Declarative Data Subject Constraint

Publication number: 20220335156

Abstract: Dynamic data dissemination is provided. A resolved data subject identifier corresponding to a data subject is selected from a set of resolved data subject identifiers existing in rows of a data asset. In response to determining that the resolved data subject identifier does not correspond to a right to forget list, it is determined that the resolved data subject identifier corresponds to a data subject request list. The rows are transformed to anonymize existing pseudo and personal identifiers in cells of the rows that are tied to columns associated with data classes for which specific consent dimensions have been indicated as revoked by the data subject.

Type: Application

Filed: April 16, 2021

Publication date: October 20, 2022

Inventors: Roger C. Raphael, Rajesh M. Desai, Scott Schumacher, Angineh Aghakiant

Learning to Transform Sensitive Data with Variable Distribution Preservation

Publication number: 20220311749

Abstract: Preserving distributions of data values of a data asset in a data anonymization operation is provided. Anonymizing data values is performed by transforming sensitive data in a set of columns over rows of the data asset while preserving distribution of the data values in the set of transformed columns to a defined degree using a set of autoencoders and loss function. The autoencoders are base trained from preexisting data in a data assets catalog and actively trained during data dissemination. Parametric coefficients of the loss function are configured and the threshold is generated using policies from an enforcement decision for the data asset and data consumer. The loss function value of a selected row is compared to the threshold. Transformed data values of the selected row are transcribed to an output row when the loss function value is greater than the threshold and disseminated to the data consumer.

Type: Application

Filed: March 24, 2021

Publication date: September 29, 2022

Inventors: Arjun Natarajan, Ashish Kundu, Roger C. Raphael, Aniya Aggarwal, Rajesh M. Desai, Joshua F. Payne, Mu Qiao

Managing large scale association sets using optimized bit map representations

Patent number: 11372831

Abstract: Processing a database query for sets of data includes assigning a unique identifier from an integer space to each entity within data and creating one or more sets of entities each pertaining to a corresponding entity within the data. A representation is then generated on disk for each set of entities, wherein each representation encompasses and is suited for a range of the unique identifiers of entities within a corresponding set and indicates a presence of an entity within that corresponding set. Finally, a query is processed based on the representation for each set of entities to retrieve data satisfying the query, wherein the representation provides a constant time for association and dissociation operations that are append-only operations with deferred merge and automatic filtering of deleted and duplicate entities at query time.
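The append-only association and dissociation with deferred merge described in this abstract can be approximated by the in-memory sketch below. It is a simplification under stated assumptions: the patent describes an on-disk representation suited to a range of identifiers, whereas this example keeps a single Python integer as the bitmap, and the class and method names are hypothetical.

```python
class AssociationSet:
    """Set of integer entity ids backed by a bitmap plus an append-only log."""

    def __init__(self):
        self._bitmap = 0  # merged state: bit i set means entity i is present
        self._log = []    # pending (entity_id, present) operations, append-only

    def associate(self, entity_id):
        """Record membership; a list append, so amortized constant time."""
        self._log.append((entity_id, True))

    def dissociate(self, entity_id):
        """Record removal; also just an append."""
        self._log.append((entity_id, False))

    def _effective_bitmap(self):
        """Replay the pending log over the merged bitmap without mutating it."""
        bitmap = self._bitmap
        for entity_id, present in self._log:
            if present:
                bitmap |= 1 << entity_id
            else:
                bitmap &= ~(1 << entity_id)
        return bitmap

    def merge(self):
        """Deferred merge: fold the log into the bitmap and clear the log."""
        self._bitmap = self._effective_bitmap()
        self._log.clear()

    def members(self):
        """Query-time view; duplicates and deleted ids drop out automatically."""
        bitmap = self._effective_bitmap()
        return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


s = AssociationSet()
s.associate(3)
s.associate(5)
s.associate(3)   # duplicate association
s.dissociate(5)  # later removal
print(s.members())
```

Because a bit can only be set or cleared, repeated associations collapse to one member and a later dissociation hides the entity at query time, which mirrors the filtering behavior the abstract mentions.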

Type: Grant

Filed: July 29, 2019

Date of Patent: June 28, 2022

Assignee: International Business Machines Corporation

Inventors: Rajesh M. Desai, Magesh Jayapandian, Iun V. Leong, Justo L. Perez, Roger C. Raphael, Gabriel Valencia

Real-time policy rule evaluation with multistage processing

Patent number: 11362997

Abstract: A method, apparatus, system, and computer program product evaluate an information asset with a corpus of policies in conjunction with the context of access including a specific user. A large corresponding set of rules in the policy corpus are identified by a computer system. A continuous process of rule evaluation occurs against information asset metadata wherein a series of processing including a set of common subexpressions between the predicates of all active rules, pre-evaluation, compaction and storage are identified by the computer system in the policy and rule corpus. Metadata for the information asset is applied by the computer system to the set of common subexpressions to form partially evaluated rules for the policy. The partially evaluated rules henceforth compacted are stored by the computer system in association with the information asset.

Type: Grant

Filed: October 16, 2019

Date of Patent: June 14, 2022

Assignee: International Business Machines Corporation

Inventors: Roger C. Raphael, Rajesh M. Desai, Iun Veng Leong, Brian Joseph Owings

Detecting and obfuscating sensitive data in unstructured text

Patent number: 11347891

Abstract: Disclosed is a computer-implemented method to identify and anonymize personal information, the method comprising analyzing a first corpus with a personal information sniffer, wherein the first corpus includes unstructured text, wherein the personal information sniffer is configured to detect a set of types of personal information, and wherein the personal information sniffer produces a first set of results. The method comprises analyzing the first corpus with a set of annotators, wherein each annotator is configured to identify all instances of a type of personal information in the corpus, and wherein the set of annotators produces a second set of results. The method comprises comparing the first set of results and the second set of results, determining that the first set of results does not match the second set of results, and updating, based on the determining, the personal information sniffer.

Type: Grant

Filed: June 19, 2019

Date of Patent: May 31, 2022

Assignee: International Business Machines Corporation

Inventors: Roger C. Raphael, Rajesh M. Desai, Iun Veng Leong, Ramakanta Samal, Ansel Blume

Detection of sensitive personal information in a storage device

Patent number: 11308235

Abstract: A method, system and computer program product for detecting sensitive personal information in a storage device. A block delta list containing a list of changed blocks in the storage device is processed. After identifying the changed blocks from the block delta list, a search is performed on those identified changed blocks for sensitive personal information using a character scanning technique. After identifying a changed block deemed to contain sensitive personal information, the changed block is translated from the block level to the file level using a hierarchical reverse mapping technique. By only analyzing the changed blocks to determine if they contain sensitive personal information, a lesser quantity of blocks needs to be processed in order to detect sensitive personal information in the storage device in near real-time. In this manner, sensitive personal information is detected in the storage device using fewer computing resources in a shorter amount of time.
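The changed-block scan in this abstract can be sketched as below. Several pieces are stand-ins chosen for illustration: the regular expression for a US Social Security number style pattern replaces the unspecified character scanning technique, a flat dictionary replaces the hierarchical reverse mapping from block to file, and matches that straddle a block boundary are ignored. All names are hypothetical.

```python
import re

# Assumed detector: digits shaped like a US Social Security number.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def files_with_sensitive_data(block_delta_list, read_block, block_to_file):
    """Scan only the changed blocks and report the files they belong to.

    block_delta_list: ids of blocks changed since the last scan.
    read_block: callable returning the bytes stored in a block.
    block_to_file: simplified reverse mapping from block id to file path.
    """
    flagged = set()
    for block_id in block_delta_list:
        if SSN_PATTERN.search(read_block(block_id)):
            flagged.add(block_to_file[block_id])
    return flagged


# Hypothetical device contents: block 2 holds a match but has not changed.
blocks = {0: b"hello world", 1: b"ssn 123-45-6789 end", 2: b"987-65-4321"}
block_to_file = {0: "a.txt", 1: "b.txt", 2: "c.txt"}
flagged = files_with_sensitive_data([0, 1], blocks.__getitem__, block_to_file)
print(flagged)
```

Only blocks 0 and 1 are read, so the cost of the scan follows the size of the delta list rather than the size of the device, which is the saving the abstract describes.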

Type: Grant

Filed: March 6, 2020

Date of Patent: April 19, 2022

Assignee: International Business Machines Corporation

Inventors: Rajesh M. Desai, Mu Qiao, Roger C. Raphael, Ramani Routray

Providing near real-time and effective litigation management for multiple remote content systems using asynchronous bi-directional replication pipelines

Patent number: 11250527

Abstract: Embodiments generally relate to providing litigation management for multiple remote content systems using asynchronous bi-directional replication pipelines. In some embodiments, a method includes retrieving, at one or more inbound replicators of one or more respective bi-directional pipelines, metadata associated with documents stored in one or more content repositories. The method further includes resolving, at a governance control hub, conflicts associated with legal holds on one or more of the documents based on the metadata. The method further includes sending conflict resolution results from one or more outbound applicators of the bi-directional pipelines to the content repositories, where the content repositories enforce legal holds on the documents.

Type: Grant

Filed: June 18, 2019

Date of Patent: February 15, 2022

Assignee: International Business Machines Corporation

Inventors: Roger C. Raphael, Ronald L. Rathgeber, Rajesh M. Desai, Gabriel Valencia, Justo Perez, William Russell Belknap, Sudhakar Basireddy
