Patents by Inventor Rajesh M. Desai
Rajesh M. Desai has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20250117405Abstract: A computer-implemented process for generating cluster templates used for creating extraction models includes the following operations. A plurality of training files associated with a selected class are received. An automated visual analysis is performed on each of the plurality of training files. An automated contextual analysis is performed on each of the plurality of training files. A first clustering of the plurality of training files into a first plurality of clusters using results from the automated visual analysis is performed. A second clustering of one of the plurality of clusters into a second plurality of clusters is performed using results from the automated contextual analysis. Cluster templates for the first and second plurality of clusters are generated.Type: ApplicationFiled: October 6, 2023Publication date: April 10, 2025Inventors: Shalin Avlani, Rajesh M. Desai, Mayank Vipin Shah, Xiaoying Gao
-
Patent number: 12259920Abstract: Disclosed embodiments provide techniques for monitoring and evaluating the effectiveness of key value pairs (KVPs) used in a document processing system. In embodiments, KVPs are obtained from multiple extractors of a document processing system. A score is computed for the KVPs by computing an effectiveness metric for each KVP from the multiple KVPs. In response to the computed score being below a predetermined threshold, a model retraining process is performed to generate a new set of KVP extractors, and provide the new set of KVPs to the document processing system.Type: GrantFiled: September 7, 2023Date of Patent: March 25, 2025Assignee: International Business Machines CorporationInventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Yang Zhong Li, Rajesh M. Desai, Xue Lan Zhang
-
Publication number: 20250086222Abstract: Disclosed embodiments provide techniques for monitoring and evaluating the effectiveness of key value pairs (KVPs) used in a document processing system. In embodiments, KVPs are obtained from multiple extractors of a document processing system. A score is computed for the KVPs by computing an effectiveness metric for each KVP from the multiple KVPs. In response to the computed score being below a predetermined threshold, a model retraining process is performed to generate a new set of KVP extractors, and provide the new set of KVPs to the document processing system.Type: ApplicationFiled: September 7, 2023Publication date: March 13, 2025Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Yang Zhong Li, Rajesh M. Desai, Xue Lan Zhang
-
Publication number: 20240420296Abstract: Mechanisms are provided for automated document image annotation and data extraction. A received document image is processed to identify a document type of the received document image, and retrieve a corresponding document template having key point location data and annotation location data for documents of the document type. First key points of the received document image are matched with second key points of the corresponding document template and a mapping is generated to map locations of the document template to locations of the received document image. A perspective transformation is applied, based on the mapping, to first annotation locations specified in the document template data structure to generate second annotation locations corresponding to locations in the received document image. Data extraction is performed on data associated with the second annotation locations based on the annotations corresponding to the second annotation locations.Type: ApplicationFiled: June 14, 2023Publication date: December 19, 2024Inventors: Kai Zhang, Xiaoying Gao, Rajesh M. Desai, Sudhakar Basireddy
-
Patent number: 12056948Abstract: In an approach, a processor identifies a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table. A processor classifies the plurality of text separators into a number of target clusters comprised in a target group based on property information related to the plurality of text separators, the number of target clusters corresponding to a number of separator types. A processor provides indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.Type: GrantFiled: July 19, 2021Date of Patent: August 6, 2024Assignee: International Business Machines CorporationInventors: Ang Yi, Nazrul Islam, Rajesh M. Desai, Jing Zhang, Dong Rui Li, Xue Mei Deng, Ye Chen, Hai Cheng Wang
-
Publication number: 20240193978Abstract: Computer implemented methods, systems, and computer program products include program code executing on a processor(s) that merges a document comprising multiple pages into a single document image. The program code processes the single document image to identify structural elements and textual content. The program code compares the structural elements of the single document image to other structural elements of a group of document templates stored in a database to identify a subset of the group of documents templates with a threshold number of similarities to the single document image. The program code generates, from the single document image, a graph structure representing the document, where the graph structure comprises visual information and connections related to the structural elements and concepts comprising the textual content. The program code uses the structure to identify a document template that is a closest match to the document.Type: ApplicationFiled: December 13, 2022Publication date: June 13, 2024Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Rajesh M. Desai, Yang Zhong Li, Ye Chen
-
Publication number: 20240046677Abstract: A computer-implemented method for text block segmentation includes determining a first text block segmentation pattern utilized to generate a segmented text block based, at least in part, on a comparison of semantic information associated with the segmented text block and a plurality of predefined types of text block segmentation patterns indicated by a graph; calculating a first degree of confidence in a size of the segmented text block based, at least in part, on comparing semantic entities associated with the segmented text block with semantic entities indicated by leaf nodes stemming from a first non-leaf node included in the graph and representative of the first type of text block segmentation pattern; and determining that the size of the segmented text block is non-optimal based on the calculated degree of confidence in the size of the segmented text block being below a predetermined threshold.Type: ApplicationFiled: July 26, 2022Publication date: February 8, 2024Inventors: Ang Yi, Jing Zhang, Hai Cheng Wang, Jun Hong Zhao, Rajesh M. Desai, Yang Zhong Li, Xue Xu
-
Patent number: 11842524Abstract: A mechanism is provided for implementing an optical character recognition (OCR) error correction mechanism for correcting OCR errors. Responsive to receiving a document in which OCR has been performed, the mechanism assesses the document to identify a set of OCR errors generated by an OCR engine that performed the OCR using a set of visual embeddings. Responsive to identifying the set of OCR errors, the mechanism analyzes each character of a plurality of sentences within the document to generate a high-dimensional embedding for the characters of the plurality of sentences within the document. The mechanism then linguistically corrects each OCR error in the set of OCR error. The mechanism utilizes ground truth information and the set of visual embeddings to verify that character stream is linguistically correct. Responsive to verifying that the character stream is linguistically correct, the mechanism outputs an OCR error corrected document to a user.Type: GrantFiled: April 30, 2021Date of Patent: December 12, 2023Assignee: International Business Machines CorporationInventors: Rajesh M. Desai, Ayush Utkarsh, Nazrul Islam, Praveen Vyas
-
Patent number: 11829424Abstract: Discovering second-order documents and latent custodians in an e-discovery system is provided. A list of first-order documents and document custodians within a base state of the e-discovery system are identified based on a plurality of terms corresponding to a meet and confer practice for a legal matter instance. The plurality of terms is masked within the first-order documents. The first-order documents having the plurality of terms masked are divided into groups. A list of second-order documents is generated from a group of documents. A list of second-order document custodians is generated based on corresponding custodian relationships to second-order documents. Finally, each second-order document custodian in the list of second-order document custodians that has a corresponding rank exceeding a defined rank threshold level is identified as an official document custodian in the e-discovery system.Type: GrantFiled: February 20, 2020Date of Patent: November 28, 2023Assignee: International Business Machines CorporationInventors: Roger C. Raphael, Rajesh M. Desai, Nazrul Islam, Magesh Jayapandian, Jojo Joseph
-
Patent number: 11741258Abstract: Dynamic data dissemination is provided. A resolved data subject identifier corresponding to a data subject is selected from a set of resolved data subject identifiers existing in rows of a data asset. In response to determining that the resolved data subject identifier does not correspond to a right to forget list, it is determined that the resolved data subject identifier corresponds to a data subject request list. The rows are transformed to anonymize existing pseudo and personal identifiers in cells of the rows that are tied to columns associated with data classes for which specific consent dimensions have been indicated as revoked by the data subject.Type: GrantFiled: April 16, 2021Date of Patent: August 29, 2023Assignee: International Business Machines CorporationInventors: Roger C. Raphael, Rajesh M. Desai, Scott Schumacher, Angineh Aghakiant
-
Patent number: 11647004Abstract: Preserving distributions of data values of a data asset in a data anonymization operation is provided. Anonymizing data values is performed by transforming sensitive data in a set of columns over rows of the data asset while preserving distribution of the data values in the set of transformed columns to a defined degree using a set of autoencoders and loss function. The autoencoders are base trained from preexisting data in a data assets catalog and actively trained during data dissemination. Parametric coefficients of the loss function are configured and the threshold is generated using policies from an enforcement decision for the data asset and data consumer. The loss function value of a selected row is compared to the threshold. Transformed data values of the selected row are transcribed to an output row when the loss function value is greater than the threshold and disseminated to the data consumer.Type: GrantFiled: March 24, 2021Date of Patent: May 9, 2023Assignee: International Business Machines CorporationInventors: Arjun Natarajan, Ashish Kundu, Roger C. Raphael, Aniya Aggarwal, Rajesh M. Desai, Joshua F. Payne, Mu Qiao
-
Publication number: 20230012784Abstract: In an approach, a processor identifies a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table. A processor classifies the plurality of text separators into a number of target clusters comprised in a target group based on property information related to the plurality of text separators, the number of target clusters corresponding to a number of separator types. A processor provides indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.Type: ApplicationFiled: July 19, 2021Publication date: January 19, 2023Inventors: Ang Yi, Nazrul Islam, Rajesh M. Desai, Jing Zhang, Dong Rui Li, Xue Mei Deng, Ye Chen, Hai Cheng Wang
-
Publication number: 20220350998Abstract: A mechanism is provided for implementing an optical character recognition (OCR) error correction mechanism for correcting OCR errors. Responsive to receiving a document in which OCR has been performed, the mechanism assesses the document to identify a set of OCR errors generated by an OCR engine that performed the OCR using a set of visual embeddings. Responsive to identifying the set of OCR errors, the mechanism analyzes each character of a plurality of sentences within the document to generate a high-dimensional embedding for the characters of the plurality of sentences within the document. The mechanism then linguistically corrects each OCR error in the set of OCR error. The mechanism utilizes ground truth information and the set of visual embeddings to verify that character stream is linguistically correct. Responsive to verifying that the character stream is linguistically correct, the mechanism outputs an OCR error corrected document to a user.Type: ApplicationFiled: April 30, 2021Publication date: November 3, 2022Inventors: Rajesh M. Desai, Ayush Utkarsh, Nazrul Islam, Praveen Vyas
-
Publication number: 20220335156Abstract: Dynamic data dissemination is provided. A resolved data subject identifier corresponding to a data subject is selected from a set of resolved data subject identifiers existing in rows of a data asset. In response to determining that the resolved data subject identifier does not correspond to a right to forget list, it is determined that the resolved data subject identifier corresponds to a data subject request list. The rows are transformed to anonymize existing pseudo and personal identifiers in cells of the rows that are tied to columns associated with data classes for which specific consent dimensions have been indicated as revoked by the data subject.Type: ApplicationFiled: April 16, 2021Publication date: October 20, 2022Inventors: Roger C. Raphael, Rajesh M. Desai, Scott Schumacher, Angineh Aghakiant
-
Publication number: 20220311749Abstract: Preserving distributions of data values of a data asset in a data anonymization operation is provided. Anonymizing data values is performed by transforming sensitive data in a set of columns over rows of the data asset while preserving distribution of the data values in the set of transformed columns to a defined degree using a set of autoencoders and loss function. The autoencoders are base trained from preexisting data in a data assets catalog and actively trained during data dissemination. Parametric coefficients of the loss function are configured and the threshold is generated using policies from an enforcement decision for the data asset and data consumer. The loss function value of a selected row is compared to the threshold. Transformed data values of the selected row are transcribed to an output row when the loss function value is greater than the threshold and disseminated to the data consumer.Type: ApplicationFiled: March 24, 2021Publication date: September 29, 2022Inventors: Arjun Natarajan, ASHISH KUNDU, Roger C. Raphael, Aniya Aggarwal, Rajesh M. Desai, Joshua F. Payne, Mu Qiao
-
Patent number: 11372831Abstract: Processing a database query for sets of data includes assigning a unique identifier from an integer space to each entity within data and creating one or more sets of entities each pertaining to a corresponding entity within the data. A representation is then generated on disk for each set of entities, wherein each representation encompasses and is suited for a range of the unique identifiers of entities within a corresponding set and indicates a presence of an entity within that corresponding set. Finally, a query is processed based on the representation for each set of entities to retrieve data satisfying the query, wherein the representation provides a constant time for association and dissociation operations that are append-only operations with deferred merge and automatic filtering of deleted and duplicate entities at query time.Type: GrantFiled: July 29, 2019Date of Patent: June 28, 2022Assignee: International Business Machines CorporationInventors: Rajesh M. Desai, Magesh Jayapandian, Iun V. Leong, Justo L. Perez, Roger C. Raphael, Gabriel Valencia
-
Patent number: 11362997Abstract: A method, apparatus, system, and computer program product evaluate an information asset with a corpus of policies in conjunction with the context of access including a specific user. A large corresponding set of rules in the policy corpus are identified by computer system. A continuous process of rule evaluation occurs against information asset metadata wherein a series of processing including set of common subexpressions between the predicates of all active rules, pre-evaluation, compaction and storage are identified by the computer system in the policy and rule corpus. Metadata for the information asset is applied by the computer system to the set of common subexpressions to form partially evaluated rules for the policy. The partially evaluated rules henceforth compacted are stored by the computer system in association with the information asset.Type: GrantFiled: October 16, 2019Date of Patent: June 14, 2022Assignee: International Business Machines CorporationInventors: Roger C. Raphael, Rajesh M. Desai, Iun Veng Leong, Brian Joseph Owings
-
Patent number: 11347891Abstract: Disclosed is a computer-implemented method to identify and anonymize personal information, the method comprising analyzing a first corpus with a personal information sniffer, wherein the first corpus includes unstructured text, wherein the personal information sniffer is configured to detect a set of types of personal information, and wherein the personal information sniffer produces a first set of results. The method comprises analyzing the first corpus with a set of annotators, wherein each annotator is configured to identify all instances of a type of personal information in the corpus, and wherein the set of annotators produces a second set of results. The method comprises comparing the first set of results and the second set of results, determining, the first set of results does not match the second set of results, and updating, based on the determining, the personal information sniffer.Type: GrantFiled: June 19, 2019Date of Patent: May 31, 2022Assignee: International Business Machines CorporationInventors: Roger C. Raphael, Rajesh M. Desai, Iun Veng Leong, Ramakanta Samal, Ansel Blume
-
Patent number: 11308235Abstract: A method, system and computer program product for detecting sensitive personal information in a storage device. A block delta list containing a list of changed blocks in the storage device is processed. After identifying the changed blocks from the block delta list, a search is performed on those identified changed blocks for sensitive personal information using a character scanning technique. After identifying a changed block deemed to contain sensitive personal information, the changed block is translated from the block level to the file level using a hierarchical reverse mapping technique. By only analyzing the changed blocks to determine if they contain sensitive personal information, a lesser quantity of blocks needs to be processed in order to detect sensitive personal information in the storage device in near real-time. In this manner, sensitive personal information is detected in the storage device using fewer computing resources in a shorter amount of time.Type: GrantFiled: March 6, 2020Date of Patent: April 19, 2022Assignee: International Business Machines CorporationInventors: Rajesh M. Desai, Mu Qiao, Roger C. Raphael, Ramani Routray
-
Patent number: 11250527Abstract: Embodiments generally relate to providing litigation management for multiple remote content systems using asynchronous bi-directional replication pipelines. In some embodiments, a method includes retrieving, at one or more inbound replicators of one or more respective bi-directional pipelines, metadata associated with documents stored in one or more content repositories. The method further includes resolving, at a governance control hub, conflicts associated with legal holds on one or more of the documents based on the metadata. The method further includes sending conflict resolution results from one or more outbound applicators of the bi-directional pipelines to the content repositories, where the content repositories enforce legal holds on the documents.Type: GrantFiled: June 18, 2019Date of Patent: February 15, 2022Assignee: International Business Machines CorporationInventors: Roger C. Raphael, Ronald L. Rathgeber, Rajesh M. Desai, Gabriel Valencia, Justo Perez, William Russell Belknap, Sudhakar Basireddy