Patents by Inventor Marcia Lucas Pesce

Marcia Lucas Pesce has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Automatic indexing of relevant domains in a data lake for data discovery and integration

Patent number: 11120031

Abstract: Techniques are provided for data discovery and data integration in a data lake. One method comprises obtaining data files from a data lake, wherein each data file comprises multiple records having multiple fields; selecting multiple candidate fields from a data file based on a record type; determining a relevance score for each candidate field from the data file based on multiple features extracted from the data file; and clustering the scored candidate fields into clusters of similar domains using a hashing algorithm, wherein a given cluster comprises candidate fields, wherein multiple data files can be integrated based on a domain of the candidate fields in the given cluster. The relevance score for each candidate field is based on multiple features comprising, for example, features that take into account a morphological or semantic similarity between file name, file metadata and/or file records and features that consider statistics of candidate fields in a data file.
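
The steps named in the abstract (select candidate fields, score them, then cluster the scored fields by hashing) can be illustrated with a minimal sketch. The abstract does not say which hashing algorithm or which features are used, so everything specific here is an assumption: MinHash signatures with banding stand in for the hashing step, the toy score combines a field-name match with a distinct-value ratio, and the function names, weights, and thresholds are hypothetical.

```python
import hashlib
from collections import defaultdict


def relevance_score(field_name, values, file_name):
    """Toy score (assumed features): field name found in the file name,
    plus the ratio of distinct values in the field."""
    name_hit = 1.0 if field_name.lower() in file_name.lower() else 0.0
    distinct_ratio = len(set(values)) / len(values) if values else 0.0
    return 0.5 * name_hit + 0.5 * distinct_ratio


def minhash_signature(values, num_hashes=8):
    """MinHash signature over the set of values held by one field."""
    unique = set(values)
    return tuple(
        min(int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in unique)
        for seed in range(num_hashes)
    )


def cluster_fields(fields, min_score=0.3, num_hashes=8, band_size=2):
    """fields: list of (file_name, field_name, values) tuples.

    Fields scoring below min_score are dropped. The remaining fields are
    grouped whenever they share at least one band of their signatures.
    Returns a sorted list of clusters, each a sorted list of
    (file_name, field_name) pairs.
    """
    buckets = defaultdict(set)
    for file_name, field_name, values in fields:
        if not values:
            continue  # nothing to hash
        if relevance_score(field_name, values, file_name) < min_score:
            continue
        sig = minhash_signature(values, num_hashes)
        for start in range(0, num_hashes, band_size):
            band = (start, sig[start:start + band_size])
            buckets[band].add((file_name, field_name))
    # Several bands can yield the same group, so deduplicate.
    clusters = {frozenset(m) for m in buckets.values() if len(m) > 1}
    return sorted(sorted(c) for c in clusters)
```

Fields that land in the same cluster are candidates for joining their files on that domain. A production system would more likely use a tested locality-sensitive hashing library and a richer feature set than this sketch.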

Type: Grant

Filed: October 31, 2019

Date of Patent: September 14, 2021

Assignee: EMC IP Holding Company LLC

Inventors: Adriana Bechara Prado, Vítor Silva Sousa, Marcia Lucas Pesce, Paulo de Figueiredo Pires, Fábio André Machado Porto, Altobelli de Brito Mantuan, Rodolpho Rosa da Silva, Wagner dos Santos Vieira

Automatic Indexing of Relevant Domains in a Data Lake for Data Discovery and Integration

Publication number: 20210133189

Abstract: Techniques are provided for data discovery and data integration in a data lake. One method comprises obtaining data files from a data lake, wherein each data file comprises multiple records having multiple fields; selecting multiple candidate fields from a data file based on a record type; determining a relevance score for each candidate field from the data file based on multiple features extracted from the data file; and clustering the scored candidate fields into clusters of similar domains using a hashing algorithm, wherein a given cluster comprises candidate fields, wherein multiple data files can be integrated based on a domain of the candidate fields in the given cluster. The relevance score for each candidate field is based on multiple features comprising, for example, features that take into account a morphological or semantic similarity between file name, file metadata and/or file records and features that consider statistics of candidate fields in a data file.

Type: Application

Filed: October 31, 2019

Publication date: May 6, 2021

Inventors: Adriana Bechara Prado, Vítor Silva Sousa, Marcia Lucas Pesce, Paulo de Figueiredo Pires, Fábio André Machado Porto, Altobelli de Brito Mantuan, Rodolpho Rosa da Silva, Wagner dos Santos Vieira
