Patents by Inventor Khaled El Emam

Khaled El Emam has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

DETERMINING JOURNALIST RISK OF A DATASET USING POPULATION EQUIVALENCE CLASS DISTRIBUTION ESTIMATION

Publication number: 20230307104

Abstract: Methods and systems to de-identify a longitudinal dataset of personal records based on journalistic risk computed from a sample set of the personal records, including determining a similarity distribution of the sample set based on quasi-identifiers of the respective personal records, converting the similarity distribution of the sample set to an equivalence class distribution, and computing journalistic risk based on the equivalence distribution. In an embodiment, multiple similarity measures are determined for a personal record based on comparisons with multiple combinations of other personal records of the sample set, and an average of the multiple similarity measures is rounded. In an embodiment, similarity measures are determined for a subset of the sample set and, for each similarity measure, the number of records having the similarity measure is projected to the subset of personal records. Journalistic risk may be computed for multiple types of attacks.
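
As a rough illustration of the equivalence class distribution mentioned in this abstract, the sketch below groups records by their quasi-identifier values, counts how many classes exist at each size, and derives a maximum and an average risk by taking each record's risk as 1 / class size. The function names, the sample records, and the 1 / class size convention are assumptions made for illustration. The abstract estimates a population distribution from similarity measures on a sample, and that estimation step is not reproduced here.

```python
from collections import Counter


def equivalence_class_distribution(records, quasi_identifiers):
    """Map each equivalence class size to the number of classes of that size."""
    # Records sharing the same quasi-identifier values form one equivalence class.
    class_sizes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return Counter(class_sizes.values())


def risk_from_distribution(distribution):
    """Return (maximum, average) per-record risk, taking risk as 1 / class size."""
    total_records = sum(size * count for size, count in distribution.items())
    total_classes = sum(distribution.values())
    max_risk = 1 / min(distribution)
    # Averaging 1/size over all records reduces to classes / records.
    avg_risk = total_classes / total_records
    return max_risk, avg_risk


# Hypothetical sample with two quasi-identifiers.
sample = [
    {"age": 30, "zip": "K1A"},
    {"age": 30, "zip": "K1A"},
    {"age": 30, "zip": "K1A"},
    {"age": 41, "zip": "K2B"},
    {"age": 41, "zip": "K2B"},
    {"age": 55, "zip": "K3C"},
]
distribution = equivalence_class_distribution(sample, ["age", "zip"])
max_risk, avg_risk = risk_from_distribution(distribution)
```

Applied directly to a sample, these figures describe the sample itself. A journalist-risk figure would instead use class sizes estimated for the population from which the sample was drawn.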

Type: Application

Filed: May 26, 2023

Publication date: September 28, 2023

Inventors: Stephen Korte, Luk Arbuckle, Andrew Baker, Khaled El Emam, Sean Rose

SYSTEMS AND METHODS OF DATA TRANSFORMATION FOR DATA POOLING

Publication number: 20230237196

Abstract: A data anonymization pipeline system for managing holding and pooling data is disclosed. The data anonymization pipeline system transforms personal data at a source and then stores the transformed data in a safe environment. Furthermore, a re-identification risk assessment is performed before providing access to a user to fetch the de-identified data for secondary purposes.

Type: Application

Filed: March 30, 2023

Publication date: July 27, 2023

Inventors: Lon Michel Luk Arbuckle, Jordan Elijah Collins, Khaldoun Zine El Abidine, Khaled El Emam

Determining journalist risk of a dataset using population equivalence class distribution estimation

Patent number: 11664098

Abstract: Methods and systems to de-identify a longitudinal dataset of personal records based on journalistic risk computed from a sample set of the personal records, including determining a similarity distribution of the sample set based on quasi-identifiers of the respective personal records, converting the similarity distribution of the sample set to an equivalence class distribution, and computing journalistic risk based on the equivalence distribution. In an embodiment, multiple similarity measures are determined for a personal record based on comparisons with multiple combinations of other personal records of the sample set, and an average of the multiple similarity measures is rounded. In an embodiment, similarity measures are determined for a subset of the sample set and, for each similarity measure, the number of records having the similarity measure is projected to the subset of personal records. Journalistic risk may be computed for multiple types of attacks.

Type: Grant

Filed: December 23, 2021

Date of Patent: May 30, 2023

Assignee: PRIVACY ANALYTICS INC.

Inventors: Stephen Korte, Luk Arbuckle, Andrew Baker, Khaled El Emam, Sean Rose

Systems and methods of data transformation for data pooling

Patent number: 11620408

Abstract: A data anonymization pipeline system for managing holding and pooling data is disclosed. The data anonymization pipeline system transforms personal data at a source and then stores the transformed data in a safe environment. Furthermore, a re-identification risk assessment is performed before providing access to a user to fetch the de-identified data for secondary purposes.

Type: Grant

Filed: March 27, 2020

Date of Patent: April 4, 2023

Assignee: Privacy Analytics Inc.

Inventors: Lon Michel Luk Arbuckle, Jordan Elijah Collins, Khaldoun Zine El Abidine, Khaled El Emam

GEO-CLUSTERING FOR DATA DE-IDENTIFICATION

Publication number: 20220293280

Abstract: Methods and systems to de-identify data records, including to merge pairs of clusters of data records of individuals until a number of data records of each cluster meets a minimum size threshold, de-identify the clusters when each cluster meets the minimum size threshold, assess a risk of re-identification of the de-identified clusters based on k-anonymity, increase the minimum size threshold and re-perform the merge, the de-identify, and the assess a risk, if the assessed risk does not meet a risk criterion, and present the de-identified clusters on a display when the assessed risk meets the risk criterion.

Type: Application

Filed: May 31, 2022

Publication date: September 15, 2022

Inventors: Andrew Richard Baker, Khaled El Emam

SYSTEM AND METHOD FOR GENERATING SYNTHETIC LONGITUDINAL DATA

Publication number: 20220238231

Abstract: Longitudinal data can be synthesized by first generating baseline characteristics and first event values for a plurality of synthetic individuals. The baseline characteristics and first event values are used to synthesize a plurality of subsequent events.

Type: Application

Filed: January 25, 2022

Publication date: July 28, 2022

Inventors: Khaled EL EMAM, Lucy Mosquera, Cem Subakan

Geo-clustering for data de-identification

Patent number: 11380441

Abstract: The present disclosure is related to a method of geo-clustering of data for de-identification of a dataset. The method includes generating a plurality of geoclusters based on a plurality of geocodes. The geocodes may include ZIP codes or postal codes. The method further includes identifying the geoclusters having the smallest population. The geocluster having the smallest population is iteratively merged with the nearest geocluster until a minimum population threshold is met. Once the smallest geocluster meets the minimum population threshold, the plurality of geoclusters can be used to cluster the geocodes within a dataset to be de-identified.
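
The merge loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: each geocode is given a population and a planar (x, y) centre, "nearest" is taken to mean Euclidean distance between cluster centres, and a merged cluster's centre is the population-weighted mean of the two it replaces. None of these choices, nor the function name and sample data, come from the patent.

```python
import math


def merge_geoclusters(geocodes, min_population):
    """Merge the smallest geocluster into its nearest neighbour until every
    cluster meets min_population (or only one cluster is left).

    geocodes maps a code to (population, (x, y)); the coordinates are assumed.
    """
    clusters = [
        {"codes": [code], "population": pop, "center": center}
        for code, (pop, center) in geocodes.items()
    ]
    while len(clusters) > 1:
        smallest = min(clusters, key=lambda c: c["population"])
        if smallest["population"] >= min_population:
            break  # the smallest cluster meets the threshold, so all do
        others = [c for c in clusters if c is not smallest]
        nearest = min(
            others, key=lambda c: math.dist(smallest["center"], c["center"])
        )
        total = smallest["population"] + nearest["population"]
        if total > 0:
            # Population-weighted centre (an assumption, not from the source).
            center = tuple(
                (a * smallest["population"] + b * nearest["population"]) / total
                for a, b in zip(smallest["center"], nearest["center"])
            )
        else:
            center = tuple(
                (a + b) / 2 for a, b in zip(smallest["center"], nearest["center"])
            )
        merged = {
            "codes": smallest["codes"] + nearest["codes"],
            "population": total,
            "center": center,
        }
        clusters = [c for c in others if c is not nearest] + [merged]
    return clusters


# Hypothetical geocodes: code -> (population, (x, y)).
example = {
    "A": (100, (0.0, 0.0)),
    "B": (50, (1.0, 0.0)),
    "C": (500, (10.0, 0.0)),
    "D": (30, (9.0, 0.0)),
}
result = merge_geoclusters(example, min_population=120)
```

The resulting clusters can then serve as a lookup from each original geocode to its merged geocluster when de-identifying a dataset.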

Type: Grant

Filed: May 10, 2017

Date of Patent: July 5, 2022

Assignee: PRIVACY ANALYTICS INC.

Inventors: Andrew Richard Baker, Khaled El Emam

DETERMINING JOURNALIST RISK OF A DATASET USING POPULATION EQUIVALENCE CLASS DISTRIBUTION ESTIMATION

Publication number: 20220115101

Abstract: Methods and systems to de-identify a longitudinal dataset of personal records based on journalistic risk computed from a sample set of the personal records, including determining a similarity distribution of the sample set based on quasi-identifiers of the respective personal records, converting the similarity distribution of the sample set to an equivalence class distribution, and computing journalistic risk based on the equivalence distribution. In an embodiment, multiple similarity measures are determined for a personal record based on comparisons with multiple combinations of other personal records of the sample set, and an average of the multiple similarity measures is rounded. In an embodiment, similarity measures are determined for a subset of the sample set and, for each similarity measure, the number of records having the similarity measure is projected to the subset of personal records. Journalistic risk may be computed for multiple types of attacks.

Type: Application

Filed: December 23, 2021

Publication date: April 14, 2022

Inventors: Stephen Korte, Luk Arbuckle, Andrew Baker, Khaled El Emam, Sean Rose

RE-IDENTIFICATION RISK ASSESSMENT USING A SYNTHETIC ESTIMATOR

Publication number: 20220050917

Abstract: A risk of re-identifying a particular individual associated with a record in a dataset can be assessed by synthesizing a dataset from a dataset to be shared and then sampling a synthetic microdata dataset from the synthetic dataset. The synthetic dataset and the synthetic microdata dataset can then be used to estimate the risk of re-identifying an individual from the dataset to be shared.

Type: Application

Filed: August 12, 2021

Publication date: February 17, 2022

Inventors: Yangdi JIANG, Bei JIANG, Linglong KONG, Khaled EL EMAM

Determining journalist risk of a dataset using population equivalence class distribution estimation

Patent number: 11238960

Abstract: A system, method and computer readable memory for determining journalist risk of a dataset using population equivalence class distribution estimation. The dataset may be a cross-sectional dataset or a longitudinal dataset. The risk of identification can be determined and used in a de-identification process of the dataset.

Type: Grant

Filed: November 27, 2015

Date of Patent: February 1, 2022

Assignee: Privacy Analytics Inc.

Inventors: Stephen Korte, Luk Arbuckle, Andrew Baker, Khaled El Emam, Sean Rose

OPTIMIZING GENERATION OF SYNTHETIC DATA

Publication number: 20210374128

Abstract: Synthetic data may be used in place of an original dataset to avoid or mitigate disclosure risks pertaining to information of the original dataset. Synthetic data may be generated by optimizing a variable ordering used by a sequential tree generation method. The loss function used in optimizing may be based on a distinguishability between the source data and generated synthetic data.

Type: Application

Filed: June 1, 2021

Publication date: December 2, 2021

Inventors: Khaled EL EMAM, Lucy MOSQUERA, Chaoyi ZHENG

SYSTEMS AND METHOD FOR EVALUATING IDENTITY DISCLOSURE RISKS IN SYNTHETIC PERSONAL DATA

Publication number: 20210326475

Abstract: Although synthetic data synthesized from real sample data may not have a direct matching between synthetic data and individuals, there may still be a risk of identity disclosure. The identity disclosure risks associated with fully synthetic data may be assessed.

Type: Application

Filed: April 19, 2021

Publication date: October 21, 2021

Inventors: Khaled EL EMAM, Lucy MOSQUERA

SYSTEMS AND METHODS OF DATA TRANSFORMATION FOR DATA POOLING

Publication number: 20200311308

Abstract: A data anonymization pipeline system for managing holding and pooling data is disclosed. The data anonymization pipeline system transforms personal data at a source and then stores the transformed data in a safe environment. Furthermore, a re-identification risk assessment is performed before providing access to a user to fetch the de-identified data for secondary purposes.

Type: Application

Filed: March 27, 2020

Publication date: October 1, 2020

Inventors: Lon Michel Luk Arbuckle, Jordan Elijah Collins, Khaldoun Zine El Abidine, Khaled El Emam

Re-identification risk measurement estimation of a dataset

Patent number: 10685138

Abstract: There is provided a system and method executed by a processor for estimating re-identification risk of a single individual in a dataset. The individual, subject or patient is described by a data subject profile such as a record in the dataset. A population distribution is retrieved from a storage device; the population distribution is determined by one or more quasi-identifying fields identified in the data subject profile. An information score is then assigned to each quasi-identifying (QI) value of the one or more quasi-identifying fields associated with the data subject profile. The assigned information scores of the quasi-identifying values for the data subject profile are aggregated into an aggregated information value. An anonymity value is then calculated from the aggregated information value and a size of a population associated with the dataset. A re-identification metric for the individual is then calculated from the anonymity value.
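
The sequence of steps in this abstract (information score per value, aggregation, anonymity value, re-identification metric) can be illustrated with a short sketch. The abstract does not give formulas, so everything numeric here is an assumption: the information score is taken as the self-information -log2(p) of a value's population frequency p, scores are aggregated by summation (which treats the fields as independent), the anonymity value is log2(population size) minus the aggregated information, and the metric is min(1, 2^-anonymity). The function name and sample figures are hypothetical.

```python
import math


def reidentification_risk(profile, population_distribution, population_size):
    """Return (information, anonymity, risk) for one data subject profile.

    profile maps a quasi-identifying field to the subject's value;
    population_distribution maps field -> value -> population frequency.
    All formulas are illustrative assumptions, not taken from the patent.
    """
    information = 0.0
    for field, value in profile.items():
        frequency = population_distribution[field][value]
        information += -math.log2(frequency)  # rarer values carry more bits
    anonymity = math.log2(population_size) - information
    risk = min(1.0, 2.0 ** -anonymity)
    return information, anonymity, risk


# Hypothetical population frequencies for two quasi-identifying fields.
frequencies = {
    "sex": {"F": 0.5, "M": 0.5},
    "birth_decade": {"1980s": 0.25, "1990s": 0.75},
}
information, anonymity, risk = reidentification_risk(
    {"sex": "F", "birth_decade": "1980s"}, frequencies, population_size=1024
)
```

Under these assumptions the risk equals one over the expected number of people in the population who share the profile, capped at 1.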

Type: Grant

Filed: April 1, 2016

Date of Patent: June 16, 2020

Assignee: PRIVACY ANALYTICS INC.

Inventors: Martin Scaiano, Stephen Korte, Andrew Baker, Geoffrey Green, Khaled El Emam, Luk Arbuckle

Methods and systems for watermarking of anonymized datasets

Patent number: 10424406

Abstract: A method includes receiving an initial dataset. Each record of the initial dataset comprises a set of quasi-identifier attributes and a set of non-quasi-identifier attributes. A processor assigns a link identifier to each record and replaces each set of quasi-identifier attributes with a range to form a generalized set. The processor removes duplicate records based on identical generalized sets to generate de-duplicated records. The processor generates a randomized record by replacing the generalized set of each de-duplicated record with a corresponding set of random values. The processor passes the set of random values of each randomized record through multiple hash functions to generate multiple outputs. The multiple outputs are mapped to a Bloom filter. The processor forms a dataset by combining each randomized record with one or more sets of non-quasi-identifier attributes. The set of random values is a fingerprint for a corresponding record of the dataset.
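
The step in which a record's random values are passed through multiple hash functions and mapped to a Bloom filter can be sketched as below. This is an illustration only: the filter size, the number of hash functions, the use of seeded SHA-256 as the hash family, and the 16-byte random fingerprint are all assumptions, and the class and variable names are hypothetical.

```python
import hashlib
import secrets


class BloomFilter:
    """Minimal Bloom filter; one byte per bit position for readability."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, item):
        # Derive several hash functions by prefixing a one-byte seed (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        return all(self.bits[position] for position in self._positions(item))


# A random value standing in for the set of random values that replaces a
# record's generalized quasi-identifiers and acts as its fingerprint.
fingerprint = secrets.token_bytes(16)

watermark = BloomFilter()
watermark.add(fingerprint)
empty = BloomFilter()
```

A Bloom filter never reports an added item as absent, but it can report an item that was never added as present, so a membership hit is evidence of a match rather than proof.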

Type: Grant

Filed: February 12, 2017

Date of Patent: September 24, 2019

Assignee: PRIVACY ANALYTICS INC.

Inventors: Yasser Jafer, Khaled El Emam

System and method to reduce a risk of re-identification of text de-identification tools

Patent number: 10395059

Abstract: A computer-implemented system and method to reduce re-identification risk of a data set. The method includes the steps of retrieving, via a database-facing communication channel, a data set from a database communicatively coupled to the processor, the data set selected to include patient medical records that meet predetermined criteria; identifying, by a processor coupled to a memory, direct identifiers in the data set; identifying, by the processor, quasi-identifiers in the data set; calculating, by the processor, a first probability of re-identification from the direct identifiers; calculating, by the processor, a second probability of re-identification from the quasi-identifiers; perturbing, by the processor, the data set if one of the first probability or second probability exceeds a respective predetermined threshold, to produce a perturbed data set; and providing, via a user-facing communication channel, the perturbed data set to the requestor.

Type: Grant

Filed: March 7, 2017

Date of Patent: August 27, 2019

Assignee: PRIVACY ANALYTICS INC.

Inventors: Martin Scaiano, Grant Middleton, Varada Kolhatkar, Khaled El Emam

Asymmetric journalist risk model of data re-identification

Patent number: 10242213

Abstract: System and method to produce an anonymized cohort, members of the cohort having less than a predetermined risk of re-identification. The system includes a user-facing communication interface to receive an anonymized cohort request comprising traits to include in members of the cohort; a data source-facing communication channel to query a data source, to find anonymized records that possess at least some of the requested traits; and a processor programmed to carry out the instructions of: forming a dataset from at least some of the anonymized records; calculating a risk of re-identification of the anonymized records in the dataset based upon the data query; perturbing anonymized records in the dataset that exceed a predetermined risk of re-identification, until the risk of re-identification is not greater than the pre-determined threshold, to produce the anonymized cohort; and providing, via a user-facing communication channel, the anonymized cohort.

Type: Grant

Filed: September 21, 2016

Date of Patent: March 26, 2019

Assignee: PRIVACY ANALYTICS INC.

Inventors: Martin Scaiano, Andrew Baker, Stephen Korte, Khaled El Emam

METHODS AND SYSTEMS FOR WATERMARKING OF ANONYMIZED DATASETS

Publication number: 20180232488

Abstract: A method includes receiving an initial dataset. Each record of the initial dataset comprises a set of quasi-identifier attributes and a set of non-quasi-identifier attributes. A processor assigns a link identifier to each record and replaces each set of quasi-identifier attributes with a range to form a generalized set. The processor removes duplicate records based on identical generalized sets to generate de-duplicated records. The processor generates a randomized record by replacing the generalized set of each de-duplicated record with a corresponding set of random values. The processor passes the set of random values of each randomized record through multiple hash functions to generate multiple outputs. The multiple outputs are mapped to a Bloom filter. The processor forms a dataset by combining each randomized record with one or more sets of non-quasi-identifier attributes. The set of random values is a fingerprint for a corresponding record of the dataset.

Type: Application

Filed: February 12, 2017

Publication date: August 16, 2018

Inventors: Yasser Jafer, Khaled El Emam

Method of re-identification risk measurement and suppression on a longitudinal dataset

Patent number: 9990515

Abstract: In longitudinal datasets, it is usually unrealistic that an adversary would know the value of every quasi-identifier. De-identifying a dataset under this assumption results in high levels of generalization and suppression as every patient is unique. Adversary power gives an upper bound on the number of values an adversary knows about a patient. Considering all subsets of quasi-identifiers with the size of the adversary power is computationally infeasible. A method is provided to assess re-identification risk by determining a representative risk which can be used as a proxy for the overall risk measurement and enable suppression of identifiable quasi-identifiers.
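
To see why considering every subset of quasi-identifiers of size equal to the adversary power becomes infeasible, the short sketch below counts those subsets for a single patient. The figures (50 quasi-identifier values, an adversary power of 5) and the names are hypothetical, chosen only to show how quickly the count grows.

```python
import math
from itertools import combinations


def subset_count(num_values, adversary_power):
    """Number of distinct subsets of size adversary_power among num_values."""
    return math.comb(num_values, adversary_power)


# Small case, enumerated explicitly: 4 values, adversary power 2.
small_values = ["dx1", "dx2", "dx3", "dx4"]
small_subsets = list(combinations(small_values, 2))

# Larger hypothetical case: one patient with 50 values, adversary power 5.
large_count = subset_count(50, 5)
```

With 50 values and a power of 5 there are already more than two million subsets for one patient, and that work would be repeated for every patient in the dataset. This is the cost the abstract's representative-risk approach is meant to avoid.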

Type: Grant

Filed: November 30, 2015

Date of Patent: June 5, 2018

Assignee: PRIVACY ANALYTICS INC.

Inventors: Andrew Baker, Luk Arbuckle, Khaled El Emam, Ben Eze, Stephen Korte, Sean Rose, Cristina Ilie

RE-IDENTIFICATION RISK MEASUREMENT ESTIMATION OF A DATASET

Publication number: 20180114037

Abstract: There is provided a system and method executed by a processor for estimating re-identification risk of a single individual in a dataset. The individual, subject or patient is described by a data subject profile such as a record in the dataset. A population distribution is retrieved from a storage device; the population distribution is determined by one or more quasi-identifying fields identified in the data subject profile. An information score is then assigned to each quasi-identifying (QI) value of the one or more quasi-identifying fields associated with the data subject profile. The assigned information scores of the quasi-identifying values for the data subject profile are aggregated into an aggregated information value. An anonymity value is then calculated from the aggregated information value and a size of a population associated with the dataset. A re-identification metric for the individual is then calculated from the anonymity value.

Type: Application

Filed: April 1, 2016

Publication date: April 26, 2018

Inventors: Martin SCAIANO, Stephen KORTE, Andrew BAKER, Geoffrey GREEN, Khaled EL EMAM, Luk ARBUCKLE