DATA PROCESSING SYSTEMS AND METHODS FOR AUTOMATICALLY REDACTING UNSTRUCTURED DATA FROM A DATA SUBJECT ACCESS REQUEST

Info

Publication number: 20230289376
Type: Application
Filed: Aug 6, 2021
Publication Date: Sep 14, 2023
Applicant: OneTrust, LLC (Atlanta, GA)
Inventors: Jonathan Blake Brannon (Smyrna, GA), Kevin Jones (Atlanta, GA), Saravanan Pitchaimani (Atlanta, GA), Haribalan Raghupathy (Seattle, WA), Mahashankar Sarangapani (Atlanta, GA), Mahesh Sivan (Atlanta, GA), Priya Malhotra (Atlanta, GA)
Application Number: 18/019,952

Abstract

Systems and methods are disclosed for analyzing unstructured data in a request for data associated with a data subject to determine whether the unstructured data is relevant to the request. The relevancy of pieces of the unstructured data may be determined by determining a categorization for each such piece of unstructured data and comparing each piece to known personal data associated with the data subject having the same categorization. Pieces of the unstructured data that do not match known personal data having the same categorization are redacted from the request before the request is processed.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a national phase entry of PCT/US2021/044910, filed Aug. 6, 2021, which claims the benefit of U.S. Provisional Patent Application No. 63/061,894, filed Aug. 6, 2020, the entire disclosures of which are hereby incorporated herein by reference in their entirety.

BACKGROUND

Computing tools for managing sensitive data, such as data storage systems and their associated applications for modifying or accessing stored data, are often used to automatically process requests regarding how that particular data is handled. For instance, processing such requests may require these computing tools to search multiple data assets that use a variety of different data structures, storage formats, or software architectures in order to identify and action requests to access personal data, delete or otherwise modify personal data, receive information about the handling, storage, and/or processing of personal data, etc. The effectiveness of these computing tools can be degraded when resources (e.g., processing power, storage, network bandwidth) are used to service requests having extraneous information that is not useful in processing the request, especially when a request is received as an unstructured electronic communication such as an email or text message. For example, such extraneous unstructured data may not correspond to any particular data type recognized by the data storage system to which the request is directed. Devoting resources to processing such extraneous data can degrade system performance through the wasteful expenditure of resources, the provision of an inaccurate or incomplete response to the request, or both.

SUMMARY

A method, according to various embodiments, may include: receiving, by computing hardware, a request for personal data associated with a data subject, the request comprising structured data and unstructured data; retrieving, by the computing hardware, a piece of the personal data by scanning a data source using the structured data; analyzing, by the computing hardware, the unstructured data to determine a first categorization for a first piece of the unstructured data and a second categorization for a second piece of the unstructured data; mapping, by the computing hardware, the first piece of the unstructured data to the piece of the personal data based on the first categorization and the personal data categorization; mapping, by the computing hardware, the second piece of the unstructured data to the piece of the personal data based on the second categorization and the personal data categorization; determining, by the computing hardware, that the first piece of the unstructured data matches the piece of the personal data; determining, by the computing hardware, that the second piece of the unstructured data does not match the piece of the personal data; in response to determining that the first piece of the unstructured data matches the piece of the personal data and the second piece of the unstructured data does not match the piece of the personal data, generating, by the computing hardware, redacted unstructured data comprising the first piece of the unstructured data and excluding the second piece of the unstructured data from the redacted unstructured data; and processing, by the computing hardware, the request using the redacted unstructured data.

In particular embodiments, the method further comprises determining an access method for the data source; and retrieving the piece of personal data comprises retrieving the piece of personal data from the data source using the access method. In particular embodiments, the method further comprises determining a first data type identifier for the data source, determining a second data type identifier for the structured data, and determining that the first data type identifier corresponds to the second data type identifier; and retrieving the piece of personal data comprises, in response to determining that the first data type identifier corresponds to the second data type identifier, retrieving the piece of personal data from the data source using the structured data. In particular embodiments, the piece of personal data is associated with a third data type identifier that is distinct from the first data type identifier and the second data type identifier. In particular embodiments, analyzing the unstructured data comprises: determining a first confidence score for the first categorization and a second confidence score for the second categorization; determining the first categorization for the first piece of the unstructured data based on the first confidence score; and determining the second categorization for the second piece of the unstructured data based on the second confidence score. In particular embodiments, processing the request comprises: determining that the redacted unstructured data represents a portion of the unstructured data greater than a threshold; and in response to determining that the redacted unstructured data represents the portion of the unstructured data greater than the threshold, suspending processing of the request and transmitting, to a user, a notification that the redacted unstructured data represents a portion of the unstructured data greater than the threshold.
In particular embodiments, the method further comprises retrieving a second piece of the personal data by scanning a second data source using the piece of the personal data.

A system, according to various embodiments, may include: a non-transitory computer-readable medium storing instructions; and processing hardware communicatively coupled to the non-transitory computer-readable medium, wherein the processing hardware is configured to execute the instructions and thereby perform operations comprising: receiving a request for personal data associated with a data subject, the request comprising unstructured data; retrieving a piece of the personal data stored on a data source using a personal data categorization associated with the piece of the personal data; determining a first categorization for a first piece of the unstructured data and a second categorization for a second piece of the unstructured data; mapping the first piece of the unstructured data to the piece of the personal data based on the first categorization and the personal data categorization; mapping the second piece of the unstructured data to the piece of the personal data based on the second categorization and the personal data categorization; determining that the first piece of the unstructured data corresponds to the piece of the personal data; determining that the second piece of the unstructured data does not correspond to the piece of the personal data; generating redacted unstructured data comprising the first piece of the unstructured data and excluding the second piece of the unstructured data from the redacted unstructured data; and transmitting the redacted unstructured data for use in processing the request.

In particular embodiments, determining that the first piece of the unstructured data corresponds to the piece of the personal data comprises: determining that the first piece of the unstructured data matches the piece of the personal data; determining a confidence score based on determining that the first piece of the unstructured data matches the piece of the personal data; determining that the confidence score is greater than a threshold value; and in response to determining that the confidence score is greater than the threshold value, determining that the first piece of the unstructured data corresponds to the piece of the personal data. In particular embodiments, determining that the second piece of the unstructured data does not correspond to the piece of the personal data comprises: determining that the second piece of the unstructured data matches the piece of the personal data; determining a confidence score based on determining that the second piece of the unstructured data matches the piece of the personal data; determining that the confidence score is less than a threshold value; and in response to determining that the confidence score is less than the threshold value, determining that the second piece of the unstructured data does not correspond to the piece of the personal data. In particular embodiments, the operations further comprise retrieving a second piece of the personal data stored on a second data source by searching the second data source using the piece of the personal data.
In particular embodiments, the operations further comprise: determining a third categorization for a third piece of the unstructured data; mapping the third piece of the unstructured data to a second piece of the personal data based on the third categorization and a second personal data categorization associated with the second piece of the personal data; and determining that the third piece of the unstructured data corresponds to the second piece of the personal data. In particular embodiments, the operations further comprise determining an access method associated with the data source; and retrieving the piece of personal data comprises retrieving the piece of personal data from the data source using the access method. In particular embodiments, the request further comprises structured data; retrieving the piece of the personal data stored on the data source comprises searching the data source using the structured data; the structured data is associated with a first data type identifier; the piece of the personal data is associated with a second data type identifier; and the first data type identifier is distinct from the second data type identifier.

A non-transitory computer-readable medium, according to various embodiments, may store computer-executable instructions that, when executed by processing hardware, configure the processing hardware to perform operations comprising: receiving an electronic communication comprising a request for personal data associated with a data subject, the request comprising a data subject identifier and message data; retrieving, based on the data subject identifier, a piece of the personal data by scanning a data source using a personal data categorization for the piece of the personal data; analyzing the message data to determine a first categorization for a first piece of the message data and a second categorization for a second piece of the message data; mapping the first piece of the message data to the piece of the personal data based on the first categorization and the personal data categorization; mapping the second piece of the message data to the piece of the personal data based on the second categorization and the personal data categorization; determining that the first piece of the message data matches the piece of the personal data; determining that the second piece of the message data does not match the piece of the personal data; in response to determining that the first piece of the message data matches the piece of the personal data and the second piece of the message data does not match the piece of the personal data, generating redacted message data comprising the first piece of the message data and excluding the second piece of the message data from the redacted message data; and processing the request using the redacted message data.

In particular embodiments, the operations further comprise retrieving a second piece of the personal data by scanning a second data source using the piece of the personal data. In particular embodiments, the piece of the personal data is associated with a first data type identifier; the second piece of the personal data is associated with a second data type identifier; and the first data type identifier is distinct from the second data type identifier. In particular embodiments, the operations further comprise determining the second data source based on the first data type identifier. In particular embodiments, processing the request comprises: determining that the request was processed; based on determining that the request was processed, generating a graphical user interface for a browser application executed on a user device by configuring a first display element configured to display an indication that the request was successfully processed on the graphical user interface and excluding a second display element configured to display an indication that the request was not successfully processed from the graphical user interface; and transmitting an instruction to the browser application causing the browser application to present the graphical user interface on the user device. In particular embodiments, generating the graphical user interface comprises configuring a third display element configured to display the personal data on the graphical user interface.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of a system and method for automatically redacting unstructured data from a data subject access request are described below. In the course of this description, reference will be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:

FIG. 1 depicts an example of a computing environment for performing redaction with respect to a data subject access request.

FIG. 2 is a flow chart showing an example of a process performed by a Personal Data Discovery and Identity Graph Generation Module according to various embodiments.

FIG. 3 is a diagram illustrating a representation of an exemplary identity graph and associated metadata according to various embodiments.

FIG. 4 is a flow chart showing an example of a process performed by an Automatic Unstructured Data Redaction Module according to various embodiments.

FIG. 5 is a diagram illustrating representations of exemplary data structures that may be used by systems and methods for automatically redacting unstructured data according to various embodiments.

FIG. 6 is a diagram illustrating an exemplary network environment in which the various systems and methods for automatically redacting extraneous information and/or unstructured data may be implemented.

FIG. 7 is a schematic diagram of a computer that is suitable for use in various embodiments.

DETAILED DESCRIPTION

Various embodiments now will be described more fully hereinafter with reference to the accompanying drawings. It should be understood that the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Like numbers refer to like elements throughout.

Overview

In various embodiments, an unstructured data redaction system may be configured to dynamically determine whether one or more pieces of data included in a request for data associated with a particular data subject (e.g., a data subject access request (DSAR), a consumer rights request, etc.) are relevant to the request. The unstructured data redaction system may analyze the request using an identity graph representing personal data associated with the data subject to identify pieces of data in the request that are relevant to the request (e.g., associated with the data subject) and pieces of data that are not relevant to the request (e.g., not associated with the data subject). The unstructured data redaction system may then redact pieces of data that are not relevant and process the request using the relevant data.

To generate an identity graph that represents a data subject's personal data, an exemplary unstructured data redaction system may be configured to search various data sources using pieces of personal data associated with the data subject. The unstructured data redaction system may scan each of the data sources using the data subject's personal data to discover and correlate data type identifiers associated with the identified personal data with that data subject. Using this information, the unstructured data redaction system may generate an identity graph of the user's personal data. The identity graph may include a mapping of the personal data that is stored or otherwise handled at each data source and the means by which such personal data may be accessed at each data source. The identity graph may be stored as metadata along with the data type identifiers that are used with the particular data source to access the personal data stored on the data source. Such data type identifiers may indicate a classification and/or categorization for the personal data (e.g., telephone number, home address, postal code, name, etc.).
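
By way of non-limiting illustration, the identity graph described above might be represented as a collection of entries, each recording the data source, the data type identifier, the stored value, and the means of access. The following Python sketch is an assumption made for explanatory purposes only; the names IdentityGraph and GraphEntry, the field names, and the sample values are hypothetical and are not drawn from any particular embodiment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GraphEntry:
    """One piece of personal data discovered at a data source (hypothetical)."""
    data_source: str     # where the piece is stored or otherwise handled
    data_type_id: str    # categorization, e.g. "email_address"
    value: str           # the stored personal data value
    access_method: str   # how the source is accessed, e.g. "search_by_email"


@dataclass
class IdentityGraph:
    """Illustrative identity graph for a single data subject."""
    data_subject: str
    entries: list = field(default_factory=list)

    def add(self, entry: GraphEntry) -> None:
        self.entries.append(entry)

    def by_type(self, data_type_id: str) -> list:
        """Return every known piece of personal data with the given categorization."""
        return [e for e in self.entries if e.data_type_id == data_type_id]

    def sources(self) -> set:
        """Return the data sources at which personal data was found."""
        return {e.data_source for e in self.entries}


# Placeholder values only; none of these come from a real system.
graph = IdentityGraph(data_subject="subject-001")
graph.add(GraphEntry("customer_db", "email_address", "jdoe@example.com", "search_by_email"))
graph.add(GraphEntry("customer_db", "postal_code", "30301", "search_by_email"))
graph.add(GraphEntry("drivers_db", "social_security_number", "000-00-0000", "search_by_ssn"))
```

In this sketch, a lookup by data type identifier (by_type) returns the known personal data against which a categorized piece of request data could later be compared.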

When a request for data associated with a data subject is received, the unstructured data redaction system may parse the information in the request to classify and/or categorize pieces of data in the request. The unstructured data redaction system may use the identity graph generated for the data subject's personal data to determine and/or retrieve (e.g., all available) personal data associated with the user and the associated data type identifiers for each piece of such personal data. The unstructured data redaction system may map the categorized request data to the personal data associated with the data subject based on the data type identifiers associated with such personal data. The unstructured data redaction system may then compare the categorized request data to the corresponding personal data to determine whether, or to what extent, each piece of the request data matches the known personal data for the user. The quality of a data match may be determined based on a correlation of the request data values and the known personal data values. In particular embodiments, the quality of a data match may be further determined based on a confidence score of the known personal data (e.g., retrieved from the data sources using the graph). The unstructured data redaction system may then discard or otherwise redact those pieces of data in the request that do not match (e.g., sufficiently match) known personal data associated with the data subject as determined using the identity graph. The unstructured data redaction system may then process the request using the unredacted request data.
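
As a minimal sketch of the comparison and redaction steps described above, the following Python example keeps only those categorized pieces of request data that sufficiently match known personal data having the same data type identifier. The example assumes that categorization has already been performed; the threshold value, the 0-to-1 confidence scale, and the use of the standard library's difflib.SequenceMatcher as the measure of correlation are illustrative assumptions rather than features of any described embodiment.

```python
from difflib import SequenceMatcher

# Hypothetical known personal data for the data subject, keyed by data type
# identifier. Each value is paired with an assumed confidence score (0 to 1).
KNOWN_PERSONAL_DATA = {
    "email_address": ("jdoe@example.com", 0.95),
    "postal_code": ("30301", 0.90),
}

MATCH_THRESHOLD = 0.8  # assumed cutoff for a "sufficient" match


def match_quality(request_value: str, known_value: str, known_confidence: float) -> float:
    """Correlate the request value with the known value, weighted by confidence."""
    similarity = SequenceMatcher(None, request_value.lower(), known_value.lower()).ratio()
    return similarity * known_confidence


def redact(categorized_request_data):
    """Keep only pieces that sufficiently match known data of the same categorization."""
    kept = []
    for data_type_id, value in categorized_request_data:
        known = KNOWN_PERSONAL_DATA.get(data_type_id)
        if known is None:
            continue  # no known personal data with this categorization: redact
        known_value, confidence = known
        if match_quality(value, known_value, confidence) >= MATCH_THRESHOLD:
            kept.append((data_type_id, value))
    return kept


request_data = [
    ("email_address", "JDoe@example.com"),  # matches the known email address
    ("postal_code", "99999"),               # same categorization, different value
    ("favorite_color", "blue"),             # categorization absent from known data
]
remaining = redact(request_data)
```

Under these assumptions, only the email address survives redaction; the request would then be processed using the remaining data.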

FIG. 1 illustrates an exemplary computing environment in which unstructured data may be redacted from a DSAR and the DSAR may be processed using the remaining (e.g., relevant) data. A DSAR processing system 110 may generate, at a DSAR generation module 115, a DSAR 111 in response to, for example, a request from a data subject. The DSAR 111 may be a request to perform any data request actions as described herein. The DSAR 111 may include unstructured data. The DSAR generation module 115 may provide the DSAR 111 to a personal data discovery and correlation module 120 of an unstructured data redaction system 150 that may use the DSAR 111 to generate an identity graph 121 representing the personal data associated with the data subject that requested the DSAR 111. The personal data discovery and correlation module 120 may search various data sources 125 using pieces of personal data associated with the data subject that the personal data discovery and correlation module 120 may determine, for example, from the DSAR 111. The personal data discovery and correlation module 120 may scan the data sources 125 using the data subject's personal data to discover and correlate data type identifiers associated with the identified personal data with that data subject. Using this information, the personal data discovery and correlation module 120 may generate the identity graph 121 of the user's personal data. The identity graph 121 may include a mapping of the personal data that is stored or otherwise handled at each of the data sources 125 and the means by which such personal data may be accessed at each such data source. The identity graph 121 may be stored as metadata along with the data type identifiers that are used with the particular data source of the data sources 125 to access the personal data stored on the particular data source. 
Such data type identifiers may indicate a classification and/or categorization for the personal data (e.g., telephone number, home address, postal code, name, etc.).

An automatic unstructured data redaction module 140 may parse the information in the DSAR 111 to classify and/or categorize pieces of data in the DSAR 111. The automatic unstructured data redaction module 140 may use the identity graph 121 generated for the data subject's personal data to determine and/or retrieve the (e.g., all available) personal data associated with the user from the data sources 125 and the associated data type identifiers for each piece of such personal data. The automatic unstructured data redaction module 140 may map the categorized data from the DSAR 111 to the personal data associated with the data subject based on the data type identifiers associated with such personal data. The automatic unstructured data redaction module 140 may then compare the categorized data from the DSAR 111 to the corresponding personal data to determine whether, or to what extent, each piece of the categorized data from the DSAR 111 matches the known personal data for the user. As described in more detail herein, the quality of a data match may be taken into account in determining whether particular pieces of the categorized data from the DSAR 111 are relevant. The automatic unstructured data redaction module 140 may then discard or otherwise redact those pieces of categorized data from the DSAR 111 that do not sufficiently match known personal data associated with the data subject as determined using the identity graph 121. The automatic unstructured data redaction module 140 may then provide the redacted DSAR 141 to a DSAR processing module 116 of the DSAR processing system 110 for processing using the unredacted request data.

Personal Data Discovery and Identity Graph Generation Systems and Methods

As noted herein, an entity that handles (e.g., collects, receives, transmits, stores, processes, shares, etc.) sensitive and/or personal information associated with particular individuals (“personal data”) may be subject to various laws and regulations regarding the handling of such personal data. The applicable laws and regulations may vary based on the jurisdiction in which the entity is operating, the jurisdiction in which the individual associated with the personal data (“data subject”) is located, and/or the jurisdiction in which the personal data is handled. In many jurisdictions, an entity that handles personal data may be required to track the personal data it handles by maintaining and/or readily generating information that indicates where the personal data is stored, how the personal data is processed, how the personal data is collected, etc. The entity may be required to have this information available (or have the ability to obtain this information) so that it can readily service data subject access requests (DSARs). As noted above, a DSAR may be a request from a data subject or other user to access personal data, delete personal data, receive information about the handling of personal data, etc. The entity may also, or instead, be required to have this information available (or have the ability to obtain this information) to comply with various aspects of applicable laws, regulations, and/or standards.
The entity may also, or instead, want to have this information available (or have the ability to obtain this information) to perform other functions, such as mining legacy systems for personal data (e.g., to ensure that legacy systems comply with current laws, regulations, and/or standards), creating maps of where personal data may be stored, identifying personal data that may need to be modified (deleted due to age or other factors, updated, supplemented, etc.), generating identity graphs representing personal data associated with a particular data subject, and/or performing unstructured data redaction functions in processing requests for data.

As the quantity of personal data increases over time, and as the number of systems that may possibly be handling personal data increases, determining how particular personal data has been handled (e.g., collected, received, transmitted, stored, processed, shared, etc.) across all of the potential systems that may have handled such personal data can be difficult. Discovering particular personal data across multiple systems may become even more challenging when each of the systems may use its own, possibly unique, method of identifying the data subject associated with the particular personal data. Where different means of identifying a data subject are used across multiple systems, locating personal data associated with a particular data subject may not be feasible by simply using a name or other single piece of information associated with the particular data subject.

In various embodiments, the unstructured data redaction system may connect to data sources that handle personal data for a particular data subject. Such data sources may include, but are not limited to, file repositories (structured and/or unstructured), data repositories, databases, enterprise applications, mobile applications (“apps”), cloud storage, local storage, and/or any other type of system that may be configured to handle personal data. The unstructured data redaction system may analyze some or all of the data stored on the data sources to determine whether such data includes pieces of personal data. If so, the unstructured data redaction system may label such data as personal data or otherwise store an indication that such data stored on the data sources is personal data. The unstructured data redaction system may then record the location of each of the pieces of personal data and/or the location of each of the data sources on which each of the pieces of personal data was discovered. The unstructured data redaction system may also record the manner of identification used to identify each of the pieces of personal data. The unstructured data redaction system may store any such information as metadata. This personal data information may then be used when the unstructured data redaction system needs to locate the particular personal data, for example, to respond to a request for data, generate an identity graph representing the particular personal data, and/or perform unstructured data redaction in processing a request for data. The unstructured data redaction system may also, or instead, use such personal data information to comply with various requirements (e.g., legal, regulatory, standards, etc.), to mine legacy systems for personal data, to create a map of where personal data may be stored, to identify personal data that may need to be modified, etc.

In analyzing the data on various data sources, the unstructured data redaction system may determine whether a particular piece of personal data on a first data source corresponds to a particular piece of personal data on a second data source using various methods. For example, the unstructured data redaction system may compare the pieces of personal data (e.g., text string comparison) to determine if they are the same. In particular embodiments, the unstructured data redaction system may compare the data type identifiers of the pieces of personal data to determine if they correspond. In particular embodiments, the unstructured data redaction system may use artificial intelligence, big data methods, and/or neural networks to perform more sophisticated analysis to determine whether the particular pieces of personal data correspond to one another. For example, in some embodiments, two particular pieces of personal data may not have similar data type identifiers and/or may be stored in different formats but may actually both represent a same type of personal data (e.g., email address, telephone number, name, etc.). In such embodiments, the unstructured data redaction system may use artificial intelligence, machine learning, neural networking, big data methods, natural language processing, contextual awareness, and/or continual learning (in any combination) to identify particular pieces of personal data and/or to determine whether and how particular pieces of personal data match up to one another. Once a piece of personal data is identified and/or matched with one or more other pieces of personal data, the unstructured data redaction system may store information reflecting the identification and/or matching in metadata for future use, including as described herein.
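
For purposes of illustration only, the simpler comparisons described above (text string comparison and comparison of data type identifiers) might be sketched in Python as follows. The alias table and the normalization rules are hypothetical assumptions; this rule-based sketch stands in for, and does not implement, the artificial intelligence, machine learning, or natural language processing techniques that particular embodiments may use.

```python
import re

# Hypothetical table mapping source-specific field names to a shared
# data type identifier.
TYPE_ALIASES = {
    "email": "email_address",
    "e_mail": "email_address",
    "email_address": "email_address",
    "phone": "telephone_number",
    "tel_no": "telephone_number",
}


def canonical_type(data_type_id: str) -> str:
    """Map a source-specific identifier to a shared one; unknown ones pass through."""
    key = data_type_id.strip().lower()
    return TYPE_ALIASES.get(key, key)


def normalize(value: str, data_type_id: str) -> str:
    """Reduce formatting differences before a text string comparison."""
    if canonical_type(data_type_id) == "telephone_number":
        return re.sub(r"\D", "", value)  # keep digits only
    return value.strip().lower()


def pieces_correspond(type_a: str, value_a: str, type_b: str, value_b: str) -> bool:
    """True when both the data type identifiers and the normalized values agree."""
    if canonical_type(type_a) != canonical_type(type_b):
        return False
    return normalize(value_a, type_a) == normalize(value_b, type_b)
```

In this sketch, a telephone number stored as "(404) 555-0100" under the field name "phone" at one data source would be treated as corresponding to "404-555-0100" stored under "tel_no" at another, even though neither the identifiers nor the raw strings are identical.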

In a particular embodiment, the unstructured data redaction system may tag (e.g., in metadata) particular pieces of personal data with an indicator that indicates that the respective particular piece of personal data can be used to query the data source associated with that particular piece of personal data (e.g., a “queryable” tag). The unstructured data redaction system may also, or instead, tag (e.g., in metadata) fields associated with personal data storage at a particular data source with an indicator that indicates that the respective field may contain data that can be used to query that data source (e.g., a “queryable” tag). The unstructured data redaction system may then use such a tag in future attempts to locate particular personal data, for example, stored in a particular data source.
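
As one hypothetical way of recording and later using such a tag, the following Python sketch stores a set of tags in the metadata for each field and returns the fields of a given data source that carry the “queryable” tag. The metadata layout, the data source names, and the field names are assumptions made for illustration.

```python
# Hypothetical metadata recorded for fields at each data source.
field_metadata = [
    {"data_source": "customer_db", "field": "email_address",
     "tags": {"personal_data", "queryable"}},
    {"data_source": "customer_db", "field": "street_address",
     "tags": {"personal_data"}},
    {"data_source": "drivers_db", "field": "social_security_number",
     "tags": {"personal_data", "queryable"}},
]


def queryable_fields(metadata, data_source):
    """Return the fields tagged as usable for querying the given data source."""
    return [
        m["field"]
        for m in metadata
        if m["data_source"] == data_source and "queryable" in m["tags"]
    ]
```

A later attempt to locate personal data at a particular data source could consult queryable_fields first to learn which known values are worth submitting as search terms.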

FIG. 2 illustrates an example process that may be performed by a Personal Data Discovery and Identity Graph Generation Module 200. At Step 210, a particular user may submit a DSAR requesting a copy of the personal data associated with the particular data subject indicated by the DSAR. The DSAR may include the particular data subject's first name, last name, and email address. While this example uses a DSAR requesting a data subject's personal data, in various embodiments the unstructured data redaction system may locate particular personal data in response to a need to, for example, comply with various requirements (e.g., legal, regulatory, standards, etc.), mine legacy systems for personal data, create a map of where personal data may be stored, identify personal data that may need to be modified, proactively generate an identity graph, etc.

At Step 220, using the information included in the DSAR, such as personal data, the unstructured data redaction system may identify a particular data subject associated with the DSAR (who may or may not be the user that submitted the DSAR). At Step 230, the unstructured data redaction system may identify data sources that store personal data. In particular embodiments, the unstructured data redaction system may identify data sources that store personal data generally and scan (e.g., all of) such data sources for personal data associated with the particular data subject as described in more detail below. Alternatively, the unstructured data redaction system may determine a subset of the data sources that store personal data generally for scanning based on, for example, information in the DSAR. For example, the unstructured data redaction system may determine that the DSAR is a request for a specific type of information (e.g., billing, financial, healthcare, etc.) and may determine a subset of data sources that store personal data associated with that specific type of information. In another example, the unstructured data redaction system may determine that the DSAR is a request from a specific type of data subject (e.g., customer, subscriber, member, etc.) and may determine a subset of data sources that store personal data associated with that specific type of data subject. The unstructured data redaction system may also, or instead, use any other means of determining a particular set of data sources for personal data scanning.
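
The selection of a subset of data sources at Step 230 might, as a minimal sketch, be expressed as a filter over a catalog of data sources. In the following Python example, the catalog, its attribute names, and the data source names are hypothetical; the sketch assumes that each data source has already been annotated with the types of information and the types of data subject for which it stores personal data.

```python
# Hypothetical catalog: each data source lists the information types and
# data subject types for which it stores personal data.
DATA_SOURCES = {
    "billing_db": {"info_types": {"billing", "financial"},
                   "subject_types": {"customer"}},
    "patient_records": {"info_types": {"healthcare"},
                        "subject_types": {"member"}},
    "newsletter_list": {"info_types": {"marketing"},
                        "subject_types": {"subscriber", "customer"}},
}


def select_sources(info_type=None, subject_type=None):
    """Return the sources to scan; with no criteria, every source is selected."""
    selected = []
    for name, attrs in DATA_SOURCES.items():
        if info_type is not None and info_type not in attrs["info_types"]:
            continue
        if subject_type is not None and subject_type not in attrs["subject_types"]:
            continue
        selected.append(name)
    return sorted(selected)
```

Calling select_sources with no arguments corresponds to scanning all data sources that store personal data generally, while supplying an information type or a data subject type narrows the scan to the matching subset.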

The unstructured data redaction system, according to this particular example, may have access to personal data associated with the particular data subject stored in two separate data sources. The first data source may be a customer database that stores the username of the particular data subject, along with the particular data subject's email address, first name, last name, social security number, postal code (e.g., zip code), and street address. The first data source may (e.g., only, or most efficiently) be searchable by email address. The second data source may be a certified drivers database that stores the particular data subject's driver's license record and social security number. The second data source may (e.g., only, or most efficiently) be searchable by social security number. In this example, if an initial search was executed against these two data sources using the information provided in the DSAR (the particular data subject's first name, last name, and email address), only the first data source would return results because only it may be searched using an email address, whereas the second data source may not be searchable using an email address.

In various embodiments, the unstructured data redaction system may record metadata that correlates (or may be used to correlate) the data in the two data sources. For example, at Step 240, the unstructured data redaction system may scan the first data source using the email address provided by the DSAR. The unstructured data redaction system may determine to scan the first data source by determining that the first data source is searchable using personal data or other information that may be associated with the particular data subject that was included in the DSAR. The unstructured data redaction system may obtain or identify, in response to the scan, first additional personal data associated with the particular data subject stored on the first data source. For example, the unstructured data redaction system may obtain, via the scan of the first data source, the particular data subject's username, email address, first name, last name, social security number, postal code, and street address as stored on the first data source.

At Step 250, using a piece of personal data obtained from the first data source, such as the particular data subject's social security number, the unstructured data redaction system may scan the second data source to obtain or identify second additional personal data associated with the particular data subject stored on the second data source. For example, the unstructured data redaction system may obtain, via the scan of the second data source, the particular data subject's driver's license information (e.g., driver's license number).

At Step 260, the unstructured data redaction system may perform a check to determine whether the first additional personal data and the second additional personal data correspond to the particular data subject. For example, the unstructured data redaction system may compare the information received from each of the two data sources to verify that it is consistent and appears to correspond to the particular data subject (e.g., pieces of personal data of the same type have the same value or substantially similar values and/or are associated with pieces of personal data known to be associated with the particular data subject). At Step 270, the unstructured data redaction system may store a record (e.g., in metadata) of the particular personal data information (e.g., personal data, type of personal data) stored in each of the two data sources and how to access such information at each data source for the particular data subject.
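Steps 240 through 270 can be illustrated with a minimal Python sketch. The in-memory dictionaries `CUSTOMER_DB` and `DRIVERS_DB`, the field names, the sample values, and the functions `consistent` and `discover` are all hypothetical stand-ins chosen for illustration; an actual embodiment may query real data sources and apply a looser consistency check.

```python
from typing import Optional

# Hypothetical first data source: searchable by email address only.
CUSTOMER_DB = {
    "jane@example.com": {
        "username": "jdoe",
        "email": "jane@example.com",
        "first_name": "Jane",
        "last_name": "Doe",
        "ssn": "000-00-0000",
        "postal_code": "30301",
    }
}
# Hypothetical second data source: searchable by social security number only.
DRIVERS_DB = {
    "000-00-0000": {"ssn": "000-00-0000", "drivers_license": "D1234567"}
}


def consistent(first: dict, second: dict) -> bool:
    """Step 260: data types present in both records must hold the same value."""
    shared = first.keys() & second.keys()
    return all(first[key] == second[key] for key in shared)


def discover(dsar: dict) -> Optional[dict]:
    """Return metadata describing what each source stores and how to search it."""
    first = CUSTOMER_DB.get(dsar["email"])            # Step 240
    if first is None:
        return None
    metadata = {
        "customer_db": {"search_key": "email", "data_types": sorted(first)}
    }
    second = DRIVERS_DB.get(first["ssn"])             # Step 250
    if second is not None and consistent(first, second):
        # Step 270: record the second source only after the Step 260 check.
        metadata["drivers_db"] = {"search_key": "ssn", "data_types": sorted(second)}
    return metadata
```

In this sketch the social security number obtained from the first source is what makes the second source reachable, since the DSAR itself carries only a name and an email address.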

FIG. 3 illustrates an exemplary identity graph 300 and table 390 representing examples of data structures that may be used in the various embodiments. The identity graph 300 is a diagrammatic representation of an identity graph representing personal data associated with a data subject 350. The table 390 represents metadata associated with the identity graph 300.

In various embodiments, the unstructured data redaction system may scan data sources that may store personal data and generate a graph for each data source. In this example, the unstructured data redaction system may scan each of the data sources 310, 320, and 330 to generate graphs 301, 302, and 303, respectively. Each of graphs 301, 302, and 303 may also be referred to as a node of the identity graph 300. Each such graph may include a mapping of the personal data that is stored or otherwise handled at the respective data source and the means by which such personal data may be accessed. The graphs may be stored as metadata along with the data type identifiers identifying the types of data that may be searched for in the particular data source to access the personal data stored on the data source. For example, if a data source has a “telephone number” data type identifier, telephone numbers may be searched in the data source to retrieve records that may include other types of data that are not searchable on that data source (e.g., search for a particular telephone number and if a match is found, data associated with the particular telephone number may be retrieved because it is linked to the telephone number, such as name, address, email, etc.).

In this example, the table 390 illustrates the metadata associated with each data source graph, or node, of identity graph 300. As can be seen in this figure, the data source 310 stores email, telephone numbers, addresses, and names, is searchable (e.g., queryable) using telephone numbers (e.g., telephone numbers may be searched on data source 310 to retrieve other data associated with a telephone number), and is accessible using the particular access method listed for the data source 310 in the table 390. The access method listed in the table 390 may be a particular query that may be used to retrieve data from a data source or a reference (e.g., indicator, pointer, identifier, etc.) thereto. Alternatively, the access method listed in the table 390 may be a query template, a script, and/or any other means of accessing data at a data source or a reference (e.g., indicator, pointer, identifier, etc.) thereto.
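One possible in-memory representation of a row of the table 390 is sketched below in Python. The class name `SourceNode`, its field names, the helper `can_search`, and the query string are assumptions made for illustration; the disclosure does not prescribe a particular schema or query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceNode:
    """Metadata for one data source graph (node) of an identity graph."""
    source_id: str
    stored_types: frozenset    # data types stored at the source
    searchable_by: frozenset   # data type identifiers the source can be queried with
    access_method: str         # a query, query template, script, or reference to one


# Node for the data source 310 as described for the table 390.
NODE_310 = SourceNode(
    source_id="310",
    stored_types=frozenset({"email", "telephone number", "address", "name"}),
    searchable_by=frozenset({"telephone number"}),
    # Hypothetical query template; the table and column names are invented.
    access_method="SELECT * FROM contacts WHERE phone = :value",
)


def can_search(node: SourceNode, data_type: str) -> bool:
    """True when the source can be queried with the given data type identifier."""
    return data_type in node.searchable_by
```

Separating `stored_types` from `searchable_by` captures the distinction drawn above: the data source 310 stores email addresses but can only be queried by telephone number.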

As the unstructured data redaction system scans various data sources, the unstructured data redaction system may discover new data type identifiers for a particular data subject and/or personal data associated with the particular data subject. The unstructured data redaction system may use a dependency graph to indicate which data sources use data type identifiers that may be obtained from other data sources, thus creating a record of the interrelation of the various data sources. The unstructured data redaction system may store these new data type identifiers in metadata and determine whether there are any other data sources in scope (e.g., that use the same data type identifier). For those data sources that do not use the same data type identifier, the unstructured data redaction system may use another data type identifier determined from another data source to access the personal data for the particular data subject in those data sources.

For example, referring again to FIG. 3, the unstructured data redaction system may receive or determine a particular data subject's telephone number (e.g., in a request for data associated with the particular data subject) and may scan the data source 310 using the telephone number as the data type identifier and the access method associated with the data source 310. The unstructured data redaction system may discover that the data source 310 also stores email, addresses, and names. The unstructured data redaction system may store data type identifiers for this data and use those data type identifiers to scan other data sources that are not searchable by telephone number but may be searchable by the data type identifiers discovered at the data source 310. For example, having obtained the particular data subject's email address from the scan of the data source 310, the unstructured data redaction system may then scan the data sources 320 and 330 (which may be searchable by email but not by telephone number) using the email address as the data type identifier. These scans may result in the discovery of additional data associated with the particular data subject, as shown in the table 390.

As will be appreciated, the various disclosed embodiments may facilitate, based on a single piece of a particular data subject's data, the discovery of many types of data associated with a particular data subject in a variety of data sources that may each use different types of data type identifiers and different means of access. The unstructured data redaction system may use the other types of data discovered in a first data source using a first type of data to scan a second set of data sources that may not be searchable with the first type of data. The unstructured data redaction system may add a node to the identity graph for each data source in the second set of data sources in which the unstructured data redaction system identified data associated with a particular data subject. The unstructured data redaction system may then scan yet a third set of data sources using data discovered in the second set of data sources and add nodes for data sources in the third set of data sources to the identity graph as data associated with the particular data subject is identified. The unstructured data redaction system may execute this process iteratively until the available data sources have all been scanned and a complete identity graph has been generated that can be used to efficiently perform other functions, such as automatically redacting unstructured data from a request for data.
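The iterative discovery described above can be sketched as a loop that repeats until no new data source becomes reachable. In this Python sketch the function `build_identity_graph`, the `SOURCES` layout, the "account number" link between sources, and all sample values are hypothetical, and each source is assumed to be searchable by exactly one data type identifier for simplicity.

```python
def build_identity_graph(sources: dict, seed: dict) -> dict:
    """Map each reachable source to the data types found there for one data subject.

    sources: name -> {"searchable_by": data type, "records": list of dicts}
    seed:    data type identifier -> value known at the outset
    """
    known = dict(seed)   # every data type/value pair discovered so far
    graph = {}           # source name -> sorted data types found at that source
    progress = True
    while progress:      # repeat until a full pass adds no new node
        progress = False
        for name, source in sources.items():
            key = source["searchable_by"]
            if name in graph or key not in known:
                continue
            record = next(
                (r for r in source["records"] if r.get(key) == known[key]), None
            )
            if record is None:
                continue
            graph[name] = sorted(record)
            for data_type, value in record.items():
                known.setdefault(data_type, value)   # newly discovered identifiers
            progress = True
    return graph


# Source 330 is listed first to show that it is only reached on a later pass.
SOURCES = {
    "330": {"searchable_by": "account number",
            "records": [{"account number": "A-17", "postal code": "30301"}]},
    "320": {"searchable_by": "email",
            "records": [{"email": "jane@example.com", "account number": "A-17"}]},
    "310": {"searchable_by": "telephone number",
            "records": [{"telephone number": "555-010-0199",
                         "email": "jane@example.com", "name": "Jane Doe"}]},
}
```

Starting from only a telephone number, the loop reaches the source 310 first, then the source 320 through the discovered email address, and finally the source 330 through the account number found at the source 320.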

In various embodiments, the unstructured data redaction system may generate identity graphs according to the disclosed embodiments at any time. For example, the unstructured data redaction system may generate an identity graph for any new, or newly detected, data subject in response to the detection of the new data subject. Alternatively, or in addition, the unstructured data redaction system may generate an identity graph associated with a data subject's personal data in response to receiving a request for data associated with that data subject, for example, before processing the request. The unstructured data redaction system may also, or instead, modify any such graphs in response to an event (e.g., detection of personal data modification on a data source, detection of the addition and/or removal of a data source, etc.) or on a recurring (e.g., periodic) basis. The unstructured data redaction system may also, or instead, delete any such graphs in response to an event (e.g., detection of the removal of personal data from a data source, detection of the removal of data sources associated with the graph, etc.), or in response to determining that the graph is no longer in use (e.g., unused for at least a pre-determined period of time).

Systems and Methods for Automatically Redacting Unstructured Data from a Data Request

As described herein, an entity that handles personal data associated with a particular data subject may receive data requests (e.g., DSARs) from, or on behalf of, the data subject. Each such request may be a request to access, delete, retrieve, and/or modify personal data associated with the data subject. Each such request may also, or instead, be a request for information about the manner in which the entity handles, stores, and/or processes the personal data associated with the data subject.

Often such requests may take the form of, or may be provided via, electronic communications such as emails, chats, texts, or documents containing unstructured data (e.g., data for which data types and/or associations are not indicated). Such requests may include information that is not relevant to or useful in processing the request. Such requests may also, or instead, include information that is associated with personal data of users other than the data subject associated with the request. It can be challenging to separate the useful (e.g., for purposes of processing the request) information in a request from information that is not useful. For example, an email associated with a request may include names, telephone numbers, email addresses, and/or home addresses of several people (e.g., in an email string) while the request is related to only the personal data associated with a single particular data subject. The disclosed unstructured data redaction systems and methods provide means of automatically and efficiently redacting such extraneous information from a request while retaining the relevant personal data associated with a particular data subject for processing the request.

In various embodiments, and as described in more detail above, the unstructured data redaction system may be configured to generate an identity graph for a data subject's personal data using pieces of the personal data to search across various data sources. The unstructured data redaction system may use pieces of personal data associated with a data subject to search across various data sources to discover and correlate associated data type identifiers with the particular data subject. For example, the unstructured data redaction system may use a known piece of information for a data subject (e.g., a first name, a last name, an account number, an email address, a telephone number, a username, an IP address, etc.) to identify other pieces of information from a data source associated with that known piece of information. The unstructured data redaction system may then correlate those identified other pieces of information with the data subject and store (e.g., in metadata) such correlation information to generate an identity graph for the data subject. The identity graph may include a mapping of the personal data (e.g., types of personal data, categories of personal data, etc.) that is stored or otherwise handled at each data source and the means by which such personal data may be accessed. The graph may be stored as metadata including the data type identifiers that are used with the particular data source to access the types of personal data stored on the data source for the data subject. Such data type identifiers may indicate a classification or category for the data (e.g., telephone number, home address, postal code, name, etc.). The graph can then be used to retrieve any information identified in the graph from the data sources as needed, for example to process a request for data and/or to redact extraneous data from such a request as described herein.

In response to receiving a request for data (e.g., DSAR, consumer rights request, etc.) from, or on behalf of, a particular data subject, the unstructured data redaction system may use an identity graph associated with the particular data subject's personal data to determine and/or retrieve (e.g., all or any portion of) the available personal data associated with the data subject.

Further in response to receiving the data request, the unstructured data redaction system may analyze the information in the request to classify and/or categorize each piece of information in the request. In classifying and/or categorizing each such piece of information, the unstructured data redaction system may assign each piece of information a data type identifier, for example, selected from the data type identifiers that may potentially be used to categorize pieces of personal data identified in identity graphs as described herein. In particular embodiments, the unstructured data redaction system may use natural language processing (NLP), machine learning, neural networks, and/or any other advanced processing techniques to identify and categorize information in a request. The unstructured data redaction system may also assign a confidence score to the categorization of each piece of information in a request using various techniques (e.g., 70% confident a particular piece of information is a postal code, 80% confident a particular piece of information is a telephone number, etc.). Categorizations may be associated with a type of information determined for each piece of information in the request (e.g., email, address, first name, last name, postal code, telephone number, etc.).

After the request information has been categorized and the data subject's personal data has been retrieved using an identity graph, the unstructured data redaction system may map pieces of the categorized request information to pieces of the data subject's personal data based on the data type identifiers and categories associated with each such piece of information. For example, the unstructured data redaction system may determine that a piece of request information appears to be a telephone number and may therefore categorize that piece of request information as a “telephone number.” The unstructured data redaction system may then match that piece of request information to a retrieved piece of personal data having a “telephone number” data type identifier as indicated in the data subject's graph to determine a data pairing that may then be compared as described below.
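The mapping of categorized request information to retrieved personal data can be sketched as a grouping by data type identifier. The function name `pair_by_type` and the tuple layout below are assumptions for illustration only.

```python
from collections import defaultdict


def pair_by_type(request_pieces: list, personal_data: list) -> list:
    """Pair each categorized request piece with stored values of the same data type.

    Both arguments are lists of (data_type, value) tuples. The result is a list
    of (data_type, request_value, stored_value) pairings to be compared later.
    """
    stored_by_type = defaultdict(list)
    for data_type, value in personal_data:
        stored_by_type[data_type].append(value)
    return [
        (data_type, requested, stored)
        for data_type, requested in request_pieces
        for stored in stored_by_type.get(data_type, [])
    ]
```

A request piece whose data type has no counterpart in the retrieved personal data yields no pairing in this sketch, so it cannot later be found to match.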

The unstructured data redaction system may then determine, for each pairing of a piece of request data with a piece of retrieved personal data that have matching categories/data type identifiers, whether the piece of request data matches the piece of retrieved personal data. For example, where the unstructured data redaction system has determined a pairing of a piece of request information categorized as a “telephone number” with a retrieved piece of personal data having a “telephone number” data type identifier, the unstructured data redaction system may then determine whether the telephone numbers represented by these pieces of information are the same telephone number. Various techniques may be used to determine whether the pieces of data match, including a strict character string match, NLP-based matching, and/or any other data matching techniques or combinations thereof. The unstructured data redaction system may determine that those pieces of information from the request that match a piece of personal data associated with the data subject are relevant to the request and may be used in processing the request.

The unstructured data redaction system may determine whether the data associated with each piece of information in a pairing is a match based on the correlation of the data values in combination with other criteria. For example, the unstructured data redaction system may consider a confidence score for the piece of information that was detected by analyzing the request and/or the piece of information retrieved from a data source using an identity graph. Other criteria may also be used. The unstructured data redaction system may be configured to calculate a match score for each pairing (e.g., 100% match, 75% match, etc.) and determine that a pairing constitutes a match when the respective match score meets or exceeds a threshold (e.g., 70%, 80%, etc.).
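One way to compute such a match score is sketched below in Python using the standard library's `difflib.SequenceMatcher`. The normalization step, the choice to multiply string similarity by the categorization confidence, and the 0.8 default threshold are assumptions made for illustration; the disclosure leaves the scoring technique open.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and keep only letters and digits so formatting differences are ignored."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def match_score(requested: str, stored: str, categorization_confidence: float = 1.0) -> float:
    """Similarity of the two values (0.0 to 1.0), weighted by categorization confidence."""
    similarity = SequenceMatcher(None, normalize(requested), normalize(stored)).ratio()
    return similarity * categorization_confidence


def is_match(requested: str, stored: str,
             categorization_confidence: float = 1.0, threshold: float = 0.8) -> bool:
    """A pairing is a match when its score meets or exceeds the threshold."""
    return match_score(requested, stored, categorization_confidence) >= threshold
```

Under these assumptions, "(555) 010-0199" and "555-010-0199" normalize to the same digits and match, while the same pair would not match if the categorization confidence were only 0.5.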

In response to determining whether each piece of information in a request is relevant, the unstructured data redaction system may then discard, redact, or otherwise ignore the pieces of request information that do not match personal data associated with the data subject. The unstructured data redaction system may then process the request using the unredacted data included in the request, and, in particular embodiments, data retrieved using identity graphs as described herein.

FIG. 4 shows an example process that may be performed by an Automatic Unstructured Data Redaction Module 400. In executing the Automatic Unstructured Data Redaction Module 400, the unstructured data redaction system begins at Step 410 where the unstructured data redaction system receives a request for data associated with a particular data subject, such as a DSAR. This request may take the form of a message or other electronic communication that may include message data that includes unstructured data. For example, the request may be an email that includes message data (e.g., email body) that is unstructured data. In another example, the request may be a text message that includes message data (e.g., text message content) that includes unstructured data. Using information in or associated with the request, at Step 420 the unstructured data redaction system may identify and retrieve an identity graph associated with the particular data subject's personal data. For example, the request may include structured data fields that may be populated with a data subject's name, email address, telephone number, and/or other data type identifiers (e.g., user name, IP address, account number, member number, etc.). The unstructured data redaction system may use this structured data to identify the particular data subject and retrieve the identity graph associated with the particular data subject's personal data.

At Step 430, the unstructured data redaction system may use the identity graph associated with the particular data subject's personal data to determine and/or retrieve (e.g., all or any portion of) the available personal data associated with the data subject from the data sources represented in the identity graph, for example, using the access methods and/or data type identifiers indicated for each data source in its respective identity graph node. In various embodiments, the unstructured data redaction system may use the structured data from the request to search those data sources that are indicated in the identity graph as being searchable using the type of data associated with the structured data. For example, when the request includes a structured telephone number data field, the unstructured data redaction system may use the value of this field to search a first data source that is searchable by telephone number. Using the results of this initial search (e.g., an email address), the unstructured data redaction system may then search a second data source that is searchable by email address but not telephone number. If there is a third data source that is not searchable by either telephone number or email address, the unstructured data redaction system may use the results of searches of the first and second data sources to further search this third data source, using the results of that search to search subsequent data sources and so on, until the available personal data associated with the data subject has been retrieved from the data sources indicated in the identity graph.

At Step 440, the unstructured data redaction system may analyze the unstructured data in the request to assign a category and/or classification to each piece of information in the unstructured data portion of the request. In classifying and/or categorizing each such piece of information, the unstructured data redaction system may assign each piece of information a data type identifier, for example, selected from the identifiers of types of personal data that may be used by the unstructured data redaction system to categorize pieces of personal data (e.g., as identified in identity graphs as described herein). Examples of such data type identifiers include, but are not limited to, email, address, first name, last name, postal code, telephone number, etc. In particular embodiments, the unstructured data redaction system may use data type identifiers that indicate that a piece of unstructured data is irrelevant, for example, the conversational text within a request. In other embodiments, the unstructured data redaction system may assign a data type identifier associated with a particular type of personal data to every piece of unstructured data and rely on a low confidence score to indicate that a particular piece of unstructured data is irrelevant (e.g., not a good match for a data type identifier, such as conversational text). As noted, the unstructured data redaction system may use NLP, machine learning, neural networks, and/or any other advanced processing techniques to identify and categorize each piece of information in the unstructured data portion of the request.

At Step 450, the unstructured data redaction system may determine a confidence score for each categorization and/or classification assigned to each piece of information in the unstructured data portion of the request. The unstructured data redaction system may use any of various techniques to determine a confidence score, such as machine learning and NLP, and, in particular embodiments, may integrate human feedback as described in more detail below. In particular embodiments, a confidence score may have a (e.g., numerical) value that may be compared to a threshold value (e.g., 70% confident a particular piece of information is a postal code, 80% confident a particular piece of information is a telephone number, 50% confident a particular piece of information is message text, etc.). In particular embodiments, a confidence score may indicate that a piece of unstructured data is irrelevant regardless of the categorization and/or classification. For example, the unstructured data redaction system may classify conversational text within a request as “names” because it is made up of character strings, but because such text has no other attributes of the “names” classification, the unstructured data redaction system may assign a very low or zero confidence score to such unstructured data.
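Steps 440 and 450 can be illustrated with a deliberately simple rule-based categorizer. This Python sketch substitutes regular expressions for the NLP and machine learning techniques named above; the patterns, the fixed confidence values, and the "message text" fallback label are all assumptions chosen to mirror the example percentages, not values taken from any actual embodiment.

```python
import re

# (data type identifier, pattern, confidence) triples, checked in order.
# Hypothetical rules: a trained model would normally produce these scores.
PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.95),
    ("telephone number", re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"), 0.80),
    ("postal code", re.compile(r"\d{5}(?:-\d{4})?"), 0.70),
]


def categorize(piece: str) -> tuple:
    """Return (data_type_identifier, confidence) for one piece of unstructured data."""
    for data_type, pattern, confidence in PATTERNS:
        if pattern.fullmatch(piece):
            return data_type, confidence
    # Anything unrecognized is treated as conversational text.
    return "message text", 0.50
```

Because the rules are checked in order and require a full match, a five-digit string is categorized as a postal code rather than as part of a telephone number.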

At Step 460, the unstructured data redaction system may map eligible pieces of the categorized request information to pieces of the data subject's personal data based on the data type identifiers and categories associated with each such piece of information. The unstructured data redaction system may discard or ignore those pieces of the request information that remain uncategorized, have too low a confidence score, or are categorized as being data ineligible for mapping to the data subject's personal data. For example, the unstructured data redaction system may have determined that a piece of request information appears to be a telephone number and may have therefore categorized that piece of request information as a “telephone number.” The unstructured data redaction system may then match that piece of request information to a retrieved piece of personal data having a “telephone number” data type identifier (e.g., as indicated in the data subject's identity graph) to determine a data pairing that may then be compared as described below.

At Step 470, the unstructured data redaction system may compare each piece of categorized unstructured data from the request to the piece of retrieved personal data to which it is mapped to determine whether the pieces of data match. For example, where the unstructured data redaction system has paired a piece of unstructured data categorized as a “telephone number” with a retrieved piece of personal data having a “telephone number” data type identifier, the unstructured data redaction system may then determine whether the telephone numbers represented by these pieces of information are the same telephone number. As noted above, various techniques may be used to determine whether the pieces of data match.

In particular embodiments, the unstructured data redaction system may determine a confidence or match score for each pairing and determine whether the pieces of data in the pairing match based on the score. For example, the unstructured data redaction system may compare a confidence or match score for a data pairing (e.g., 50%, 75%, 90%, etc.) against a threshold confidence score (e.g., 75%, 85%, etc.) and determine that the data in the pairing matches if the confidence score meets or exceeds the threshold. The unstructured data redaction system may also, or instead, take into account a confidence score for one or both pieces of data in a pairing (e.g., a confidence score for the categorization of a piece of unstructured data and/or a confidence score for the categorization of a piece of personal data retrieved from a data source using an identity graph). Other criteria may also be used.

At Step 480, the unstructured data redaction system may discard, redact, or otherwise ignore the pieces of unstructured data from the request that do not match personal data associated with the data subject and are therefore irrelevant to the request. Further at Step 480, the unstructured data redaction system may then process the request using the unredacted unstructured data included in the request, and, in particular embodiments, data retrieved using identity graphs as described herein.

A simplified example illustrating the operation of an unstructured data redaction system with reference to various exemplary data structures will now be described. FIG. 5 illustrates exemplary data structures and operations 500, including a representation of an exemplary DSAR 510. The DSAR 510 may include a structured data field 511 and unstructured data 512. The unstructured data redaction system may determine a data type identifier for the data subject associated with the DSAR 510 based on the value in the structured data field 511, in this example, an email address. Using the determined data subject identifier, at operation 501 the unstructured data redaction system may identify and retrieve personal data associated with the data subject 530, in particular embodiments using an identity graph associated with the data subject as described herein. Further at operation 501, the unstructured data redaction system may analyze and categorize the unstructured data 512 to generate categorized unstructured data 520. During this process, the unstructured data redaction system may discard those portions of the unstructured data 512 that are not eligible for matching with the data subject's personal data (e.g., message data such as email text or text message content that does not contain categorizable potentially relevant data).

At operation 502, the unstructured data redaction system may map pieces of the categorized unstructured data 520 to pieces of the retrieved personal data 530 based on the respective categorizations of each piece of data (e.g., as described herein). For example, as shown in this figure, the data categorized as telephone numbers in the categorized unstructured data 520 is mapped to the data categorized as a telephone number in the retrieved personal data 530, the data categorized as names in the categorized unstructured data 520 is mapped to the data categorized as a name in the retrieved personal data 530, and the data categorized as email addresses in the categorized unstructured data 520 is mapped to the data categorized as an email address in the retrieved personal data 530.

At operation 503, the unstructured data redaction system determines whether each piece of the categorized unstructured data 520 matches the piece of retrieved personal data 530 to which it is mapped and redacts those pieces of the categorized unstructured data 520 that do not match a piece of retrieved personal data 530 to generate the redacted unstructured data 540. The unstructured data redaction system generates the relevant unstructured data 550 using the redacted unstructured data 540 and provides the relevant unstructured data 550 for use in processing the DSAR 510 at operation 504.
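Operation 503 can be sketched in Python as a split of the categorized pieces into those that are kept and those that are redacted. The function name `redact`, the data layout, and the use of a strict case-insensitive string comparison as the match test are assumptions for illustration; the sample names, addresses, and numbers are invented and do not come from FIG. 5.

```python
def redact(categorized: list, personal_data: dict) -> tuple:
    """Split categorized pieces into (relevant, redacted) lists.

    categorized:   list of (data_type, value) pieces from the unstructured data
    personal_data: data_type -> set of values retrieved for the data subject
    """
    relevant, redacted = [], []
    for data_type, value in categorized:
        stored = {v.lower() for v in personal_data.get(data_type, ())}
        if value.lower() in stored:
            relevant.append((data_type, value))   # matches known personal data
        else:
            redacted.append((data_type, value))   # extraneous to the request
    return relevant, redacted


categorized = [
    ("name", "Jane Doe"), ("name", "Sam Lee"),
    ("email", "JANE@example.com"),
    ("telephone number", "555-010-0199"), ("telephone number", "555-010-0142"),
]
personal = {
    "name": {"Jane Doe"},
    "email": {"jane@example.com"},
    "telephone number": {"555-010-0199"},
}
relevant, redacted = redact(categorized, personal)
```

Here the second name and the second telephone number belong to someone other than the data subject, so they land in `redacted` while the remaining three pieces are kept for processing the request.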

In various embodiments, the unstructured data redaction system may be configured to perform functions that improve the performance and accuracy of categorization and/or classification determinations and the matching process. In particular embodiments, the unstructured data redaction system may generate a graphical user interface configured with presentation elements that present categorization, classification, and/or matching data to allow the user to review the results of the processes that generated this data. The unstructured data redaction system may further configure user input elements on the graphical user interface to allow a user to provide input regarding the presented data. Alternatively, or in addition, the unstructured data redaction system may configure navigation elements configured to trigger the generation of a subsequent graphical user interface that may allow the user to provide input regarding the presented data. Such interfaces may improve request data relevance determinations by gathering feedback that the unstructured data redaction system may use to improve the categorization and/or classification of data received in requests and the matching process.

In a particular example, the unstructured data redaction system may present a graphical user interface to the user indicating the pairings of potential matches that the unstructured data redaction system has identified and associated confidence levels for each pairing. In particular embodiments, the confidence levels that may be used include, but are not limited to: (1) a confidence level that a piece of information identified from a request is a particular type of information; (2) a confidence level that a piece of information retrieved from a data source using an identity graph is associated with a particular data subject; and (3) a confidence level that a piece of information identified from a request matches a piece of information retrieved from a data source using an identity graph. In particular embodiments, the unstructured data redaction system may present all such pairings to the user, while in other embodiments, the unstructured data redaction system may present only pairings that are associated with a confidence below or above a certain threshold. The unstructured data redaction system may prompt the user for input via a user input control element configured on the graphical user interface indicating whether such potential categorizations and/or matches are accurate. The unstructured data redaction system may use such information to further refine the various disclosed embodiments.

In various embodiments, the unstructured data redaction system may automatically use information from those matches having a confidence level above a certain threshold in processing the request. Also, or instead, the unstructured data redaction system may automatically exclude information from those pairings having a confidence level below a certain threshold from use in processing the request.
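By way of illustration only, the threshold-based handling of pairings described above may be sketched as follows. This sketch assumes that each confidence level is a number between 0.0 and 1.0 and that pairings falling between the two thresholds are routed to a user for review. The class name `Pairing`, the function name `triage`, and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pairing:
    request_value: str   # piece of information identified from the request
    known_value: str     # piece of information retrieved via the identity graph
    confidence: float    # assumed to lie in the range 0.0 to 1.0


def triage(pairings, use_threshold=0.9, exclude_threshold=0.5):
    """Sort pairings into (auto_use, auto_exclude, needs_review) lists.

    The threshold values are illustrative assumptions only.
    """
    auto_use, auto_exclude, needs_review = [], [], []
    for pairing in pairings:
        if pairing.confidence > use_threshold:
            auto_use.append(pairing)        # used automatically in processing
        elif pairing.confidence < exclude_threshold:
            auto_exclude.append(pairing)    # excluded automatically
        else:
            needs_review.append(pairing)    # presented to the user for input
    return auto_use, auto_exclude, needs_review
```

Using two thresholds rather than one leaves a middle band of uncertain pairings, which corresponds to the embodiments in which only pairings above or below a certain confidence are presented to the user.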

In various embodiments, the unstructured data redaction system may generate notifications when particular outlier or unusual events occur in analyzing a request. For example, if the unstructured data redaction system determines that the redacted portion of unstructured data in a request exceeds a certain threshold percentage of the total amount of unstructured data or the total amount of request content, the unstructured data redaction system may not process that request at all. In such cases, the unstructured data redaction system may flag the request and/or transmit the request to a user for manual review before processing. In such cases the unstructured data redaction system may also, or instead, inform the data subject (or the user submitting the request on behalf of the data subject) that the request was not processed. For example, a request with 99% of its content redacted may indicate a problematic request, as opposed to a request having only 20% of its content redacted.
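By way of illustration only, the outlier check described above may be sketched as follows. This sketch assumes that the amount of request content is measured in characters and that an empty request is itself treated as unusual. The function name `should_flag_for_review` and the default threshold of 80% are hypothetical.

```python
def should_flag_for_review(total_chars, redacted_chars, max_redacted_fraction=0.8):
    """Return True when the redacted share of a request exceeds the threshold.

    Measuring content in characters and the 0.8 default are assumptions
    made for this sketch only.
    """
    if total_chars <= 0:
        # Assumption: a request with no content is flagged rather than processed.
        return True
    return redacted_chars / total_chars > max_redacted_fraction
```

Under these assumptions, a request with 99% of its content redacted would be flagged for manual review, while a request with 20% of its content redacted would proceed to processing.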

Technical Contributions of Various Embodiments

An entity that handles (e.g., collects, receives, transmits, stores, processes, shares, and/or the like) sensitive and/or personal information associated with particular individuals (e.g., personally identifiable information (PII) data, sensitive data, personal data, etc.) may receive data requests from users for information relating to personal data associated with a data subject and/or requests to modify and/or delete such personal data, for example, as a data subject access request (DSAR). Because an entity may have many systems of many different types that handle personal data in various ways, processing a data request may require significant resources to locate, retrieve, and/or modify personal data based on a data request. Processing a data request becomes even more challenging when it includes extraneous data that is unrelated to the request. Such extraneous data reduces the entity's ability to efficiently process data requests by utilizing resources for processing data unnecessarily. Moreover, requests with such extraneous data may be processed improperly due to the potential confusion of relevant data in the request with irrelevant data unrelated to the request. For example, a request may include two or three unrelated telephone numbers along with a telephone number of the data subject associated with the request. This is especially an issue with data requests that include unstructured data (e.g., data for which data types and/or associations are not indicated, such as email body text, text message content, etc.), which is increasingly common as users gravitate towards simpler methods of submitting data requests. In conventional systems, such extraneous data is either processed, resulting in inefficient and unnecessary usage of system resources, or manually redacted in a time-consuming and human resource-intensive operation.

Accordingly, various embodiments of the present disclosure overcome many of the technical challenges associated with processing data requests that include unstructured data. More particularly, various embodiments of the present disclosure include implementing a limited set of rules in a process for automatically redacting irrelevant data from unstructured data included in a data request before processing the request. The various embodiments of the disclosure are directed to a computational framework configured for categorizing unstructured data in a data request, comparing the unstructured data to known personal data based on the data type, and redacting that unstructured data that does not match personal data of the same data type. Specifically, the unstructured data redaction system discovers personal data as described herein to locate (e.g., all) available personal data across various data sources and identify a data type for each such piece of data. The unstructured data redaction system generates an identity graph representing the personal data, with each node of the identity graph indicating a particular data source, the types of personal data stored at that data source, the method of accessing that data source, and the data type identifier that may be used at that data source. In response to receiving or detecting a request for data that includes unstructured data, the unstructured data redaction system analyzes the unstructured data to determine categorizations for each piece of such data. The unstructured data redaction system retrieves known personal data using the identity graph and compares the categorized pieces of unstructured data to the pieces of known personal data having the same categorization to determine whether the pieces match. Those pieces of unstructured data that match personal data are permitted to remain in the request while the pieces that do not match personal data are redacted from the request. 
The request including only unredacted unstructured data can then be processed much more efficiently than a request that would have included all the unstructured data. By automatically redacting irrelevant unstructured data from a data request, the various embodiments represent a significant improvement to existing and conventional processes for addressing data requests that include extraneous data.
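By way of illustration only, an identity graph of the kind described above may be sketched as follows. This sketch assumes that each node records a data source, the method of accessing that data source, and a mapping from each categorization of personal data stored there to the data type identifier used at that data source. The class names `DataSourceNode` and `IdentityGraph`, the method name `sources_for`, and the sample source names and identifiers mentioned below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSourceNode:
    source_name: str      # the particular data source
    access_method: str    # assumed to be a label such as "sql" or "rest"
    # categorization -> data type identifier used at this data source
    type_identifiers: dict = field(default_factory=dict)


@dataclass
class IdentityGraph:
    nodes: list = field(default_factory=list)

    def sources_for(self, categorization):
        """List (source, access method, data type identifier) for every
        data source that stores data of the given categorization."""
        return [
            (node.source_name, node.access_method, node.type_identifiers[categorization])
            for node in self.nodes
            if categorization in node.type_identifiers
        ]
```

For example, a graph holding one node for a relational database that stores email addresses under a column named `EMAIL_ADDR` and another node for a web service that exposes them under a field named `contact.email` would return both entries when asked for the email categorization, allowing known personal data of that categorization to be retrieved from each source.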

Accordingly, various embodiments of the disclosure provided herein are more effective, efficient, accurate, and faster in determining the appropriate information to retain in a data request when the original request includes unstructured data. The various embodiments of the disclosure provided herein provide improved means of redacting irrelevant unstructured data from a data request by locating personal data across multiple data sources using a generated identity graph, categorizing unstructured data in a data request, and redacting irrelevant unstructured data from the data request based on a comparison of the categorized unstructured data to the personal data. This is especially advantageous when an entity receives many data requests in a variety of forms from many users and data subjects. In facilitating the efficient redaction of irrelevant unstructured data from data requests, the various embodiments of the present disclosure make major technical contributions to improving the computational efficiency and reliability of various privacy management systems and procedures for data request processing. This in turn translates to more computationally efficient software systems.

Example Technical Platforms

As will be appreciated by one skilled in the relevant field, data processing systems and methods for automatically redacting unstructured data from a data subject access request, according to various embodiments described herein, may be, for example, embodied as a computer system, a method, or a computer program product. Accordingly, various embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, particular embodiments may take the form of a computer program product stored on a computer-readable storage medium having computer-readable instructions (e.g., software) embodied in the storage medium. Various embodiments may take the form of web, mobile, and/or wearable computer-implemented computer software. Any suitable computer-readable storage medium may be utilized including, for example, hard disks, compact disks, DVDs, optical storage devices, and/or magnetic storage devices.

It should be understood that each step described herein as being executed by an unstructured data redaction system or systems (and/or other steps described herein), and any combinations of such steps, may be implemented by a computer executing computer program instructions. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the various steps described herein.

These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable data processing apparatus to function in a particular manner such that the instructions stored in the computer-readable memory produce an article of manufacture that is configured for implementing the function specified in the flowchart step or steps. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart step or steps.

Accordingly, steps of the block diagrams and flowchart illustrations support combinations of mechanisms for performing the specified functions, combinations of steps for performing the specified functions, and program instructions for performing the specified functions. It should also be understood that each step, and combinations of such steps, may be implemented by special-purpose hardware-based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and other hardware executing appropriate computer instructions.

Example System Architecture

FIG. 6 is a block diagram of a system 600 according to a particular embodiment. As may be understood from this figure, the system 600 may include one or more computer networks 610, a server 620, a storage device 630 (that may, in various embodiments, contain one or more databases of information that may include personal data), and/or one or more client computing devices such as a tablet computer 640, a desktop or laptop computer 650, a handheld computing device 660 (e.g., a cellular phone, a smart phone, etc.), a browser and Internet capable set-top box 670 connected with a television (e.g., a television 680), and/or a smart television 680 having browser and Internet capability. The client computing devices attached to the network may also, or instead, include scanners/copiers/printers/fax machines 690 having one or more hard drives (a security risk since copies/prints may be stored on these hard drives). The server 620, client computing devices, and storage device 630 may be physically located in a central location, such as the headquarters of an organization, for example, or in separate facilities. The devices may be owned or maintained by employees, contractors, or other third parties (e.g., a cloud service provider, a copier vendor). In particular embodiments, the computer networks 610 facilitate communication between the server 620, one or more client computing devices 640, 650, 660, 670, 680, 690, and storage device 630.

The computer networks 610 may include any of a variety of types of wired and/or wireless computer networks and any combination thereof, such as the Internet, a private intranet, a public switched telephone network (PSTN), or any other type of network. The communication link between the server 620, one or more client computing devices 640, 650, 660, 670, 680, 690, and storage device 630 may be, for example, implemented via a Local Area Network (LAN), a Wide Area Network (WAN), and/or via the Internet.

Example Computer Architecture

FIG. 7 illustrates a diagrammatic representation of the architecture of a computer 700 that may be used within the system 600, for example, as a client computer (e.g., one of computing devices 640, 650, 660, 670, 680, 690, shown in FIG. 6) and/or as a server computer (e.g., server 620 shown in FIG. 6). In exemplary embodiments, the computer 700 may be suitable for use as a computer within the context of the system 600 that is configured to operationalize the various aspects of the exemplary unstructured data redaction systems described herein. In particular embodiments, the computer 700 may be connected (e.g., networked) to other computers in a LAN, an intranet, an extranet, and/or the Internet. As noted above, the computer 700 may operate in the capacity of a server or a client computer in a client-server network environment or as a peer computer in a peer-to-peer (or distributed) network environment. The computer 700 may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, smart phone, a web appliance, a server, a network router, a switch or bridge, or any other computer capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that computer. Further, while only a single computer is illustrated, the term “computer” as used herein shall also be taken to include any collection of computers that individually or jointly execute a set (or multiple sets) of instructions to perform any of the methodologies discussed herein.

The exemplary computer 700 may include a processor 702, a main memory 704 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 706 (e.g., flash memory, static random access memory (SRAM), etc.), and/or a data storage device 718, which communicate with each other via a bus 732.

The processor 702 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processor 702 may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, or a processor or processors implementing other instruction sets and/or any combination of instruction sets. The processor 702 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processor 702 may be configured to execute processing logic 726 for performing various operations and steps discussed herein.

The computer 700 may further include a network interface device 708. The computer 700 also may include a video display unit 710 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 712 (e.g., a keyboard), a cursor control device 714 (e.g., a mouse), and/or a signal generation device 716 (e.g., a speaker). The data storage device 718 may include a non-transitory computer-readable storage medium 730 (also known as a non-transitory computer-readable medium) on which may be stored one or more sets of instructions 722 (e.g., software, software modules) embodying any one or more of the methodologies and/or functions described herein. The instructions 722 may also reside, completely or at least partially, within the main memory 704 and/or within the processor 702 during execution thereof by computer 700, the main memory 704 and the processor 702 also constituting computer-accessible storage media. The instructions 722 may further be transmitted or received over a network 610 via network interface device 708.

While the computer-readable storage medium 730 is shown in an exemplary embodiment to be a single medium, the terms “computer-readable storage medium” and “machine-accessible storage medium” should be understood to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the sets of instructions. The term “computer-readable storage medium” should also be understood to include any medium or media that is capable of storing, encoding, and/or carrying a set of instructions for execution by a computer and that cause a computer to perform any one or more of the methodologies described herein. The term “computer-readable storage medium” should accordingly be understood to include, but not be limited to, solid-state memories, optical and magnetic media, etc.

Exemplary System Platform

According to various embodiments, the processes and logic flows described in this specification may be performed by a system (e.g., system 600) that includes, but is not limited to, one or more programmable processors (e.g., processor 702) executing one or more computer program modules to perform functions by operating on input data and generating output, thereby tying the process to a particular machine (e.g., a machine programmed to perform the processes described herein). This includes processors located in one or more of client computers (e.g., client computing devices 640, 650, 660, 670, 680, 690 of FIG. 6). These devices connected to the computer networks 610 may access and execute one or more Internet browser-based program modules that are “served up” through the computer networks 610 by one or more servers (e.g., server 620 of FIG. 6), and the data associated with the program may be stored on one or more storage devices, which may reside within a server or computing device (e.g., main memory 704, static memory 706), be attached as a peripheral storage device to the servers or computing devices, and/or be attached to the network (e.g., storage 630).

Advanced Processing in Various Embodiments

In various embodiments, the unstructured data redaction system uses advanced processing techniques to locate personal data, generate identity graphs, perform unstructured data redaction, and/or implement any of the various aspects of the disclosed unstructured data redaction systems and methods. In particular embodiments, the unstructured data redaction system may determine a type of one or more pieces of personal data that are stored in one or more data sources using advanced processing techniques that may include artificial intelligence, machine learning, neural networking, big data methods, natural language processing, contextual awareness, and/or continual learning (in any combination). In particular embodiments, the unstructured data redaction system may match one or more pieces of personal data that are stored in one or more data sources with one or more other pieces of personal data that are stored in one or more other data sources using any one or more of these advanced processing techniques and/or any combination thereof. In various embodiments, the unstructured data redaction system may use any such advanced processing techniques to mine various data sources for personal data stored therein to determine data types and relationships. In various embodiments, the unstructured data redaction system may use any such advanced processing techniques to perform any of the processing (e.g., execute any of the modules) described herein to locate, identify, retrieve, modify, and/or perform any other functions related to personal data, including generating identity graphs and performing unstructured data redaction.

In particular embodiments, one or more neural networks may be used to implement any of the advanced processing techniques described herein. A neural network, according to various embodiments, may include a plurality of nodes that mimic the operation of the human brain, a training mechanism that analyzes supplied information, and/or a personal data location engine for performing any one or more of the functions involving personal data as described herein, including, but not limited to, generating identity graphs and performing unstructured data redaction. The neural network may also perform any of the processing (e.g., execute any of the modules) described herein to locate, identify, retrieve, modify, and/or perform any other functions on personal data. In various embodiments, each of the nodes may include one or more weighted input connections, one or more transfer functions that combine the inputs, and one or more output connections. In particular embodiments, the neural network is a variational autoencoder (AE) neural network, a denoising AE neural network, any other suitable neural network, or any combination thereof.
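By way of illustration only, a single node having weighted input connections, a transfer function that combines the inputs, and an output may be sketched as follows. This sketch assumes a weighted sum followed by a logistic (sigmoid) transfer function, which is one common choice and is not specified by the disclosure. The function name `node_output` and the `bias` parameter are hypothetical.

```python
import math


def node_output(inputs, weights, bias=0.0):
    """Combine weighted inputs and apply a logistic transfer function.

    The logistic function is an assumed choice for this sketch only.
    """
    # Weighted input connections: each input is scaled by its weight.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Transfer function: squashes the combined input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-weighted_sum))
```

The value returned by such a node would be passed along its output connections as an input to nodes in a subsequent layer.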

CONCLUSION

Although embodiments above are described in reference to various automatic unstructured data redaction and personal data discovery systems, it should be understood that various aspects of the unstructured data redaction system described above may be applicable to other types of systems, in general.

While this specification contains many specific embodiment details, these should not be construed as limitations on the scope of any embodiment or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.

Similarly, while operations are described in a particular order, this should not be understood as requiring that such operations be performed in the particular order described or in sequential order, or that all described operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.

Many modifications and other embodiments will come to mind to one skilled in the art to which this disclosure pertains having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for the purposes of limitation.

Claims

1.-20. (canceled)

21. A method comprising:

receiving, by computing hardware, a request for target data for a data subject, wherein the request comprises identifying data for the data subject;

identifying, by the computing hardware, a categorization for the identifying data;

identifying, by the computing hardware, that an identity graph for the data subject includes the identifying data mapped to the categorization;

identifying, by the computing hardware, a data type identifier associated with a data source that matches the categorization, wherein the data type identifier can be used in querying the data source;

querying, by the computing hardware based on the data type identifier, the data source using the identifying data;

receiving, by the computing hardware, responsive data for the data subject from querying the data source, wherein the responsive data comprises a first piece of data matching the identifying data;

identifying, by the computing hardware, the categorization for a second piece of data found in the responsive data;

determining, by the computing hardware and based on the identity graph, that the second piece of data found in the responsive data does not match the identifying data; and

responsive to determining that the second piece of data found in the responsive data does not match the identifying data, redacting, by the computing hardware, the second piece of data from the responsive data.

22. The method of claim 21, wherein identifying the categorization for the identifying data comprises:

determining a confidence score for the categorization;

determining that the confidence score satisfies a threshold value; and

determining the categorization for the identifying data based on the confidence score satisfying the threshold value.

23. The method of claim 21 further comprising:

scanning the data source to identify that the data type identifier is available in the data source for the data subject;

identifying that the data type identifier is associated with the categorization;

generating the identity graph for the data subject to include the data type identifier mapped to the categorization; and

generating metadata comprising the identity graph and identifying that the data type identifier can be used in querying the data source.

24. The method of claim 23, wherein identifying that the data type identifier is associated with the categorization comprises:

determining a classification for the data type identifier and corresponding confidence score for the classification;

determining that the corresponding confidence score satisfies a threshold value; and

responsive to determining that the corresponding confidence score satisfies the threshold value, determining that the categorization is associated with the data type identifier based on the classification.

25. The method of claim 23, wherein the metadata comprises a manner of accessing the responsive data for the data source and querying the data source is carried out using the manner.

26. The method of claim 21 further comprising:

generating a graphical user interface for a browser application executed on a user device by configuring a display element to display at least a portion of the responsive data without the second piece of data as the target data; and

transmitting an instruction to the browser application causing the browser application to present the graphical user interface on the user device.

27. The method of claim 21, wherein the target data comprises personal data of the data subject.

28. A system comprising:

a non-transitory computer-readable medium storing instructions; and

processing hardware communicatively coupled to the non-transitory computer-readable medium, wherein the processing hardware is configured to execute the instructions and thereby perform operations comprising: receiving a request for target data, wherein the request comprises identifying data; identifying a categorization for the identifying data; identifying that an identity graph includes the identifying data mapped to the categorization; identifying a data type identifier associated with a data source that matches the categorization, wherein the data type identifier can be used in querying the data source; responsive to identifying the data type identifier can be used in querying the data source, querying the data source using the identifying data; receiving responsive data from querying the data source, wherein the responsive data comprises a first piece of data matching the identifying data; identifying the categorization for a second piece of data found in the responsive data; determining, based on the identity graph, that the second piece of data found in the responsive data does not match the identifying data; and responsive to determining that the second piece of data found in the responsive data does not match the identifying data, redacting the second piece of data from the responsive data.

29. The system of claim 28, wherein identifying the categorization for the identifying data comprises:

determining a confidence score for the categorization; and

determining the categorization for the identifying data based on the confidence score.

30. The system of claim 28, wherein the operations further comprise:

scanning the data source to identify that the data type identifier is available in the data source;

identifying that the data type identifier is associated with the categorization;

generating the identity graph to include the data type identifier mapped to the categorization; and

generating metadata comprising the identity graph and identifying that the data type identifier can be used in querying the data source.

31. The system of claim 30, wherein identifying that the data type identifier is associated with the categorization comprises:

determining a classification for the data type identifier and corresponding confidence score for the classification;

determining that the corresponding confidence score satisfies a threshold value; and

responsive to determining that the corresponding confidence score satisfies the threshold value, determining that the categorization is associated with the data type identifier based on the classification.

32. The system of claim 30, wherein the metadata comprises a manner of accessing the responsive data for the data source and querying the data source is carried out using the manner.

33. The system of claim 28, wherein the operations further comprise:

generating a graphical user interface for a browser application executed on a user device by configuring a display element to display at least a portion of the responsive data without the second piece of data as the target data; and

transmitting an instruction to the browser application causing the browser application to present the graphical user interface on the user device.

34. A method comprising:

receiving, by computing hardware, a request for target data for a data subject, wherein the request comprises identifying data for the data subject;

identifying, by the computing hardware, a first categorization for the identifying data;

identifying, by the computing hardware, that an identity graph for the data subject includes the identifying data mapped to the first categorization;

identifying, by the computing hardware, a first data type identifier associated with a first data source that matches the first categorization, wherein the first data type identifier can be used in querying the first data source;

querying, by the computing hardware using the first data type identifier, the first data source;

receiving, by the computing hardware, first responsive data for the data subject from querying the first data source, wherein the first responsive data comprises a first piece of data matching the identifying data;

identifying, by the computing hardware, a second categorization for a second piece of data in the first responsive data;

identifying, by the computing hardware, that the identity graph for the data subject includes the second piece of data mapped to the second categorization;

identifying, by the computing hardware, a second data type identifier associated with a second data source that matches the second categorization, wherein the second data type identifier can be used in querying the second data source;

querying, by the computing hardware using the second data type identifier, the second data source;

receiving, by the computing hardware, second responsive data for the data subject from querying the second data source, wherein the second responsive data comprises a third piece of data matching the second piece of data;

identifying, by the computing hardware, the second categorization for a fourth piece of data found in the second responsive data;

determining, by the computing hardware and based on the identity graph, that the fourth piece of data found in the second responsive data does not match the second piece of data; and

responsive to determining that the fourth piece of data found in the second responsive data does not match the second piece of data, redacting, by the computing hardware, the fourth piece of data from the second responsive data.
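
By way of illustration only, the chained querying and redaction recited in the method above might be sketched as follows. The sketch assumes that each data source is a list of flat records, that a data type identifier is a field name, that the identity graph is a mapping from a categorization to the values known for the data subject, and that categorization is done by simple pattern matching. The names `categorize`, `query`, `redact`, `handle_request`, and the sample data are hypothetical.

```python
import re
from typing import Optional

REDACTED = "[REDACTED]"


def categorize(value: str) -> Optional[str]:
    """Hypothetical categorizer: infers a categorization from the value's shape."""
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    if re.fullmatch(r"\+?[\d\- ]{7,15}", value):
        return "phone"
    return None


def query(source, identifier, value):
    """Return copies of the records whose `identifier` field equals `value`."""
    return [dict(record) for record in source if record.get(identifier) == value]


def redact(records, graph):
    """Mask categorized values that the identity graph does not list for the subject."""
    redacted = []
    for record in records:
        clean = {}
        for key, value in record.items():
            cat = categorize(value)
            if cat is not None and value not in graph.get(cat, set()):
                clean[key] = REDACTED
            else:
                clean[key] = value
        redacted.append(clean)
    return redacted


def handle_request(identifying_data, graph, sources, identifiers):
    """Query the first source, follow known values into the second, then redact."""
    first_cat = categorize(identifying_data)
    if first_cat is None or identifying_data not in graph.get(first_cat, set()):
        return [], []
    first_identifier = identifiers["first"].get(first_cat)
    if first_identifier is None:
        return [], []
    first = query(sources["first"], first_identifier, identifying_data)
    second = []
    for record in first:
        for value in record.values():
            second_cat = categorize(value)
            if value == identifying_data or second_cat is None:
                continue
            if value not in graph.get(second_cat, set()):
                continue  # not known to belong to the data subject
            second_identifier = identifiers["second"].get(second_cat)
            if second_identifier is not None:
                second.extend(query(sources["second"], second_identifier, value))
    return first, redact(second, graph)


# Hypothetical sample data.
graph = {"email": {"ann@example.com"}, "phone": {"555-0100"}}
sources = {
    "first": [{"email_addr": "ann@example.com", "contact_phone": "555-0100"}],
    "second": [{"caller": "555-0100", "alt_caller": "555-0199",
                "note": "billing question"}],
}
identifiers = {"first": {"email": "email_addr"}, "second": {"phone": "caller"}}

first_data, second_data = handle_request("ann@example.com", graph, sources, identifiers)
```

Here the telephone number found in the first responsive data is used to query the second data source. The value "555-0199" shares the telephone categorization but is absent from the identity graph, so it is replaced with the placeholder in the second responsive data, while the uncategorized note is left unchanged.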

35. The method of claim 34, wherein identifying the second categorization for the second piece of data comprises:

determining a confidence score for the second categorization;

determining that the confidence score satisfies a threshold value; and

determining the second categorization for the second piece of data based on the confidence score satisfying the threshold value.

36. The method of claim 34, further comprising:

scanning the first data source to identify that the first data type identifier is available in the first data source for the data subject;

identifying that the first data type identifier is associated with the first categorization;

generating the identity graph for the data subject to include the first data type identifier mapped to the first categorization;

scanning the second data source to identify that the second data type identifier is available in the second data source for the data subject;

identifying that the second data type identifier is associated with the second categorization;

updating the identity graph for the data subject to include the second data type identifier mapped to the second categorization; and

generating metadata comprising the identity graph and identifying that the first data type identifier can be used in querying the first data source and the second data type identifier can be used in querying the second data source.
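
By way of illustration only, the scanning and metadata generation recited above might be sketched as follows. The sketch assumes that a data type identifier is a field name discovered in a source's records and that a keyword-based classifier returns a classification with a confidence score. The names `classify_field`, `scan_source`, `build_metadata`, the scores, and the threshold are hypothetical.

```python
def classify_field(field_name):
    """Hypothetical classifier: returns (classification, confidence) for a field name."""
    name = field_name.lower()
    if "email" in name:
        return "email", 0.95
    if "phone" in name or "caller" in name:
        return "phone", 0.90
    return "unknown", 0.10


def scan_source(records):
    """Return the field names (data type identifiers) available in a source."""
    fields = set()
    for record in records:
        fields.update(record)
    return sorted(fields)


def build_metadata(sources, threshold=0.8):
    """Map each confidently classified identifier to its categorization."""
    identity_graph = {}  # (source name, identifier) -> categorization
    queryable = {}       # source name -> identifiers usable in queries
    for source_name, records in sources.items():
        for field_name in scan_source(records):
            classification, confidence = classify_field(field_name)
            if confidence >= threshold:  # assumed meaning of "satisfies"
                identity_graph[(source_name, field_name)] = classification
                queryable.setdefault(source_name, []).append(field_name)
    return {"identity_graph": identity_graph, "queryable": queryable}
```

Fields whose classification falls below the threshold are left out of both the identity graph and the list of identifiers usable in querying.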

37. The method of claim 36, wherein identifying that the first data type identifier is associated with the first categorization comprises:

determining a classification for the first data type identifier and corresponding confidence score for the classification;

determining that the corresponding confidence score satisfies a threshold value; and

responsive to determining that the corresponding confidence score satisfies the threshold value, determining that the first categorization is associated with the first data type identifier based on the classification.

38. The method of claim 36, wherein the metadata comprises a first manner of accessing the first responsive data for the first data source and a second manner of accessing the second responsive data for the second data source, and querying the first data source is carried out using the first manner and querying the second data source is carried out using the second manner.

39. The method of claim 34, further comprising:

generating a graphical user interface for a browser application executed on a user device by configuring a display element to display at least a portion of the first responsive data and at least a portion of the second responsive data without the fourth piece of data as the target data; and

transmitting an instruction to the browser application causing the browser application to present the graphical user interface on the user device.

40. The method of claim 34, wherein the target data comprises personal data of the data subject.