SECURING VISUAL INFORMATION ON IMAGES FOR DOCUMENT CAPTURE
Techniques to provide secure access to data are disclosed. An indication that an operator is assigned to index a data value extracted from a document image is received. A snippet or other partial image showing just a portion of the document image that includes a text or other content image portion that corresponds to the data value extracted from the document image is displayed to the operator. The data value is included in a subset of data values extracted from the document image, and access to the subset of extracted data values is provided to the operator without providing access to one or more other portions of the document image associated with extracted data values not included in the subset.
This application is a continuation of co-pending U.S. patent application Ser. No. 13/720,654, entitled SECURING VISUAL INFORMATION ON IMAGES FOR DOCUMENT CAPTURE, filed Dec. 19, 2012, which is incorporated herein by reference for all purposes.
BACKGROUND OF THE INVENTION
In document capture, a paper document is scanned and may contain confidential information such as credit card numbers, taxpayer ID, etc. While it is possible for such data to be automatically extracted using optical character recognition, it is not always accurate and there may be a need for a human operator to validate the information against what is on the paper. If the operator has access to the document image, then the confidential information may be exposed, unless it is redacted.
Redaction requires additional processing and/or human work, and is prone to errors and omissions, e.g., due to information appearing in an unexpected place, such as handwritten in a margin, and/or due to information that should be protected from disclosure appearing in multiple places in a document and not being redacted in all places.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Performing data validation in a document capture context by selectively displaying to a given operator only those document portions that correspond to data entry form fields to be validated by that operator is disclosed. In document capture and initial data extraction, the location within the document image of the text or other content image that was processed to determine the extracted data value for a specific corresponding data entry form field is determined and recorded. The known location of each text field is used in data validation to display to a particular operator only the portion of the original document image that corresponds to that field, such as a snippet. The original document image (full page or pages), and any portions not to be validated by that operator, may be hidden from the operator. In some embodiments, only data entry form fields to be validated by an operator are made available to be displayed to that operator. In some embodiments, by hiding a data entry form field that the operator is prohibited from seeing, the corresponding snippet or other partial image of the original document image is also hidden.
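The recording of field locations and the display of per-field snippets described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the page image is modeled as a plain list of pixel rows, and the names `ExtractedField`, `crop_snippet`, and `snippets_for_operator`, as well as the field names and values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Image = List[List[int]]          # rows of pixel values (toy stand-in for a scan)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


@dataclass(frozen=True)
class ExtractedField:
    name: str    # data entry form field that the value populates
    value: str   # value produced by extraction (for example, OCR)
    box: Box     # location on the page image the value was read from


def crop_snippet(image: Image, box: Box) -> Image:
    """Copy only the pixels inside `box`; nothing outside it is returned."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def snippets_for_operator(
    image: Image, fields: Iterable[ExtractedField], assigned: Set[str]
) -> Dict[str, Image]:
    """Build snippets only for the fields assigned to one operator."""
    return {
        f.name: crop_snippet(image, f.box)
        for f in fields
        if f.name in assigned
    }


# Toy 4x6 "page" whose pixel at row r, column c is 10*r + c.
page = [[10 * r + c for c in range(6)] for r in range(4)]
fields = [
    ExtractedField("invoice_total", "42.00", (0, 0, 3, 2)),
    ExtractedField("card_number", "0000 0000 0000 0000", (3, 2, 6, 4)),
]

# The operator is assigned only the invoice total, so only that zone is cropped.
visible = snippets_for_operator(page, fields, {"invoice_total"})
print(visible)
```

In this sketch the full `page` never leaves `snippets_for_operator`; the operator-facing result contains only the cropped zones for assigned fields.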
In some embodiments, as an operator finishes validation of a field, indicated for example by pressing the “enter” key or selecting another key or on-screen control, the system automatically pans to the next data entry form field and retrieves and displays near that form field the corresponding document image snippet. In this way, the operator can navigate through the form and corresponding portions of the document image without retargeting, i.e., without having to redirect their eyes to a different point or points on the screen.
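The advance-on-confirm behavior can be sketched as a small state machine. This is an illustrative sketch under stated assumptions: `FormField`, `ValidationSession`, and `confirm` are hypothetical names, and the actual panning and snippet rendering are left to whatever user interface consumes the returned field.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FormField:
    name: str
    value: str
    validated: bool = False


@dataclass
class ValidationSession:
    """Tracks which assigned field the operator is currently validating."""
    fields: List[FormField]
    index: int = 0

    def current(self) -> Optional[FormField]:
        """Return the field in focus, or None once every field is done."""
        return self.fields[self.index] if self.index < len(self.fields) else None

    def confirm(self, corrected: Optional[str] = None) -> Optional[FormField]:
        """Handle an 'enter' press: accept or correct the value, then advance.

        Returns the next field to focus so the caller can display its snippet
        next to it, or None when no fields remain.
        """
        cur = self.current()
        if cur is None:
            return None
        if corrected is not None:
            cur.value = corrected
        cur.validated = True
        self.index += 1
        return self.current()


session = ValidationSession([
    FormField("invoice_number", "INV-001"),
    FormField("invoice_total", "42.00"),
])
nxt = session.confirm()          # accept the first value as extracted
done = session.confirm("45.00")  # correct the second value, then finish
```

Because `confirm` returns the next field, the caller can fetch and place that field's snippet in one step, which is what keeps the operator's focus in a single screen region.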
In various embodiments, the techniques disclosed above are applied to provide secure validation and/or entry of data in a document capture or other manual indexing context. Data validation and/or other manual indexing is performed, using techniques disclosed herein, by displaying to any given operator only those portions of a document image that the operator may need to view to perform tasks assigned to that operator.
As noted above, human redaction, even if fully effective, still exposes confidential information due to unforeseen circumstances and introduces a point of slow-down as documents are funneled through this small set of privileged operators for visual inspection. Potential disadvantages of automatic redaction include, without limitation: a) it requires an additional process step to manipulate the image and add redaction marks and manage the redacted images; b) for dynamic privilege-based scenarios, where different operators may be allowed to see different combinations of secured fields, redaction is impractical because there are too many combinations of redacted pages that may be required; and c) it is possible for redactions to be applied incorrectly. For semi-structured documents (for example, invoices that can be structurally different), or information that is not written in the expected area, it is possible for the sensitive data to be located outside of the redaction. In this way, an operator who has access to the full page image may still be able to see that data.
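The combinatorial concern in point b) can be made concrete with a short count. The sketch below assumes, as a worst case, that any combination of secured fields might be visible to some operator; the field names are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def redacted_variants(secured_fields: List[str]) -> List[Set[str]]:
    """List every distinct set of secured fields that could be left visible.

    Each set would correspond to one separately redacted copy of the page.
    """
    names = sorted(secured_fields)
    return [
        set(combo)
        for size in range(len(names) + 1)
        for combo in combinations(names, size)
    ]


secured = ["card_number", "date_of_birth", "salary", "taxpayer_id"]
variants = redacted_variants(secured)

# Worst case for redaction: one page copy per combination (2**n of them).
print(len(variants))
# Per-field snippets: at most one snippet per secured field (n of them).
print(len(secured))
```

Under this assumption the number of redacted page copies doubles with each additional secured field, while the number of per-field snippets grows by one.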
Using techniques disclosed herein, document capture users are able to perform document capture while disclosing potentially sensitive information to human operators on a “need to know” basis. In various embodiments, no part of the original page is exposed to any given operator, except for those zones required by the validation or other task assigned to that operator. To support scenarios where the original page cannot be legally sent to third party operators, the original page is also not sent to the client 212. In addition, no additional configuration is needed for redaction, and the document capture process does not need to create and manage redacted pages.
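One way to picture the “need to know” behavior is a server-side step that assembles what is sent to a client from only the fields an operator is permitted to see. The sketch below is an assumption-laden illustration: the role-based `POLICY` table, the `Operator` type, and the functions `permitted_fields` and `build_client_payload` are hypothetical, and snippets are represented by placeholder strings rather than image data.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Operator:
    name: str
    roles: FrozenSet[str]


# Hypothetical policy: field name -> roles allowed to view that field.
POLICY: Dict[str, FrozenSet[str]] = {
    "invoice_total": frozenset({"clerk", "auditor"}),
    "vendor_name": frozenset({"clerk", "auditor"}),
    "taxpayer_id": frozenset({"auditor"}),
}


def permitted_fields(operator: Operator, policy: Dict[str, FrozenSet[str]]) -> Set[str]:
    """Return the fields for which the operator holds at least one allowed role."""
    return {name for name, roles in policy.items() if operator.roles & roles}


def build_client_payload(
    operator: Operator,
    extracted: Dict[str, str],
    snippets: Dict[str, str],
    policy: Dict[str, FrozenSet[str]],
) -> Dict[str, Dict[str, str]]:
    """Assemble the data sent to the client for one operator.

    Only permitted fields and their snippets are included; the full page
    image is never an input to this function, so it cannot be sent.
    """
    allowed = permitted_fields(operator, policy)
    return {
        name: {"value": extracted[name], "snippet": snippets[name]}
        for name in sorted(allowed)
        if name in extracted and name in snippets
    }


extracted = {
    "invoice_total": "42.00",
    "vendor_name": "Acme",
    "taxpayer_id": "00-0000000",
}
snippets = {name: f"snippet:{name}" for name in extracted}

clerk = Operator("op1", frozenset({"clerk"}))
payload = build_client_payload(clerk, extracted, snippets, POLICY)
print(sorted(payload))
```

Keeping the filtering on the server side means a client never holds data for fields outside the operator's assignment, regardless of how the client renders the form.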
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
Claims
1. A method of providing secure access to data extracted from a document image, comprising:
- classifying the document image according to document type;
- creating an instance of a type-specific data entry form corresponding to the document image;
- extracting data from the document image;
- populating the instance of the type-specific data entry form corresponding to the document image using the data extracted from the document image;
- receiving an indication that an operator is assigned to index a data value extracted from a document image; and
- displaying to the operator a subset of the instance of the type-specific data entry form corresponding to the document image so as to show just a portion of the document image that includes a text or other content image portion that corresponds to the data value extracted from the document image.
2. The method of claim 1, further comprising not providing access to the entire document image.
3. The method of claim 1, further comprising determining in the course of document capture and storing for each data value extracted a corresponding location in the document image relating to the subset of the instance of the type-specific data entry form corresponding to the document image that was used to extract that data value.
4. The method of claim 3, further comprising storing the instance of the type-specific data entry form corresponding to the document image.
5. The method of claim 1, further comprising determining the subset of extracted data values to be displayed to the operator.
6. The method of claim 5, wherein the operator comprises a first operator, the subset of extracted data values comprises a first subset of extracted data values, and further comprising assigning to a second operator a second subset of the extracted data values.
7. The method of claim 5, wherein determining the subset of extracted data values to be displayed to the operator includes determining a permitted combination of extracted data values that can be displayed to the operator.
8. The method of claim 7, wherein the determination is based at least in part on one or more of an identity, a role, a group membership, a level of trust, and another attribute associated with the operator.
9. The method of claim 5, wherein determining the subset of extracted data values to be displayed to the operator includes applying one or more of a rule, a policy, and another definition.
10. The method of claim 1, further comprising determining the subset of the instance of the type-specific data entry form corresponding to the document image to show to the operator according to a degree of confidence with which the data value has been determined based on a corresponding portion of the document image.
11. A system to provide secure access to data extracted from a document image, comprising:
- a display device; and
- a processor coupled to the display and configured to: classify the document image according to document type; create an instance of a type-specific data entry form corresponding to the document image; extract data from the document image; populate the instance of the type-specific data entry form corresponding to the document image using the data extracted from the document image; receive an indication that an operator is assigned to index a data value extracted from a document image; and display to the operator via the display device a subset of the instance of the type-specific data entry form corresponding to the document image so as to show just a portion of the document image that includes a text or other content image portion that corresponds to the data value extracted from the document image.
12. The system of claim 11, wherein the processor is further configured to determine and store for each data value extracted from the document image a corresponding location in the document image of the data associated with the subset of the instance of the type-specific data entry form corresponding to the document image that was used to extract that data value.
13. The system of claim 12, wherein the processor is further configured to store the instance of the type-specific data entry form corresponding to the document image.
14. The system of claim 11, wherein the processor is further configured to determine the subset of extracted data values to be displayed to the operator.
15. The system of claim 14, wherein the operator comprises a first operator, the subset of extracted data values comprises a first subset of extracted data values, and the processor is further configured to assign to a second operator a second subset of the extracted data values.
16. The system of claim 14, wherein determining the subset of extracted data values to be displayed to the operator includes determining a permitted combination of extracted data values that can be displayed to the operator.
17. The system of claim 16, wherein the determination is based at least in part on one or more of an identity, a role, a group membership, a level of trust, and another attribute associated with the operator.
18. The system of claim 14, wherein determining the subset of extracted data values to be displayed to the operator includes applying one or more of a rule, a policy, and another definition.
19. The system of claim 11, wherein the processor is configured to provide to the operator access to the subset of extracted data values and, for each, a subset of the instance of the type-specific data entry form corresponding to the document image without providing access to one or more other portions of the document image associated with extracted data values not included in the subset, at least in part by assigning to the operator a task set that includes validation of just the extracted data values included in the subset, and wherein the processor is further configured to provide to any given operator access only to the subset of the instance of the type-specific data entry form corresponding to the document image associated with extracted data values assigned to that operator to be validated.
20. A computer program product to provide secure access to data extracted from a document image, the computer program product being embodied in a tangible computer readable storage medium and comprising computer instructions for:
- classifying the document image according to document type;
- creating an instance of a type-specific data entry form corresponding to the document image;
- extracting data from the document image;
- populating the instance of the type-specific data entry form corresponding to the document image using the data extracted from the document image;
- receiving an indication that an operator is assigned to index a data value extracted from a document image; and
- displaying to the operator a subset of the instance of the type-specific data entry form corresponding to the document image so as to show just a portion of the document image that includes a text or other content image portion that corresponds to the data value extracted from the document image.
Type: Application
Filed: Apr 13, 2015
Publication Date: Jul 30, 2015
Patent Grant number: 9471800
Inventor: Ming Fung Ho (Fremont, CA)
Application Number: 14/685,280