AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR AUTOMATED ASSESSMENT OF VOUCHING EVIDENCE

Systems and methods for determining whether an electronic document constitutes vouching evidence are provided. The system may receive ERP item data and generate hypothesis data based thereon, and may receive electronic document data and extract ERP information therefrom. The system may then apply one or more models to compare the hypothesis data to the extracted ERP information to determine whether the electronic document constitutes vouching evidence for the ERP item. Systems and methods for verifying an assertion against a source document are also provided. The system may receive first data indicating an unverified assertion and second data comprising a plurality of source documents. The system may apply one or more extraction models to extract a set of key data from the plurality of source documents and may apply one or more matching models to compare the first data to the set of key data to determine whether vouching criteria are met.

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/217,119 filed Jun. 30, 2021; U.S. Provisional Application No. 63/217,123 filed Jun. 30, 2021; U.S. Provisional Application No. 63/217,127 filed Jun. 30, 2021; U.S. Provisional Application No. 63/217,131 filed Jun. 30, 2021; and U.S. Provisional Application No. 63/217,134, filed Jun. 30, 2021, the entire contents of each of which are incorporated herein by reference.

FIELD

This relates generally to automated data processing and validation of data, and more specifically to AI-augmented auditing platforms including techniques for assessment of vouching evidence.

BACKGROUND

When performing audits, or when otherwise ingesting, reviewing, and analyzing documents or other data, there is often a need to establish that one or more statements, assertions, or other representations of fact are sufficiently substantiated by documentary evidence. In the context of performing audits, establishing that one or more statements (e.g., a financial statement line item (FSLI)) is sufficiently supported by documentary evidence is referred to as vouching.

SUMMARY

When performing audits, or when otherwise ingesting, reviewing, and analyzing documents or other data, there is often a need to establish that one or more statements, assertions, or other representations of fact are sufficiently substantiated by documentary evidence. In the context of performing audits, establishing that one or more statements (e.g., a financial statement line item (FSLI)) is sufficiently supported by documentary evidence is referred to as vouching.

In automated auditing systems that seek to ingest and understand documentary evidence in order to vouch for one or more statements (e.g., FSLI's), known document-understanding techniques are sensitive to the structure of the documents that are ingested and analyzed. Accordingly, known document-understanding techniques may fail to correctly recognize and identify certain entities referenced in documents, due for example to a misinterpretation of the structure or layout of one or more ingested documents. Accordingly, there is a need for improved document-understanding (e.g., document ingestion and analysis) techniques that are more robust to various document structures and layouts and that provide higher accuracy for entity recognition in documents. There is a need for such improved document-understanding techniques configured to be able to be applied in automated auditing systems in order to determine whether one or more documents constitutes sufficient vouching evidence to substantiate one or more assertions (e.g., FSLI's).

Disclosed herein are improved document-understanding techniques that may address one or more of the above-identified needs. In some embodiments, as explained herein, the document-understanding techniques disclosed herein may leverage a priori knowledge (e.g., information available from a data source separate from the document(s) being assessed for sufficiency for vouching purposes) of one or more entities in extracting and/or analyzing information from one or more documents. In some embodiments, the document-understanding techniques may analyze the spatial configuration of words, paragraphs, or other content in a document in extracting and/or analyzing information from one or more documents.

Furthermore, pursuant to the need to perform automated vouching, there is a need for improved systems and methods for vouching ERP entries against bank statement data in order to verify payment.

In some embodiments, a system is configured to vouch payment data against evidence data. More specifically, a system may be configured to provide a framework that vouches ERP payment activities against physical bank statements. The system may include a pipeline that performs information extraction and characteristics extraction from bank statements, and the system may leverage one or more advanced data structures and matching algorithms to perform one-to-many matching between ERP data and bank statement data. The payment vouching systems provided herein may thus automate the process of finding material evidence, such as remittance advice or bank statements, to corroborate ERP payment entries.
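For illustration, the one-to-many matching described above may be sketched as follows. This is a minimal sketch under simplifying assumptions: record shapes, field names, and the amount-based matching strategy are illustrative choices, not part of the disclosure.

```python
from collections import defaultdict
from itertools import combinations


def match_erp_to_statement(erp_entries, statement_lines, tolerance=0.01):
    """One-to-many matching: one ERP payment entry may be settled by
    several bank-statement lines (e.g., partial payments).

    erp_entries: list of dicts with 'id' and 'amount' keys.
    statement_lines: list of dicts with 'id' and 'amount' keys.
    All names here are illustrative, not taken from the disclosure.
    """
    # Index statement lines by rounded amount for quick exact lookups.
    by_amount = defaultdict(list)
    for line in statement_lines:
        by_amount[round(line["amount"], 2)].append(line)

    matches = {}
    for entry in erp_entries:
        target = round(entry["amount"], 2)
        # 1. Try a direct one-to-one match first.
        if by_amount.get(target):
            matches[entry["id"]] = [by_amount[target][0]["id"]]
            continue
        # 2. Fall back to one-to-many: any pair of statement lines
        #    whose amounts sum to the ERP entry's amount.
        for a, b in combinations(statement_lines, 2):
            if abs(a["amount"] + b["amount"] - entry["amount"]) <= tolerance:
                matches[entry["id"]] = [a["id"], b["id"]]
                break
    return matches
```

A production pipeline would extend the fallback beyond pairs and would match on dates and counterparties as well as amounts; the sketch only shows the one-to-many shape of the problem.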

In some embodiments, a first system is provided, the first system being for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, the first system comprising one or more processors configured to cause the first system to: receive data representing an ERP item; generate hypothesis data based on the received data representing an ERP item; receive an electronic document; extract ERP information from the document; and apply one or more models to the hypothesis data and to the extracted ERP information in order to generate output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

In some embodiments of the first system, extracting an instance of ERP information comprises generating first data representing information content of the instance of ERP information and second data representing a document location for the instance of ERP information.

In some embodiments of the first system, the ERP information comprises one or more of: a purchase order number, a customer name, a date, a delivery term, a shipping term, a unit price, and a quantity.

In some embodiments of the first system, applying the one or more models to generate output data is based on preexisting information regarding spatial relationships amongst instances of ERP information in documents.

In some embodiments of the first system, the preexisting information comprises a graph representing spatial relationships amongst instances of ERP information in documents.

In some embodiments of the first system, the one or more processors are configured to cause the system to augment the hypothesis data based on one or more models representing contextual data.

In some embodiments of the first system, the contextual data comprises information regarding one or more synonyms for the information content of the instance of ERP information.

In some embodiments of the first system, the instance of ERP information comprises a single word in the document.

In some embodiments of the first system, the instance of ERP information comprises a plurality of words in the document.

In some embodiments of the first system, the one or more processors are configured to determine whether the ERP information vouches for the ERP item.

In some embodiments of the first system, determining whether the ERP information vouches for the ERP item comprises generating and evaluating a similarity score representing a comparison of the ERP information and the ERP item.

In some embodiments of the first system, the similarity score is generated by comparing an entity graph associated with the ERP information to an entity graph associated with the ERP item.

In some embodiments of the first system, extracting the ERP information from the document comprises applying a fingerprinting operation to determine, based on the received data representing an ERP item, a characteristic of a data extraction operation to be applied to the electronic document.

In some embodiments, a first non-transitory computer-readable storage medium is provided, the first non-transitory computer-readable storage medium storing instructions for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, the instructions configured to be executed by a system comprising one or more processors to cause the system to: receive data representing an ERP item; generate hypothesis data based on the received data representing an ERP item; receive an electronic document; extract ERP information from the document; and apply one or more models to the hypothesis data and to the extracted ERP information in order to generate output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

In some embodiments, a first method is provided, the first method being for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, wherein the first method is performed by a system comprising one or more processors, the first method comprising: receiving data representing an ERP item; generating hypothesis data based on the received data representing an ERP item; receiving an electronic document; extracting ERP information from the document; and applying one or more models to the hypothesis data and to the extracted ERP information in order to generate output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

In some embodiments, a second system is provided, the second system being for verifying an assertion against a source document, the second system comprising one or more processors configured to cause the second system to: receive first data indicating an unverified assertion; receive second data comprising a plurality of source documents; apply one or more extraction models to extract a set of key data from the plurality of source documents; and apply one or more matching models to compare the first data to the set of key data to generate an output indicating whether one or more of the plurality of source documents satisfies one or more verification criteria for verifying the unverified assertion.

In some embodiments of the second system, the one or more extraction models comprise one or more machine learning models.

In some embodiments of the second system, the one or more matching models comprise one or more approximation models.

In some embodiments of the second system, the one or more matching models are configured to perform one-to-many matching between the first data and the set of key data.

In some embodiments of the second system, the one or more processors are configured to cause the system to modify one or more of the extraction models without modification of one or more of the matching models.

In some embodiments of the second system, the one or more processors are configured to cause the system to modify one or more of the matching models without modification of one or more of the extraction models.

In some embodiments of the second system, the unverified assertion comprises an ERP payment entry.

In some embodiments of the second system, the plurality of source documents comprises a bank statement.

In some embodiments of the second system, applying one or more matching models comprises generating a match score and generating a confidence score.

In some embodiments of the second system, applying one or more matching models comprises: applying a first matching model; if a match is indicated by the first matching model, generating a match score and a confidence score based on the first matching model; and if a match is not indicated by the first matching model: applying a second matching model; if a match is indicated by the second matching model, generating a match score and a confidence score based on the second matching model; and if a match is not indicated by the second matching model, generating a match score of 0.
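The cascaded matching logic described in this embodiment may be sketched as follows. The callable model interface and the example score values are assumptions made for illustration only.

```python
def cascade_match(assertion, key_data, first_model, second_model):
    """Cascaded matching: try the first model; only if it finds no
    match, fall back to the second; if neither matches, return a
    match score of 0. Each model is assumed (illustratively) to be a
    callable returning a (matched, match_score, confidence) tuple."""
    matched, score, confidence = first_model(assertion, key_data)
    if matched:
        return score, confidence
    matched, score, confidence = second_model(assertion, key_data)
    if matched:
        return score, confidence
    return 0.0, 0.0


# Illustrative models: an exact matcher, then a looser case-insensitive one.
exact_model = lambda a, k: (a in k, 1.0, 1.0)
fuzzy_model = lambda a, k: (any(a.lower() == x.lower() for x in k), 0.8, 0.6)
```

The cascade keeps the cheap, high-precision model in front, so the looser (and typically slower or less certain) model only runs on the residue.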

In some embodiments, a second non-transitory computer-readable storage medium is provided, the second non-transitory computer-readable storage medium storing instructions for verifying an assertion against a source document, the instructions configured to be executed by a system comprising one or more processors to cause the system to: receive first data indicating an unverified assertion; receive second data comprising a plurality of source documents; apply one or more extraction models to extract a set of key data from the plurality of source documents; and apply one or more matching models to compare the first data to the set of key data to generate an output indicating whether one or more of the plurality of source documents satisfies one or more verification criteria for verifying the unverified assertion.

In some embodiments, a second method is provided, the second method being for verifying an assertion against a source document, wherein the second method is executed by a system comprising one or more processors, the second method comprising: receiving first data indicating an unverified assertion; receiving second data comprising a plurality of source documents; applying one or more extraction models to extract a set of key data from the plurality of source documents; and applying one or more matching models to compare the first data to the set of key data to generate an output indicating whether one or more of the plurality of source documents satisfies one or more verification criteria for verifying the unverified assertion.

In some embodiments, a third system, for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, is provided, the third system comprising one or more processors configured to cause the third system to: receive data representing an ERP item; generate hypothesis data based on the received data representing an ERP item; receive an electronic document; extract ERP information from the document; apply a first set of one or more models to the hypothesis data and to the extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; apply a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and generate combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

In some embodiments, a third non-transitory computer-readable storage medium is provided, the third non-transitory computer-readable storage medium storing instructions for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, the instructions configured to be executed by a system comprising one or more processors to cause the system to: receive data representing an ERP item; generate hypothesis data based on the received data representing an ERP item; receive an electronic document; extract ERP information from the document; apply a first set of one or more models to the hypothesis data and to the extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; apply a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and generate combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

In some embodiments, a third method, for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, is provided, wherein the third method is performed by a system comprising one or more processors, the third method comprising: receiving data representing an ERP item; generating hypothesis data based on the received data representing an ERP item; receiving an electronic document; extracting ERP information from the document; applying a first set of one or more models to the hypothesis data and to the extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; applying a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and generating combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

In some embodiments, any one or more of the features, characteristics, or aspects of any one or more of the above systems, methods, or non-transitory computer-readable storage media may be combined, in whole or in part, with one another and/or with any one or more of the features, characteristics, or aspects (in whole or in part) of any other embodiment or disclosure herein.

BRIEF DESCRIPTION OF THE FIGURES

Various embodiments are described with reference to the accompanying figures, in which:

FIG. 1 shows two examples of extracting entities from documents, in accordance with some embodiments.

FIG. 2 shows a system for data processing for an AI-augmented auditing platform, in accordance with some embodiments.

FIGS. 3A-3B depict a diagram of how a fingerprinting algorithm may be used as part of a process to render a decision about whether a purchase order is vouched, in accordance with some embodiments.

FIG. 4 shows a diagram of fingerprinting, document-understanding, and vouching algorithms, in accordance with some embodiments.

FIGS. 5A-5B show a diagram of a payment vouching method, in accordance with some embodiments.

FIG. 6 illustrates an example of a computer, according to some embodiments.

DETAILED DESCRIPTION

Active Document Comprehension for Assurance

When performing audits, or when otherwise ingesting, reviewing, and analyzing documents or other data, there is often a need to establish that one or more statements, assertions, or other representations of fact are sufficiently substantiated by documentary evidence. In the context of performing audits, establishing that one or more statements (e.g., a financial statement line item (FSLI)) is sufficiently supported by documentary evidence is referred to as vouching.

In automated auditing systems that seek to ingest and understand documentary evidence in order to vouch for one or more statements (e.g., FSLI's), known document-understanding techniques are sensitive to the structure of the documents that are ingested and analyzed. Accordingly, known document-understanding techniques may fail to correctly recognize and identify certain entities referenced in documents, due for example to a misinterpretation of the structure or layout of one or more ingested documents. Accordingly, there is a need for improved document-understanding (e.g., document ingestion and analysis) techniques that are more robust to various document structures and layouts and that provide higher accuracy for entity recognition in documents. There is a need for such improved document-understanding techniques configured to be able to be applied in automated auditing systems in order to determine whether one or more documents constitutes sufficient vouching evidence to substantiate one or more assertions (e.g., FSLI's).

Disclosed herein are improved document-understanding techniques that may address one or more of the above-identified needs. In some embodiments, as explained herein, the document-understanding techniques disclosed herein may leverage a priori knowledge (e.g., information available from a data source separate from the document(s) being assessed for sufficiency for vouching purposes) of one or more entities in extracting and/or analyzing information from one or more documents. In some embodiments, the document-understanding techniques may analyze the spatial configuration of words, paragraphs, or other content in a document in extracting and/or analyzing information from one or more documents.

In some embodiments, a document-understanding system is configured to perform automated hypothesis generation based on one or more data sets. The data sets on which hypothesis generation is based may include one or more sets of ingested documents, for example documents ingested in accordance with one or more document-understanding techniques described herein. In some embodiments, the data sets on which hypothesis generation is based may include enterprise resource planning (ERP) data. In some embodiments, the data (e.g., ERP data) may indicate one or more entities, for example a PO #, a customer name, a date, a delivery term, a shipping term, a unit price, and/or a quantity. The system may be configured to apply a priori knowledge (e.g., information available from a data source separate from the document(s) being assessed for sufficiency for vouching purposes) regarding one or more of the entities indicated in the data. The hypothesis generation techniques disclosed herein may enable more accurate vouching of ERP data with evidence from unstructured documents and other evidence sources.

The system may be configured to analyze spatial relationships and constellations among entities indicated in the data. For example, the position at which entities are indicated in a document (e.g., a unit price and a quantity indicated on the same line of a document versus on different lines of a document) may be analyzed. In some embodiments, the system may be configured to generate, store, and/or analyze a data structure, such as a graph data structure, that represents spatial relationships amongst a plurality of entities in one or more documents.
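One simple realization of such a graph data structure is an adjacency list keyed by entity label. The tuple shape, edge label, and same-line criterion below are illustrative assumptions, not the disclosure's actual structure.

```python
def build_spatial_graph(entities):
    """Build an adjacency-list graph of spatial relationships among
    extracted entities. Each entity is a (label, line, column) tuple,
    an illustrative shape chosen for this sketch. Two entities that
    appear on the same line receive a 'same_line' edge, capturing
    constellations such as unit price and quantity sharing a line."""
    graph = {label: [] for label, _, _ in entities}
    for label_a, line_a, _ in entities:
        for label_b, line_b, _ in entities:
            if label_a != label_b and line_a == line_b:
                graph[label_a].append(("same_line", label_b))
    return graph
```

Richer variants would add edge types for vertical alignment, proximity, and reading order, which is where a graph representation pays off over a flat list of positions.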

The system may be configured to apply one or more AI models to comprehend documents to identify and assess evidence to vouch for the validity of financial information reported in ERPs. The system may use the ERP data to weakly label documents that are candidates for possible evidence and to provide hypotheses for them. The system may further apply one or more named entity extraction models to provide additional bias-free information to overlay on top of these documents. The combination of these features may enable the system to validate whether candidate evidence is indeed vouching evidence (e.g., whether it meets vouching criteria) for a given ERP entry, including by providing a quantification/score of the system's confidence in the conclusion that the candidate evidence does or does not constitute vouching evidence.

In some embodiments, the system may be configured to receive ERP data and to apply one or more data processing operations (e.g., AI models) to the received data in order to generate hypothesis data. (Any data processing operation referenced herein may include application of one or more models trained by machine-learning.) The hypothesis data may consist of one or more content entities that the system hypothesizes to be indicated in the received data, for example: PO #, customer name, date, delivery term, shipping term, unit price, and/or quantity. The system may assess one or more of the following in generating hypothesis data and/or in assessing hypothesis data once it is generated: a priori knowledge (e.g., knowledge from one or more data sources aside from the ERP data source); spatial relationships amongst words, paragraphs, or other indications of entities within the ERP data (e.g., spatial relationships of words within a document); and/or constellations amongst entities (e.g., unit price and quantity appearing on the same line).

Following hypothesis generation, the system may apply one or more data processing operations (e.g., AI models) in order to augment one or more of the generated hypotheses. In some embodiments, the system may augment (or otherwise modify) a generated hypothesis on the basis of context data available to the system. In some embodiments, context data may include synonym data, such that the system may augment a hypothesis in accordance with synonym data. For example, hypothesis data that includes the word “IBM” may be augmented to additionally include the term “International Business Machines”.
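The synonym-based augmentation described above may be sketched as follows. The synonym-table structure and field names are illustrative assumptions; the disclosure does not prescribe a particular representation of context data.

```python
def augment_hypotheses(hypotheses, synonym_table):
    """Expand each hypothesized value with known synonyms, as in the
    'IBM' -> 'International Business Machines' example above.

    hypotheses: dict mapping an entity type to a list of hypothesized
        values (an illustrative structure).
    synonym_table: dict mapping a term to its known variants (also
        illustrative; in practice this could come from master data or
        an ontology source).
    """
    augmented = {}
    for entity, values in hypotheses.items():
        expanded = set(values)
        for value in values:
            expanded.update(synonym_table.get(value, []))
        augmented[entity] = sorted(expanded)
    return augmented
```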

The system may be configured to perform spatial entity extraction. In some embodiments, spatial entity extraction includes extracting entities (at the word-level and at the multi-word level) from a document to generate information regarding (a) the entity content/identity and (b) information regarding a spatial location of the entity (e.g., an absolute spatial location within a document and/or a spatial location/proximity/alignment/orientation with respect to one or more other entities within the document).
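The dual output of spatial entity extraction, content plus location, may be sketched with a record like the following. The field names, bounding-box convention, and alignment tolerance are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExtractedEntity:
    """One spatially extracted entity: its content/identity plus its
    location within the document. Field names are illustrative."""
    text: str          # the entity content (a word or multi-word span)
    entity_type: str   # e.g., "po_number", "unit_price"
    page: int          # absolute location: page index
    bbox: tuple        # (x0, y0, x1, y1) bounding box on the page


def same_line(a: ExtractedEntity, b: ExtractedEntity, tol: float = 2.0) -> bool:
    """Relative location: treat two entities as horizontally aligned if
    their vertical extents start within `tol` units on the same page."""
    return a.page == b.page and abs(a.bbox[1] - b.bbox[1]) <= tol
```

Predicates like `same_line` are the building blocks for the relative-location information (proximity, alignment, orientation) mentioned above.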

The system may be configured to perform one or more hypothesis testing operations in order to evaluate the likelihood of a match, for example based on calculating a similarity score. The likelihood of a match may be evaluated between ERP data on one hand and a plurality of documents on the other hand. In some embodiments, the likelihood of a match may be based on calculating a similarity score between the entity (or entities) representing the hypothesis and the entity (or entity graph) representing components within the documents.
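As one concrete stand-in for the similarity score described above, the Jaccard overlap between the hypothesized entity set and the entity set extracted from a document could be used. The disclosure does not fix a particular similarity measure or threshold; both are illustrative here.

```python
def jaccard_similarity(hypothesis_entities, document_entities):
    """Score a hypothesis against a document as the Jaccard overlap of
    their entity sets: |intersection| / |union|. An illustrative
    measure, not the disclosure's prescribed one."""
    h, d = set(hypothesis_entities), set(document_entities)
    if not h and not d:
        return 1.0
    return len(h & d) / len(h | d)


def is_match(hypothesis_entities, document_entities, threshold=0.8):
    """Evaluate the likelihood of a match against an assumed threshold."""
    return jaccard_similarity(hypothesis_entities, document_entities) >= threshold
```

Graph-based variants would compare entity graphs (nodes plus spatial edges) rather than bare sets, trading simplicity for sensitivity to layout.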

The systems and methods provided herein may provide improvements over existing approaches, including by providing the ability to use contextual information guided by an audit process to aid in comprehension, to use contextual information to form hypotheses on the expected information to be extracted from documents, to allow the testing of these hypotheses to guide document comprehension, and/or to apply methods to mitigate and account for the possibility of biases introduced by contextual information (e.g., by adjusting a confidence score accordingly).

FIG. 1 depicts two examples of extracting entities from documents, in accordance with some embodiments.

FIG. 2 depicts a system 200 for data processing for an AI-augmented auditing platform, in accordance with some embodiments. The components labeled “hypothesis generation” and “active vouching” may, in some embodiments, include any one or more of the systems (and/or may apply any one or more of the methods) described herein.

In some embodiments each of the schematic blocks shown in FIG. 2 may represent a distinct module (e.g., each distinct module comprising one or more distinct computer systems including storage devices and/or one or more physical and/or virtual processors) configured to perform associated functionality. In some embodiments, any one or more of the schematic blocks shown in FIG. 2 may represent functionalities performed by a same module (e.g., by a same computer system).

As described below, system 200 may be configured to perform any one or more processes for active vouching; passive vouching and tracing; and/or data integrity integration, for example as described herein.

As shown in FIG. 2, system 200 may include documents source 202, which may include any one or more computer storage devices such as databases, data stores, data repositories, live data feeds, or the like. Documents source 202 may be communicatively coupled to one or more other components of system 200 and configured to provide a plurality of documents to system 200, such that the documents can be assessed to determine whether one or more data integrity criteria are met, e.g., whether the documents sufficiently vouch for one or more representations made by a set of ERP data. In some embodiments, system 200 may receive documents from documents source 202 on a scheduled basis, in response to a user input, in response to one or more trigger conditions being met, and/or in response to the documents being manually sent. Documents received from documents source 202 may be provided in any suitable electronic data format, for example as structured, unstructured, and/or semi-structured data. The documents may include, for example, spreadsheets, word processing documents, and/or PDFs.

System 200 may include OCR module 204, which may include any one or more processors configured to perform OCR analysis and/or any other text or character recognition/extraction based on documents received from documents source 202. OCR module 204 may generate data representing characters recognized in the received documents.

System 200 may include document classification module 206, which may include one or more processors configured to perform document classification of documents received from documents source 202 and/or from OCR module 204. Document classification module 206 may receive document data from documents source 202 and/or may receive data representing characters in documents from OCR module 204, and may apply one or more classification algorithms to the received data to apply one or more classifications to the documents received from documents source 202. Data representing the determined classifications may be stored as metadata in association with the documents themselves and/or may be used to store the documents in a manner according to their determined respective classification(s).

System 200 may include ERP data source 208, which may include any one or more computer storage devices such as databases, data stores, data repositories, live data feeds, or the like. ERP data source 208 may be communicatively coupled to one or more other components of system 200 and configured to provide ERP data to system 200, such that the ERP data can be assessed to determine whether one or more data integrity criteria are met, e.g., whether the ERP data is sufficiently vouched by one or more documents (e.g., the documents provided by documents source 202). In some embodiments, one or more components of system 200 may receive ERP data from ERP data source 208 on a scheduled basis, in response to a user input, in response to one or more trigger conditions being met, and/or in response to the data being manually sent. ERP data received from ERP data source 208 may be provided in any suitable electronic data format. In some embodiments, ERP data may be provided in a tabular data format, including a data model that defines the structure of the data.

System 200 may include knowledge substrate 210, which may include any one or more data sources such as master data source 210a, ontology data source 210b, and exogenous knowledge data source 210c. The data sources included in knowledge substrate 210 may be provided as part of a single computer system, multiple computer systems, a single network, or multiple networks. The data sources included in knowledge substrate 210 may be configured to provide data to one or more components of system 200 (e.g., hypothesis generation module 212, normalization and contextualization module 222, and/or passive vouching and tracing module 224). In some embodiments, one or more components of system 200 may receive data from knowledge substrate 210 on a scheduled basis, in response to a user input, in response to one or more trigger conditions being met, and/or in response to the data being manually sent. Data received from knowledge substrate 210 may be provided in any suitable data format.

In some embodiments, interaction with knowledge substrate 210 may be query based. Interaction with knowledge substrate 210 may be in one or more of the following forms: question answering, information retrieval, query into knowledge graph engine, and/or inferencing engine (e.g., against inferencing rules).

Knowledge substrate 210 may include data such as ontology/taxonomy data, knowledge graph data, and/or inferencing rules data. Master data received from master data source 210a may include, for example, master customer data, master vendor data, and/or master product data. Ontology data received from ontology data source 210b may include, for example, IncoTerms data for international commercial terms that define the cost, liability, and/or insurance among the sell side, buy side, and shipper for shipping a product. Exogenous knowledge data received from exogenous knowledge data source 210c may include, for example, knowledge external to a specific audit client. This knowledge could relate to the industry of the client, the geographic area of the client, and/or the entire economy.

System 200 may include hypothesis generation module 212, which may include one or more processors configured to generate hypothesis data. Hypothesis generation module 212 may receive input data from any one or more of: (a) document classification module 206, (b) ERP data source 208, and (c) knowledge substrate 210. Hypothesis generation module 212 may apply one or more hypothesis generation algorithms to some or all of the received data and may thereby generate hypothesis data. Hypothesis generation may be based on any one of, and/or a combination of: (1) ERP data, (2) document type data, and/or (3) data regarding prior understanding of one or more documents. A generated hypothesis may represent where and what is expected to be found in documents data, based on previous exposure to similar documents. Document classification data (e.g., from document classification module 206), for one document and/or for a group of documents, may be used to determine, augment, and/or weight hypothesis data generated by hypothesis generation module 212. In some embodiments, document content itself (e.g., document data received from documents source 202), as distinct from document classification data (e.g., as generated by document classification module 206), may not be used for hypothesis generation. In some embodiments, document content itself may be used, in addition to document classification data, for hypothesis generation. The hypothesis data generated by hypothesis generation module 212 may be provided in any suitable data format. In some embodiments, hypothesis data in the context of document understanding may be represented as sets of tuples (e.g., representing entity, location, and value), each of which represents what is expected to be found in the documents data.
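
The tuple-based hypothesis representation described above can be sketched as follows. This is an illustrative sketch only; the field names, document types, and ERP keys (`po_number`, `total_amount`, `lines`) are assumptions, not the actual data model of the disclosed system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    entity: str      # e.g., "po_number"
    location: str    # expected region of the document, e.g., "header"
    value: str       # expected value drawn from the ERP item

def generate_hypotheses(erp_item: dict, doc_type: str) -> list:
    """Generate hypotheses for a purchase order from an ERP item.

    A real implementation would also draw on document classification
    data and the knowledge substrate; this sketch uses ERP data only.
    """
    hypotheses = []
    if doc_type == "purchase_order":
        hypotheses.append(Hypothesis("po_number", "header", erp_item["po_number"]))
        hypotheses.append(Hypothesis("total_amount", "footer", erp_item["total_amount"]))
        for line in erp_item.get("lines", []):
            hypotheses.append(Hypothesis("item_number", "line", line["item_number"]))
    return hypotheses

erp_item = {"po_number": "PBC2145XC01", "total_amount": "1,250.00",
            "lines": [{"item_number": "A-100"}, {"item_number": "B-200"}]}
hyps = generate_hypotheses(erp_item, "purchase_order")
```

Each tuple states what is expected to be found and roughly where, which downstream modules can then test against extracted document content.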

As shown in FIG. 2, system 200 may provide for an “active vouching” pipeline and for a “passive vouching” pipeline that may each be applied, using some or all of the same underlying data, in parallel to one another. The two pipelines may be applied at the same time or one after the other. Below, the active vouching pipeline is described with respect to element 214, while the passive vouching pipeline is described with respect to elements 216-224.

System 200 may include active vouching module 214, which may include one or more processors configured to apply any one or more active vouching analysis operations. Active vouching module 214 may receive input data from one or more of: OCR module 204, document classification module 206, and hypothesis generation module 212. Active vouching module 214 may apply one or more active vouching analysis operations to some or all of the received data and may thereby generate active vouching output data. In some embodiments, an active vouching analysis operation may include a "fingerprinting" analysis operation. In some embodiments, active vouching or fingerprinting may include data processing operations configured to determine whether there exist one or more tuples (e.g., representing entity, location, and value) extracted from documents data that can match hypothesis data. Some embodiments of a fingerprinting analysis operation are described below with respect to FIGS. 3 and 4. In some embodiments, the active vouching output data generated by active vouching module 214 may be provided in any suitable data format. In some embodiments, the active vouching output may include data indicating one or more of the following: a confidence score indicating a confidence level as to whether there is a match (e.g., whether vouching criteria are met, whether there is a match for a hypothesis); a binary indication as to whether there is any match for a hypothesis, which may feed back iteratively into the fingerprinting process; and/or a location within a document corresponding to a hypothesis for which a confidence and/or a binary indication are generated. In some embodiments, the active vouching output may include four values: an entity name, an entity value, a location (indicating an exact or relative location of the entity), and a confidence value indicating a confidence level of the determined match.

In some embodiments, the active vouching operations performed by module 214 may leverage contextual knowledge to inform what information is sought in an underlying document. In some embodiments, the active vouching operations performed by module 214 may be considered “context aware” because they are able to draw on contextual information that is injected via hypothesis generation module 212 drawing on data received from knowledge substrate 210.

In some embodiments, the active vouching operations may include one or more deductive reasoning operations, which may include application of one or more rules-based approaches to evaluate document information (e.g., information received from OCR module 204). For example, a rules based approach may be used to determine that, if a document is a certain document type, then the document will be known to include certain associated data fields. In some embodiments, the deductive reasoning operation(s) may be used to calculate and/or adjust an overall weighting. In some embodiments, weighting may be used in integrating results from multiple approaches (e.g., an inductive approach and a deductive approach). A weighting may be trained using various machine learning methods.
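
A rules-based deductive check of the kind described above, together with a weighted integration of deductive and inductive results, might look like the following sketch. The rule table, the scoring formula, and the fixed weight are illustrative assumptions; as the text notes, the weighting could instead be trained.

```python
# Hypothetical rule table: if a document is a given type, certain
# associated data fields are expected to be present.
EXPECTED_FIELDS = {
    "purchase_order": {"po_number", "total_amount", "item_number"},
    "invoice": {"invoice_number", "invoice_date", "amount_due"},
}

def deductive_score(doc_type: str, extracted_fields: set) -> float:
    """Fraction of the expected fields actually found in the document."""
    expected = EXPECTED_FIELDS.get(doc_type, set())
    if not expected:
        return 0.0
    return len(expected & extracted_fields) / len(expected)

def combine(deductive: float, inductive: float, w: float = 0.5) -> float:
    """Weighted integration of deductive and inductive results; the
    weight w is a placeholder for a value that could be learned."""
    return w * deductive + (1 - w) * inductive

score = deductive_score("purchase_order", {"po_number", "total_amount"})
overall = combine(score, 0.9, w=0.4)
```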

In some embodiments, the active vouching operations may include one or more inductive reasoning operations that may be based on a previous calculation or determination, historical information, or one or more additional insights. In some embodiments, inductive reasoning operations may be based on learning from previous instances of similar data (e.g., sample documents) to determine what may be expected from future data.

In some embodiments, active vouching module 214 may apply context awareness, deductive reasoning, and inductive reasoning together for hypothesis testing.

Turning now to the passive vouching pipeline (elements 216-224), system 200 may include three parallel pipelines within the passive vouching pipeline, as represented by template-based pipeline 216, templateless pipeline 218, and specialized pipeline 220. Each of pipelines 216-220 may comprise one or more processors configured to receive input data from OCR module 204 and/or from document classification module 206 and to process the received input data. Each of the pipelines 216-220 may apply respective data analysis operations to the received input data and may generate respective output data.

Template-based pipeline 216 may be configured to apply any one or more template-based analysis operations to the received document data and/or document classification data and to generate output data representing document contents, such as one or more tuples representing entity, location, and value for content extracted from the document. Template-based pipeline 216 may be configured to apply one or more document understanding models that are trained for a specific known format. Abbyy Flexicapture is an example of such a template-based tool.

Templateless pipeline 218 may be configured to apply any one or more analysis operations to the received document data and/or document classification data and to generate output data representing document contents, such as one or more tuples representing entity, location, and value for content extracted from the document. Templateless pipeline 218 may be configured to operate without any assumption that the documents being analyzed conform to a presumed "template" for document understanding. In some embodiments, a templateless approach may be less accurate than a template-based tool, and may require more training against a larger training set as compared to a template-based tool.

Specialized pipeline 220 may be configured to apply any one or more analysis operations to the received document data and/or document classification data and to generate output data representing document contents. In some embodiments, specialized pipeline 220 may be configured to apply a signature analysis. In some embodiments, signature analysis may include signature detection, for example using a machine-learning algorithm configured to determine whether or not a signature is present. In some embodiments, additionally or alternatively to signature detection, signature analysis may include signature matching, for example using one or more data processing operations to determine a person whose signature matches a detected signature (for example by leveraging comparison to a library of known signatures).

In some embodiments, specialized pipeline 220 may be used when system 200 has access to outside information, such as information in addition to information from documents source 202 and from ERP data source 208. For example, specialized pipeline may be configured to use information from knowledge substrate 210 in analyzing the received data and generating output data.

In some embodiments, pipeline 220 may be configured to extract data from documents that includes additional data (or data in a different format) as compared to data that is extracted by pipelines 216 and 218. For example, pipeline 220 may extract data other than (or in addition to) a tuple representing entity, location, and value. The extracted data may include logo data, signature data (e.g., an image or other representation of the signature, an indication as to whether there is a signature, etc.), figures, drawings, or the like. For an extracted logo, output data may include the logo itself (e.g., an image or other representation of the logo), a location within the document, and/or a customer name matched to the logo. For an extracted signature, output data may include the signature itself (e.g., an image or other representation of the signature), a location within the document, and/or a customer name matched to the signature. For extracted handwriting, output data may include the handwriting itself (e.g., an image or other representation of the handwriting), a location within the document, a customer name matched to the handwriting, and/or text extracted from the handwriting. For an extracted figure, output data may include the figure itself (e.g., an image or other representation of the figure), a location within the document, and/or a bounding box for the figure.

System 200 may include normalization and contextualization module 222, which may include one or more processors configured to perform one or more data normalization and/or contextualization operations. Normalization and contextualization module 222 may receive input data from any one or more of: (a) template-based pipeline 216, (b) templateless pipeline 218, (c) specialized pipeline 220, and (d) knowledge substrate 210. Normalization and contextualization module 222 may apply one or more normalization and contextualization operations to some or all of the received data and may thereby generate normalized and/or contextualized output data.

A normalization and contextualization data processing operation may determine the context of an entity and/or may normalize an entity value so that it can be used for subsequent comparison or classification. Examples include (but are not limited to) the following: normalization of customer name data (such as aliases and abbreviations, and potentially including parent/sibling/subsidiary entities when the name is used in the context of payment) based on master customer/vendor data; normalization of address data (e.g., based on geocoding, based on standardized addresses from a postal office, and/or based on customer/vendor data); normalization of product name and SKU based on master product data; normalization of shipping and payment terms (e.g., based on International Commercial Terms); and/or normalization of currency exchange code (e.g., based on ISO 4217).
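
A minimal sketch of two of the normalization examples above follows. The alias table and the currency subset are illustrative assumptions standing in for master customer data and the full ISO 4217 code list.

```python
# Hypothetical master-data alias table (assumption for demonstration).
MASTER_CUSTOMER_ALIASES = {
    "acme corp": "ACME Corporation",
    "acme corporation": "ACME Corporation",
    "a.c.m.e.": "ACME Corporation",
}

# Tiny subset of ISO 4217 currency codes for the sketch.
ISO_4217 = {"USD", "EUR", "JPY", "GBP"}

def normalize_customer(name: str):
    """Map a raw customer string to a master customer name, if known.

    Collapses case and internal whitespace before the alias lookup.
    """
    key = " ".join(name.lower().split())
    return MASTER_CUSTOMER_ALIASES.get(key)

def normalize_currency(code: str):
    """Return the canonical currency code, or None if not recognized."""
    code = code.strip().upper()
    return code if code in ISO_4217 else None

assert normalize_customer("Acme  Corp") == "ACME Corporation"
assert normalize_currency("usd") == "USD"
```

Normalized values of this kind are what allow the downstream passive vouching comparison to be meaningful across documents with inconsistent surface forms.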

The normalized and/or contextualized output data generated by normalization and contextualization module 222 may be provided in any suitable data format, for example as a set of tuples representing entity, entity location, normalized entity value, and confidence score.

System 200 may include passive vouching and tracing module 224, which may include one or more processors configured to perform one or more passive vouching and tracing operations. Passive vouching and tracing module 224 may receive input data from any one or more of: (a) normalization and contextualization module 222, (b) knowledge substrate 210, and (c) ERP data source 208. Passive vouching and tracing module 224 may apply one or more passive vouching and/or tracing operations to some or all of the received data and may thereby generate passive vouching and tracing output data. Passive vouching may comprise comparing values from a given transaction record (e.g., as represented in ERP data) with entity values extracted from documents data (which may be assumed to be the evidence that is associated with the transaction record). Passive tracing may comprise comparing values from a given document with a corresponding transaction record, e.g., from the ERP. Comparison of entity values may be precise, such that the generated result indicates either a match or a mismatch, or the comparison may be fuzzy, such that the generated result comprises a similarity score.
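
The precise-versus-fuzzy comparison described above can be sketched as follows, using the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the disclosed system does not specify a particular fuzzy metric, so this choice is an assumption.

```python
from difflib import SequenceMatcher

def compare_values(erp_value: str, doc_value: str, fuzzy: bool = True):
    """Compare an ERP value with an extracted document value.

    Precise mode returns a match/mismatch boolean; fuzzy mode returns
    a similarity score in [0, 1].
    """
    if not fuzzy:
        return erp_value == doc_value
    return SequenceMatcher(None, erp_value, doc_value).ratio()

# Precise comparison: exact match or mismatch.
assert compare_values("ACME Corporation", "ACME Corporation", fuzzy=False)

# Fuzzy comparison: a similarity score rather than a binary result.
score = compare_values("ACME Corporation", "ACME Corp")
```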

The passive vouching and tracing output data generated by passive vouching and tracing module 224 may be provided in any suitable data format. The passive vouching and tracing operations performed by module 224 may be considered "context aware" because they are able to draw on contextual information received from knowledge substrate 210. In some embodiments, the passive vouching output may include four values: an entity name, an entity value, a location (indicating an exact or relative location of the entity), and a confidence value indicating a confidence level of the determined match.

Downstream of both the active vouching pipeline and the passive vouching pipeline, system 200 may be configured to combine the results of the active vouching and the passive vouching pipelines in order to generate a combined result.

System 200 may include data integrity integration module 226, which may include one or more processors configured to perform one or more data integrity integration operations. Data integrity integration module 226 may receive input data from any one or more of: (a) active vouching module 214 and (b) passive vouching and tracing module 224. Data integrity integration module 226 may apply one or more data integrity integration operations to some or all of the received data and may thereby generate data integrity integration output data. The data integrity integration output data generated by data integrity integration module 226 may be provided in any suitable data format, and may for example include a combined confidence score indicating a confidence level (e.g., a percentage confidence) by which system 200 has determined that the underlying documents vouch for the ERP information. In some embodiments, the data integrity integration output data may comprise a set of tuples—e.g., representing entity, match score, and confidence—for each of the entities that have been analyzed. A decision (e.g., a preliminary decision) on whether the evidence is considered to support the existence and accuracy of a record (e.g., an ERP record) may be rendered as part of the data integrity integration output data.

In some embodiments, the one or more data integrity integration operations applied by module 226 may process the input data from active vouching module 214 and passive vouching module 224 in accordance with one of the following four scenarios:

    • Scenario 1—in embodiments in which active vouching module 214 and passive vouching module 224 each confirm an entity, the two confidence values associated with the two vouching methods may be combined with one another (e.g., through averaging and/or through a multiplication operation), including optionally by being used to boost one another, to generate an overall confidence level, or the higher of the two confidence levels may be chosen as the overall confidence level;
    • Scenario 2—in embodiments in which active vouching module 214 confirms an entity but passive vouching module 224 does not confirm an entity, the confidence level from active vouching module 214 may be used as an overall confidence level (with or without downward adjustment to reflect the lack of confirmation by passive vouching module 224);
    • Scenario 3—in embodiments in which passive vouching module 224 confirms an entity but active vouching module 214 does not confirm an entity, the confidence level from passive vouching module 224 may be used as an overall confidence level (with or without downward adjustment to reflect the lack of confirmation by active vouching module 214);
    • Scenario 4—in embodiments in which active vouching module 214 and passive vouching module 224 generate conflicting results, the system may apply one or more operations to reconcile the conflicting results. In some embodiments, integrating result from passive and active vouching may comprise resolving an entity value, e.g., based on confidence level(s) obtained from passive and active approaches. This resolution may be performed for each individual entity.
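
The four scenarios above can be expressed as a single per-entity resolution function. This is a sketch under stated assumptions: the text permits several combination strategies for Scenario 1 (averaging, multiplication, boosting, or taking the higher value), and here the higher-confidence option is chosen for both Scenario 1 and the Scenario 4 reconciliation.

```python
def integrate(active_conf, passive_conf, active_value=None, passive_value=None):
    """Return (resolved_value, overall_confidence) for one entity.

    A confidence of None means the corresponding pipeline did not
    confirm the entity.
    """
    if active_conf is not None and passive_conf is not None:
        if active_value == passive_value:
            # Scenario 1: both pipelines confirm; take the higher confidence.
            return active_value, max(active_conf, passive_conf)
        # Scenario 4: conflicting results; resolve by higher confidence.
        if active_conf >= passive_conf:
            return active_value, active_conf
        return passive_value, passive_conf
    if active_conf is not None:
        # Scenario 2: only active vouching confirms.
        return active_value, active_conf
    if passive_conf is not None:
        # Scenario 3: only passive vouching confirms.
        return passive_value, passive_conf
    return None, 0.0

value, conf = integrate(0.8, 0.9, "PBC2145XC01", "PBC2145XC01")
```

The resolution is performed per entity, matching the text's note that Scenario 4 reconciliation happens for each individual entity.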

FIGS. 3A-3B depict a diagram of how a fingerprinting algorithm may be used as part of a process to render a decision (e.g., a confidence value) about whether a purchase order is vouched, in some embodiments, by the systems disclosed herein. FIGS. 3A-3B depict how two evidence sets may be used to generate an overall result indicating a vouching confidence level. In the example of FIGS. 3A-3B, "evidence set 1" may comprise output data generated by an active vouching algorithm, and may share any one or more characteristics in common with the output data generated by active vouching module 214 in system 200. In the example of FIGS. 3A-3B, "evidence set 2" may comprise output data generated by one or more document processing pipelines, and may share any one or more characteristics in common with the output data generated by pipelines 216, 218, and/or 220 in system 200. In some embodiments, the combination of evidence set 1 and evidence set 2, as shown in FIGS. 3A-3B, to generate a vouching decision and/or a confidence value (as shown, for example, in FIG. 3B), may correspond to any one or more of modules 222, 224, and 226 in system 200.

Fingerprinting is a technique that may leverage ERP data to aid document understanding and vouching. Fingerprinting uses the context from the ERP as a fingerprint that guides how the system searches an unstructured document for evidence of a match. By knowing what PO characteristics to look for from the ERP entry (e.g., the specific PO #, the set of item numbers associated with the PO, the total amount of the PO, etc.), the system may look for that evidence in the attached PO (an unstructured document).

One advantage of fingerprinting is that it may provide important context that allows an AI algorithm to make a better judgment about what it is seeing on a document, such that the system can achieve higher extraction accuracy and match rates. One drawback of fingerprinting is that, if not used carefully, it may introduce bias, e.g., causing the system to see "only what you want to see." For example, there may be additional attachments (POs, transactions, statements) that bear no relationship to the ERP but should nonetheless be carefully reviewed. Thus, in some embodiments fingerprinting should not be used alone, but rather should be combined with other vouching logic and algorithms to ensure accuracy and effectiveness.

In some embodiments, fingerprinting can include a simple search for an expected value, such as a particular PO number. Because a PO number is highly distinctive, this may work well in most cases, giving the system confidence that if it found PBC2145XC01, it did indeed match on the expected PO number. However, other fields might not be as simple; consider, for example, the field Quantity. Searching for a value of '1' could return a number of matches on a single document and even more across an entire set of documents, giving the system little confidence that it has indeed matched on Quantity. Thus, it is important to include the ability to measure the system's confidence, as well as to design additional algorithms and ML models to help improve confidence and home in on the right match. For example, if the system determines that the Item # and Unit Price for the PO line with that Quantity are located nearby or reside on the same PO line, this gives the match higher confidence and can remove other spurious matches of the value '1'. Confidence in fingerprinting may be refined by combining what is learned from 1) template-based extraction, 2) template-less extraction, and 3) additional ML models and algorithms on top of search findings, to remove spurious matches and increase confidence in matches.
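
The simple-search form of fingerprinting, including the confidence measure of 1/number of matches used later in the algorithm description, can be sketched as follows. The sample document text is illustrative.

```python
import re

def fingerprint_search(text: str, expected_value: str):
    """Search a document's text for an expected ERP value.

    Returns (matched, confidence): a unique hit yields confidence 1.0,
    while many hits (e.g., Quantity values of '1') yield low confidence.
    """
    hits = [m.start() for m in re.finditer(re.escape(expected_value), text)]
    if not hits:
        return False, 0.0
    return True, 1.0 / len(hits)

doc = "PO PBC2145XC01  Qty 1  Item A-100  Qty 1  Item B-200"

# A distinctive PO number matches exactly once: high confidence.
assert fingerprint_search(doc, "PBC2145XC01") == (True, 1.0)

# A Quantity of "1" matches in many places: low confidence.
matched, conf = fingerprint_search(doc, "1")
```

This illustrates why the document argues for layering spatial models and fuzzy scoring on top of raw search: the raw hit count alone cannot distinguish the intended Quantity field from incidental occurrences of the same digit.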

FIGS. 3A-3B show how various document-understanding components function together with fingerprinting, in accordance with some embodiments. The combination of functions shown in FIGS. 3A-3B may enable improved overall goals, including an increased percent of vouched entries and an increased confidence on vouched entries.

FIG. 4 shows a diagram of a fingerprinting algorithm, in accordance with some embodiments.

In some embodiments, a fingerprinting algorithm may generate output for PO Headers and/or PO Lines. The algorithm may support exact match (fuzzy=1.0) and fuzzy matches. The algorithm may use Elasticsearch to index OCR text extraction of unstructured documents for search and/or lookup. The algorithm may use entity extraction to identify and normalize dates. The algorithm may use one or more spatial models to identify PO Lines to reduce spurious matches. The algorithm may support derived total amount search. The algorithm may support delivery terms synonyms.

In some embodiments, the fingerprinting algorithm may include one or more of the following steps, sub-steps, and/or features:

1) Prepare the ERP data for search (prepare_master.ipynb).
    • a) This puts the ERP data in a standard format for searching field content against an unstructured document. If one follows the same format, this can be applied to other ERP entries (invoices, shipment tracking numbers, etc.).
    • b) Also computes the total amount from the PO lines; the algorithm will look for this derived total amount while going through the "PO Headers" in Step 6.
2) Perform text extraction of PDFs using Abbyy Finereader FRE.
    • a) This produces a _basic.XML file that has all the text blocks.
3) Create a concatenated text document from these text blocks.
4) Perform entity extraction on the text document.
5) Index the text document into Elasticsearch (text plus entities and some metadata).
    • a) Incorporate document classification model results so the system knows which documents are POs.
      • i) It is optional whether the system excludes non-POs from indexing or marks them in Elasticsearch.
6) Run the fingerprinting search on PO headers.
    • a) For each field, analyze the expected ERP data and generate text value candidates.
      • i) For example, delivery terms will have a set of synonyms to the one in the ERP as search candidates.
      • ii) For example, dates will be normalized to search against the date entities of documents.
    • b) Issue an appropriate query against Elasticsearch.
      • i) Target documents with the same SO.
      • ii) If non-POs were included, optionally limit to docclass=PO.
    • c) Evaluate the Elasticsearch results.
      • i) Interpret and find fuzzy matches from Elasticsearch highlighted text.
      • ii) Compute fuzzy scores with search candidates.
      • iii) Match if the fuzzy score is equal to or above the configured threshold.
      • iv) Compute confidence (1/number of matches).
7) Run the fingerprinting search on PO lines.
    • a) The PO lines search is run separately from the PO headers.
    • b) Run an algorithm to identify PO lines:
      • i) For each SO:
        • (1) From the ERP, find all the item numbers; these are used as anchors.
        • (2) Find all POs (from document classification results) for this SO, and for each document:
          • (a) Identify the locations in the text of all anchor values (i.e., item numbers).
          • (b) Calculate the spacing between anchor values (number of word tokens apart).
          • (c) Calculate the average of these spacings as the line window width.
        • (3) With the line window width and the locations of the anchors, the system knows the vicinity of values for a given PO line.
    • c) Run a search for each ERP PO line, limited to the PO line window of text identified in the previous step.
      • i) For each PO line in the ERP, look for the line values (e.g., Item #, Unit Price, Quantity, etc.) in the corresponding PO line window.
        • (1) The window may be defined as: (location of anchor−window size, location of anchor+window size).
        • (2) This may be refined with more experiments.
        • (3) Match if the fuzzy score is equal to or above the configured threshold.
        • (4) Compute confidence (1/number of matches).

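The line-window heuristic from step 7 can be sketched as follows: ERP item numbers act as anchors, and the average token spacing between anchors defines a window in which the line values (Quantity, Unit Price, etc.) are sought. The tokenization and fallback behavior here are illustrative assumptions.

```python
def line_windows(tokens, anchor_values):
    """Return {anchor_value: (start, end)} token-index windows.

    The window width is the average spacing between anchor positions,
    mirroring steps 7(b)(2)(b)-(c); windows are clamped to the document.
    """
    positions = [i for i, t in enumerate(tokens) if t in anchor_values]
    if len(positions) < 2:
        width = len(tokens)  # fall back to the whole document
    else:
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        width = sum(gaps) // len(gaps)  # average anchor spacing
    return {tokens[p]: (max(0, p - width), min(len(tokens), p + width))
            for p in positions}

# Two PO lines, anchored on the ERP item numbers A-100 and B-200.
tokens = "A-100 2 9.99 pcs B-200 1 4.50 pcs".split()
windows = line_windows(tokens, {"A-100", "B-200"})
```

Limiting each per-line search to its window is what removes spurious matches of common values such as a Quantity of 1 appearing elsewhere in the document.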
Payment Vouching for Assurance

Given the need to perform automated vouching, there is a corresponding need for improved systems and methods for vouching ERP entries against bank statement data in order to verify payment.

In some embodiments, a system is configured to vouch payment data against evidence data. More specifically, a system may be configured to provide a framework that vouches ERP payment activities against physical bank statements. The system may include a pipeline that performs information extraction and characteristics extraction from bank statements, and the system may leverage one or more advanced data structures and matching algorithms to perform one-to-many matching between ERP data and bank statement data. The payment vouching systems provided herein may thus automate the process of finding material evidence, such as remittance advice or bank statements, to corroborate ERP payment entries.

The system may be configured to receive a data set comprising bank statement data, wherein the bank statement data may be provided, for example, in the form of PDF files or JPG files of bank statements. The system may apply one or more data processing operations (e.g., AI models) to the received bank statement data in order to extract information (e.g., key content and characteristics) from said data. The extracted information may be stored in any suitable output format, and/or may be used to generate one or more feature vectors representing one or more bank statements in the bank statement data.

The system may be configured to receive a data set comprising ERP data, wherein the ERP data may comprise one or more ERP entries. The system may apply one or more data processing operations (e.g., AI models) to the received ERP data in order to extract information (e.g., key content and characteristics) from said data. The extracted information may be stored in any suitable output format, and/or may be used to generate one or more feature vectors representing one or more ERP entries in the ERP data.

The system may be configured to apply one or more algorithms (e.g., matching algorithms) to compare the information extracted from the bank statements against the information extracted from the ERP entries, and to thereby determine whether the bank statements sufficiently vouch the ERP entries. In some embodiments, performing the comparison may comprise applying an approximation algorithm configured to achieve better matching rates between ERP records and bank statements with minor numeric discrepancies, which may be caused, for example, due to currency conversion, rather than being indicative of substantive discrepancies. The system may determine, based on the similarity or dissimilarity of the information indicated by the two information sets, whether one or more vouching criteria are satisfied. The system may generate an output that indicates a level of matching between the bank statements and ERP entries (e.g., a similarity score), an indication of whether one or more vouching criteria (e.g., a threshold similarity score and/or threshold confidence level) are met, an indication of any discrepancies identified, and/or a level of confidence (e.g., a confidence score) in one or more conclusions reached by the system. In some embodiments, output data may be stored, transmitted, presented to a user, used to generate one or more visualizations, and/or used to trigger one or more automated system actions.
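
The approximation idea for tolerating minor numeric discrepancies (e.g., small currency-conversion differences) can be sketched as a relative-tolerance comparison. The 0.5% tolerance used here is an illustrative assumption, not a value specified by the disclosure.

```python
def amounts_match(erp_amount: float, bank_amount: float,
                  rel_tol: float = 0.005) -> bool:
    """Treat amounts within rel_tol (0.5% here) of each other as a match,
    so minor conversion/rounding differences do not block vouching."""
    if erp_amount == bank_amount:
        return True
    base = max(abs(erp_amount), abs(bank_amount))
    return base > 0 and abs(erp_amount - bank_amount) / base <= rel_tol

assert amounts_match(1000.00, 1000.00)        # exact match
assert amounts_match(1000.00, 1002.50)        # 0.25% off: tolerated
assert not amounts_match(1000.00, 1100.00)    # substantive discrepancy
```

A substantive discrepancy (well outside the tolerance) would instead be surfaced in the system's discrepancy output rather than silently matched.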

In some embodiments, the system may be configured in a modular manner, such that one or more data processing operations may be modified without modification of one or more feature engineering and/or data comparison operations, and vice versa. This may allow the system to be configured and fine-tuned in accordance with changes in business priorities, requested new features, or evolution of legal or regulatory requirements.

FIGS. 5A-5B show a diagram of a payment vouching method 500, in accordance with some embodiments. In some embodiments, all or part of the method depicted in FIGS. 5A-5B may be applied by the systems described herein (e.g., system 200). In some embodiments, a payment vouching method may seek to match data representing one or more of the following: date, amount, customer name, and invoice number. As shown in FIG. 5A, the system may accept ERP payment journal data and bank statement data as inputs (optionally following data pre-processing and formatting). The bank statement data may be subject to one or more AI information extraction models to extract information regarding transaction category, customer name, and invoices. The system may then apply a first matching algorithm, for example a fuzzy matching algorithm, to compare the ERP data to the data extracted from the bank statements. If a match is detected, then the system may, among one or more other operations, apply one or more comparison and/or scoring operations in order to generate overall match score data and overall confidence data. If no match is detected, then the system may apply a second matching algorithm, for example an optimization algorithm that has been proposed to solve the Knapsack problem. If no match is detected by the second algorithm, then an overall match score of 0 may be generated. If a match is detected by the second algorithm, then the system may select an optimal subset candidate and may, among one or more other operations, apply one or more comparison and/or scoring operations in order to generate an overall match score and an overall confidence score. A more detailed description follows.
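
The second-stage matching described above, applied when no single entry matches, can be sketched as a subset-sum search: find a subset of ERP invoice amounts whose total equals one bank-statement deposit (a knapsack-style problem). Brute force is used here for clarity; a production system would use an actual optimization algorithm, and the invoice amounts are illustrative.

```python
from itertools import combinations

def find_subset(invoice_amounts, deposit, tol=0.01):
    """Return a subset of invoice amounts summing to the deposit
    (within tol), or None if no such subset exists."""
    for r in range(1, len(invoice_amounts) + 1):
        for combo in combinations(invoice_amounts, r):
            if abs(sum(combo) - deposit) <= tol:
                return combo
    return None

# One bank deposit of 164.50 covering two separate ERP invoices.
subset = find_subset([120.00, 75.50, 310.00, 44.50], 164.50)
```

This is the one-to-many case mentioned earlier: a single bank-statement line may correspond to several ERP payment entries paid together.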

At block 502, in some embodiments, the system may receive data representing ERP information, for example by receiving data from an ERP payment journal data source. The data representing ERP information may be received automatically, according to a predefined schedule, in response to one or more trigger conditions being met, as part of a scraping method, and/or in response to a user input. The system may receive the ERP data in any acceptable format. In some embodiments, ERP data may be provided in a tabular data format, including a data model that defines the structure of the data. ERP data may be received from "accounts receivable" data or from "cash received" data. ERP data may be in a tabular format including customer name, invoice data, and invoice amount.

At block 504, in some embodiments, the system may receive data representing one or more bank statements. The data representing the bank statements may be received automatically, according to a predefined schedule, in response to one or more trigger conditions being met, as part of a scraping method, and/or in response to a user input. The system may receive the bank statement data in any acceptable format, for example as a structured and/or unstructured document, including for example a PDF document. In some embodiments, the system may receive bank statement data in PDF format and/or CSV format. In some embodiments, the system may download electronic bank statement data (such as BAI/BAI2, Multicash, MT940). In some embodiments, the system may receive bank statement data via EDI and/or ISO 20022. In some embodiments, the system may receive bank statement data through one or more API aggregators such as Plaid and Yodlee.

At block 506, in some embodiments, the system may apply one or more information extraction models to the data representing the one or more bank statements. The one or more information extraction models may generate transaction category data 508, customer name data 510, and/or invoice data 512. The extracted information may be stored, displayed to a user, transmitted, and/or used for further processing for example as disclosed herein.

At block 514, in some embodiments, the system may apply one or more fuzzy matching algorithms. The one or more fuzzy matching algorithms may accept input data including (but not limited to) data representing ERP information from block 502, transaction category data 508, customer name data 510, and/or invoice data 512. The one or more fuzzy matching algorithms may compare data in a many-to-many manner. The one or more fuzzy matching algorithms may process the received input data in order to determine whether there is a match or a near match (e.g., a “fuzzy match”) between the data representing ERP information and the transaction category data 508, customer name data 510, and/or invoice data 512. The one or more fuzzy matching algorithms may generate data representing an indication as to whether or not a match has been determined. The indication may comprise a binary indication as to whether or not a match has been determined and/or may comprise a confidence score representing a confidence level that a match has been determined.
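By way of illustration, a fuzzy many-to-many comparison of the kind described above can be sketched as follows. The record fields, the use of Python's `difflib`, and the 0.85 threshold are illustrative assumptions rather than features of any particular embodiment:

```python
from difflib import SequenceMatcher

def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def fuzzy_match(erp_rows, bank_rows, threshold=0.85):
    """Many-to-many comparison: return (erp_index, bank_index, score)
    triples whose customer-name similarity meets the threshold."""
    matches = []
    for i, erp in enumerate(erp_rows):
        for j, txn in enumerate(bank_rows):
            score = fuzzy_ratio(erp["customer"], txn["customer"])
            if score >= threshold:
                matches.append((i, j, score))
    return matches
```

In practice, an embodiment might compare several fields (e.g., date, amount, customer name, and invoice number) and combine the per-field similarities into component scores.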

At block 516, in some embodiments, the system may determine whether a match was determined at block 514. In some embodiments, the system may reference output data generated by the one or more fuzzy matching algorithms to determine whether a match was determined, for example by referencing whether a match is indicated by the output data on a binary basis. In some embodiments, the system may determine whether a match score generated at block 514 exceeds one or more predetermined or dynamically-determined threshold values in order to determine whether match criteria are met and thus whether a match is determined. In accordance with a determination that a match was determined, method 500 may proceed to blocks 518-538. In accordance with a determination that a match was not determined, method 500 may proceed to block 540 and onward.

Turning first to cases in which it is determined at block 516 that a match was determined, attention is drawn to block 518. At block 518, the system may determine whether the match that was determined is a one-to-one match. In some embodiments, the system may reference output data generated by the one or more fuzzy matching algorithms to determine whether the match that was determined is a one-to-one match. In accordance with a determination that the match that was determined is a one-to-one match, the method may proceed to one or both of blocks 520 and 524.

At block 520, in some embodiments, the system may apply a fuzzy comparison algorithm to data representing customer name information. In some embodiments, the system may compare customer name data in the data representing ERP information (received at block 502) to customer name data in the data representing one or more bank statements (received at block 504). The comparison of customer name data may generate output data comprising customer name match score 522, which may indicate an extent to which and/or a confidence with which the compared customer name data matches.

At block 524, in some embodiments, the system may apply a fuzzy comparison algorithm to data representing invoice information. In some embodiments, the system may compare invoice data in the data representing ERP information (received at block 502) to invoice data in the data representing one or more bank statements (received at block 504). The comparison of invoice data may generate output data comprising invoice match score 526, which may indicate an extent to which and/or a confidence with which the compared invoice data matches.

In some embodiments, the processes represented by blocks 518, 520, and 524 may be performed as follows. The system may test whether there is a match between data extracted from the bank statements and the ERP data for the following three attributes: fuzzy date comparison, where small deviations of date data between bank statements and ERP data may be considered acceptable; fuzzy customer name comparison, which may allow comparing normalized customer name data from bank statements (if present) with customer name data from ERP data; and fuzzy invoice number comparison, which may allow comparing invoice numbers from bank statements (if present) with invoice numbers from ERP data. It should be noted that customer name and invoice number might not always be available in the bank statement data.

In some embodiments, one or more other component scores, aside from or in addition to a customer name match score and an invoice match score, may be computed.

In addition to or alternatively to customer name match score 522 and invoice match score 526, the system may generate data comprising temporal match score 528, for example by performing a fuzzy comparison of date data as shown at block 527. Temporal match score 528 may be computed based on a temporal difference (e.g., a number of days difference) in compared data. For example, the system may compare a date indicated in the data representing ERP information (received at block 502) to a date indicated in the data representing one or more bank statements (received at block 504), and may generate temporal match score 528 based on the difference between the two compared dates.
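As an illustrative sketch of such a temporal comparison, a score may decay with the day difference between the compared dates; the linear decay and the seven-day tolerance below are assumptions, not requirements of any embodiment:

```python
from datetime import date

def temporal_match_score(erp_date: date, bank_date: date,
                         max_days: int = 7) -> float:
    """Score in [0, 1] that decays linearly with the number of days
    between the two dates and is 0 beyond max_days."""
    diff = abs((erp_date - bank_date).days)
    return max(0.0, 1.0 - diff / max_days)
```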

Following generation of component scores including for example customer name match score 522, invoice match score 526, and/or temporal match score 528, the system may generate an overall match score and/or an overall confidence score based on the component scores.

At block 532, in some embodiments, the system may compute overall match score 534. Computation of overall match score 534 may comprise applying an averaging algorithm (e.g., averaging non-zero component scores), for example by computing a weighted or unweighted average of one or more underlying component scores. In some embodiments, overall match score 534 may be computed as the sum of three terms: a weighted fuzzy date comparison score (e.g., weighted 528), a weighted fuzzy customer name comparison score (e.g., weighted 522), and a weighted fuzzy invoice number comparison score (e.g., weighted 526). Because overall match score 534 is additive, it may be higher when it is based on a comparison of more (e.g., all three) underlying terms than when it is based on fewer.
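A minimal sketch of such an additive, weighted overall match score follows; the weights and the handling of missing components are illustrative assumptions:

```python
def overall_match_score(date_score, name_score, invoice_score,
                        weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the component scores. A component that could not
    be computed (e.g., a missing invoice number) is passed as None and
    contributes nothing, so a score built on all three terms is higher
    than one built on fewer."""
    components = (date_score, name_score, invoice_score)
    return sum(w * s for w, s in zip(weights, components) if s is not None)
```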

At block 536, in some embodiments, the system may compute overall confidence score 538. Computation of overall confidence score 538 may comprise applying an algorithm based on one or more underlying confidence scores, such as confidence scores associated with one or more of the underlying component scores. In some embodiments, a highest underlying confidence score may be selected as overall confidence score 538. In some embodiments, a lowest underlying confidence score may be selected as overall confidence score 538. In some embodiments, a weighted or unweighted average of underlying confidence scores may be computed as overall confidence score 538. In some embodiments, a product based on underlying confidence scores may be computed as overall confidence score 538.
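The alternative combination strategies described above can be sketched as follows; the strategy names and the single-function interface are illustrative:

```python
import math

def overall_confidence(confidences, strategy="min"):
    """Combine per-component confidence scores into an overall
    confidence score using one of the alternatives in the text:
    highest, lowest, average, or product."""
    if strategy == "max":
        return max(confidences)
    if strategy == "min":
        return min(confidences)
    if strategy == "mean":
        return sum(confidences) / len(confidences)
    if strategy == "product":
        return math.prod(confidences)
    raise ValueError(f"unknown strategy: {strategy}")
```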

Overall match score 534 and/or overall confidence score 538 may be stored, transmitted, presented to a user, used to generate one or more visualizations, and/or used to trigger one or more automated system actions.

Turning now to cases in which it is determined at block 516 that a match was not determined, attention is drawn to block 540. At block 540, in some embodiments, the system may apply one or more amount matching algorithms, for example including one or more optimization algorithms that have been proposed to solve the Knapsack problem. The one or more amount matching algorithms may accept input data including (but not limited to) data representing ERP information from block 502, transaction category data 508, customer name data 510, and/or invoice data 512. The one or more amount matching algorithms may compare data in a one-to-many manner. The one or more amount matching algorithms may compare data from one bank transaction (e.g., data received at block 504) to data for many vouchers (e.g., data received at block 502). The one or more amount matching algorithms may process the received input data in order to determine whether there is a match between the data representing ERP information and the transaction category data 508, customer name data 510, and/or invoice data 512. The one or more amount matching algorithms may generate data representing an indication as to whether or not a match has been determined. The indication may comprise a binary indication as to whether or not a match has been determined and/or may comprise a confidence score representing a confidence level that a match has been determined.

At block 542, in some embodiments, the system may determine whether a match was determined at block 540. In some embodiments, the system may reference output data generated by the one or more amount matching algorithms to determine whether a match was determined, for example by referencing whether a match is indicated by the output data on a binary basis. In some embodiments, the system may determine whether a match score generated at block 540 exceeds one or more predetermined or dynamically-determined threshold values in order to determine whether match criteria are met and thus whether a match is determined. In accordance with a determination that a match was determined, method 500 may proceed to blocks 544-564. In accordance with a determination that a match was not determined, method 500 may proceed to block 566 and onward.

At block 544, in some embodiments, the system may select a candidate subset of data from the data received at block 502 and/or the data received at block 504. The analysis performed at blocks 546-564 may be performed with respect to the selected candidate subset of data. In some embodiments, to perform candidate subset selection, the system may identify a set of bank transactions that may be a match, and may then assess each item in the subset to determine which is the best match. In some embodiments, candidate subsets may include different numbers of items. For example, one candidate subset may be "three transactions that may match to a voucher," while another candidate subset may be "two transactions that may match to a voucher."

In some embodiments, candidate subset selection may proceed as follows: candidates may be sorted from largest to smallest; then those items in the sorted list that are already larger than the target may be eliminated, and only those which are smaller than or equal to the target amount are retained; then, a total amount from all of the remaining items may be computed, and those that match the target may be identified. In some embodiments, an overall objective may include determining whether the amount C from payment is a match to two or more elements among {A1, A2, A3}. If A1, A2, and A3 have been sorted from largest to smallest, then it may be necessary to test whether

C=A1+A2; or

C=A1+A3; or

C=A2+A3; or

C=A1+A2+A3.

Thus, if A1 is known to be larger than C, then any additive combination that includes A1 is also known to be larger than C and need not be tested; the only remaining possibility that may need to be tested is whether C=A2+A3.
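The subset-selection procedure above, including the elimination of candidates larger than the target, can be sketched as follows; the brute-force search over combinations and the 0.01 amount tolerance are illustrative simplifications of the optimization algorithms described herein:

```python
from itertools import combinations

def find_matching_subset(target, amounts, tol=0.01):
    """Return the first subset of two or more voucher amounts whose sum
    matches the target payment amount, after eliminating candidates that
    already exceed the target and sorting largest to smallest."""
    candidates = sorted((a for a in amounts if a <= target + tol),
                        reverse=True)
    for size in range(2, len(candidates) + 1):
        for combo in combinations(candidates, size):
            if abs(sum(combo) - target) <= tol:
                return combo
    return None  # no additive combination matches the target
```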

Based on the selected candidate subset, the system may generate one or more component scores, such as component scores 548, 552, and/or 556 described below.

At block 546, in some embodiments, the system may apply one or more subset match score algorithms to the selected candidate subset of data, thereby generating subset match score 548, which may indicate an extent to which and/or a confidence with which two or more components (e.g., data points) of the selected subset match with one another. Block 546 may compare a voucher amount to a bank amount. Block 546 may compare an amount appearing in the data received at block 502 to an amount appearing in the data received at block 504.

At block 550, in some embodiments, the system may apply one or more fuzzy name comparison algorithms to the selected candidate subset of data, thereby generating customer name match score 552, which may indicate an extent to which and/or a confidence with which two or more customer names in the selected subset match with one another. Block 550 may compare a customer name in voucher data with a customer name in statement data. Block 550 may compare a customer name appearing in the data received at block 502 to a customer name appearing in the data received at block 504.

At block 554, in some embodiments, the system may apply one or more fuzzy invoice comparison algorithms to the selected candidate subset of data, thereby generating invoice match score 556, which may indicate an extent to which and/or a confidence with which two or more invoices in the selected subset match with one another. Block 554 may compare two instances of invoice data to one another. Block 554 may compare invoice data appearing in the data received at block 502 to invoice data appearing in the data received at block 504.

Following generation of component scores including for example subset match score 548, customer name match score 552, and/or invoice match score 556, the system may generate an overall match score and/or an overall confidence score based on the component scores.

At block 558, in some embodiments, the system may compute overall match score 560. Computation of overall match score 560 may comprise applying an averaging algorithm (e.g., averaging non-zero component scores), for example by computing a weighted or unweighted average of one or more underlying component scores.

At block 562, in some embodiments, the system may compute overall confidence score 564. Computation of overall confidence score 564 may comprise applying an algorithm based on one or more underlying confidence scores, such as confidence scores associated with one or more of the underlying component scores. In some embodiments, a highest underlying confidence score may be selected as overall confidence score 564. In some embodiments, a lowest underlying confidence score may be selected as overall confidence score 564. In some embodiments, a weighted or unweighted average of underlying confidence scores may be computed as overall confidence score 564. In some embodiments, a product based on underlying confidence scores may be computed as overall confidence score 564.

Overall match score 560 and/or overall confidence score 564 may be stored, transmitted, presented to a user, used to generate one or more visualizations, and/or used to trigger one or more automated system actions.

Turning now to cases in which it is determined at block 542 that a match was not determined, attention is drawn to block 566. At block 566, in some embodiments, the system may determine that an overall match score is 0. The overall match score of 0 may be stored, transmitted, presented to a user, used to generate one or more visualizations, and/or used to trigger one or more automated system actions.

In some embodiments, the system may be configured to apply a plurality of different algorithms (e.g., two different algorithms, three different algorithms, etc.) as part of a payment vouching process. In some embodiments, the algorithms may be applied in parallel. In some embodiments, the algorithms may be applied in series. In some embodiments, the algorithms may be applied selectively dependent on the outcome of one another; for example, the system may first apply one algorithm and then may apply another algorithm selectively dependent on the outcome of the first algorithm (e.g., whether or not a match was indicated by the first algorithm). In some embodiments, the system may be configured to apply a waterfall algorithm, a fuzzy date-amount algorithm, and an optimization algorithm that has been proposed to solve the Knapsack problem.
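The selective, outcome-dependent application of two matching algorithms described above can be sketched as follows; the callable interface and the returned (score, confidence) tuples are illustrative assumptions:

```python
def payment_vouching_waterfall(erp_row, bank_rows, first_algo, second_algo):
    """Apply a first matching algorithm, then fall back to a second only
    if the first finds no match. Each algorithm returns a
    (match_score, confidence) tuple, or None when no match is found."""
    result = first_algo(erp_row, bank_rows)
    if result is not None:
        return result
    result = second_algo(erp_row, bank_rows)
    if result is not None:
        return result
    return (0.0, None)  # neither algorithm found a match: score of 0
```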

Computer

FIG. 6 illustrates an example of a computer, according to some embodiments. Computer 600 can be a component of a system for providing an AI-augmented auditing platform including techniques for automated assessment of vouching evidence. In some embodiments, computer 600 may execute any one or more of the methods described herein.

Computer 600 can be a host computer connected to a network. Computer 600 can be a client computer or a server. As shown in FIG. 6, computer 600 can be any suitable type of microprocessor-based device, such as a personal computer, workstation, server, or handheld computing device, such as a phone or tablet. The computer can include, for example, one or more of processor 610, input device 620, output device 630, storage 640, and communication device 660. Input device 620 and output device 630 can correspond to those described above and can either be connectable or integrated with the computer.

Input device 620 can be any suitable device that provides input, such as a touch screen or monitor, keyboard, mouse, or voice-recognition device. Output device 630 can be any suitable device that provides an output, such as a touch screen, monitor, printer, disk drive, or speaker.

Storage 640 can be any suitable device that provides storage, such as an electrical, magnetic, or optical memory, including a random access memory (RAM), cache, hard drive, CD-ROM drive, tape drive, or removable storage disk. Communication device 660 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or card. The components of the computer can be connected in any suitable manner, such as via a physical bus or wirelessly. Storage 640 can be a non-transitory computer-readable storage medium comprising one or more programs, which, when executed by one or more processors, such as processor 610, cause the one or more processors to execute methods described herein.

Software 650, which can be stored in storage 640 and executed by processor 610, can include, for example, the programming that embodies the functionality of the present disclosure (e.g., as embodied in the systems, computers, servers, and/or devices as described above). In some embodiments, software 650 can include a combination of servers such as application servers and database servers.

Software 650 can also be stored and/or transported within any computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch and execute instructions associated with the software from the instruction execution system, apparatus, or device. In the context of this disclosure, a computer-readable storage medium can be any medium, such as storage 640, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.

Software 650 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch and execute instructions associated with the software from the instruction execution system, apparatus, or device. In the context of this disclosure, a transport medium can be any medium that can communicate, propagate, or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport-readable medium can include but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation medium.

Computer 600 may be connected to a network, which can be any suitable type of interconnected communication system. The network can implement any suitable communications protocol and can be secured by any suitable security protocol. The network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.

Computer 600 can implement any operating system suitable for operating on the network. Software 650 can be written in any suitable programming language, such as C, C++, Java, or Python. In various embodiments, application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service, for example.

Following is a list of enumerated embodiments:

    • Embodiment 1. A system for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, the system comprising one or more processors configured to cause the system to:
    • receive data representing an ERP item;
    • generate hypothesis data based on the received data representing an ERP item;
    • receive an electronic document;
    • extract ERP information from the document;
    • apply a first set of one or more models to the hypothesis data and to extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item;
    • apply a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and
    • generate combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.
    • Embodiment 2. The system of embodiment 1, wherein extracting the ERP information comprises generating first data representing information content of the ERP information and second data representing a document location for the ERP information.
    • Embodiment 3. The system of any one of embodiments 1-2, wherein the ERP information comprises one or more of: a purchase order number, a customer name, a date, a delivery term, a shipping term, a unit price, and a quantity.
    • Embodiment 4. The system of any one of embodiments 1-3, wherein applying the first set of one or more models to generate output data is based on preexisting information regarding spatial relationships amongst instances of ERP information in documents.
    • Embodiment 5. The system of embodiment 4, wherein the preexisting information comprises a graph representing spatial relationships amongst instances of ERP information in documents.
    • Embodiment 6. The system of any one of embodiments 1-5, wherein the one or more processors are configured to cause the system to augment the hypothesis data based on one or more models representing contextual data.
    • Embodiment 7. The system of embodiment 6, wherein the contextual data comprises information regarding one or more synonyms for the information content of the ERP information.
    • Embodiment 8. The system of any one of embodiments 1-7, wherein the ERP information comprises a single word in the document.
    • Embodiment 9. The system of any one of embodiments 1-8, wherein the ERP information comprises a plurality of words in the document.
    • Embodiment 10. The system of any one of embodiments 1-9, wherein the second output data comprises one or more of:
    • a confidence score indicating a confidence level as to whether the extracted ERP information constitutes vouching evidence for the ERP item;
    • a binary indication as to whether the extracted ERP information constitutes vouching evidence for the ERP item; and
    • a location within the electronic document corresponding to the determination as to whether the extracted ERP information constitutes vouching evidence for the ERP item.
    • Embodiment 11. The system of embodiment 1, wherein generating the second output data comprises generating a similarity score representing a comparison of the ERP information and the ERP item.
    • Embodiment 12. The system of embodiment 11, wherein the similarity score is generated based on an entity graph representing contextual data.
    • Embodiment 13. The system of any one of embodiments 1-12, wherein extracting the ERP information from the document comprises applying a fingerprinting operation to determine, based on the received data representing an ERP item, a characteristic of a data extraction operation to be applied to the electronic document.
    • Embodiment 14. The system of any one of embodiments 1-13, wherein applying the second set of one or more models is based at least in part on contextual data.
    • Embodiment 15. The system of any one of embodiments 1-14, wherein applying the second set of one or more models comprises:
    • applying a set of document processing pipelines in parallel to generate a plurality of processing pipeline output data;
    • applying one or more data normalization operations to the plurality of processing pipeline output data to generate normalized data; and
    • generating the second output data based on the normalized data.
    • Embodiment 16. A non-transitory computer-readable storage medium storing instructions for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, the instructions configured to be executed by a system comprising one or more processors to cause the system to:
    • receive data representing an ERP item;
    • generate hypothesis data based on the received data representing an ERP item;
    • receive an electronic document;
    • extract ERP information from the document;
    • apply a first set of one or more models to the hypothesis data and to extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item;
    • apply a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and
    • generate combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.
    • Embodiment 17. A method for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, wherein the method is performed by a system comprising one or more processors, the method comprising:
    • receiving data representing an ERP item;
    • generating hypothesis data based on the received data representing an ERP item;
    • receiving an electronic document;
    • extracting ERP information from the document;
    • applying a first set of one or more models to the hypothesis data and to extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item;
    • applying a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and
    • generating combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.
    • Embodiment 18. A system for verifying an assertion against a source document, the system comprising one or more processors configured to cause the system to:
      • receive first data indicating an unverified assertion;
    • receive second data comprising a plurality of source documents;
    • apply one or more extraction models to extract a set of key data from the plurality of source documents; and
      • apply one or more matching models to compare the first data to the set of key data to generate an output indicating whether one or more of the plurality of source documents satisfies one or more verification criteria for verifying the unverified assertion.
    • Embodiment 19. The system of embodiment 18, wherein the one or more extraction models comprise one or more machine learning models.
    • Embodiment 20. The system of any one of embodiments 18-19, wherein the one or more matching models comprise one or more approximation models.
    • Embodiment 21. The system of any one of embodiments 18-20, wherein the one or more matching models are configured to perform one-to-many matching between the first data and the set of key data.
    • Embodiment 22. The system of any one of embodiments 18-21, wherein the one or more processors are configured to cause the system to modify one or more of the extraction models without modification of one or more of the matching models.
    • Embodiment 23. The system of any one of embodiments 18-22, wherein the one or more processors are configured to cause the system to modify one or more of the matching models without modification of one or more of the extraction models.
    • Embodiment 24. The system of any one of embodiments 18-23, wherein the unverified assertion comprises an ERP payment entry.
    • Embodiment 25. The system of any one of embodiments 18-24, wherein the plurality of source documents comprises a bank statement.
    • Embodiment 26. The system of any one of embodiments 18-25, wherein applying one or more matching models comprises generating a match score and generating a confidence score.
    • Embodiment 27. The system of any one of embodiments 18-26, wherein applying one or more matching models comprises: applying a first matching model;
    • if a match is indicated by the first matching model, generating a match score and a confidence score based on the first matching model;
    • if a match is not indicated by the first matching model:
      • applying a second matching model; and
      • if a match is indicated by the second matching model, generating a match score and a confidence score based on the second matching model; and
      • if a match is not indicated by the second matching model, generating a match score of 0.
    • Embodiment 28. A non-transitory computer-readable storage medium storing instructions for verifying an assertion against a source document, the instructions configured to be executed by a system comprising one or more processors to cause the system to:
      • receive first data indicating an unverified assertion;
    • receive second data comprising a plurality of source documents;
    • apply one or more extraction models to extract a set of key data from the plurality of source documents; and
      • apply one or more matching models to compare the first data to the set of key data to generate an output indicating whether one or more of the plurality of source documents satisfies one or more verification criteria for verifying the unverified assertion.
    • Embodiment 29. A method for verifying an assertion against a source document, wherein the method is executed by a system comprising one or more processors, the method comprising:
      • receiving first data indicating an unverified assertion;
      • receiving second data comprising a plurality of source documents;
      • applying one or more extraction models to extract a set of key data from the plurality of source documents; and
      • applying one or more matching models to compare the first data to the set of key data to generate an output indicating whether one or more of the plurality of source documents satisfies one or more verification criteria for verifying the unverified assertion.

This application incorporates by reference the entire contents of the U.S. patent application titled “AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR AUTOMATED ADJUDICATION OF COMMERCIAL SUBSTANCE, RELATED PARTIES, AND COLLECTABILITY”, filed Jun. 30, 2022, Attorney Docket no. 13574-20069.00.

This application incorporates by reference the entire contents of the U.S. patent application titled “AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR APPLYING A COMPOSABLE ASSURANCE INTEGRITY FRAMEWORK”, filed Jun. 30, 2022, Attorney Docket no. 13574-20070.00.

This application incorporates by reference the entire contents of the U.S. patent application titled “AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR AUTOMATED DOCUMENT PROCESSING”, filed Jun. 30, 2022, Attorney Docket no. 13574-20071.00.

This application incorporates by reference the entire contents of the U.S. patent application titled “AI-AUGMENTED AUDITING PLATFORM INCLUDING TECHNIQUES FOR PROVIDING AI-EXPLAINABILITY FOR PROCESSING DATA THROUGH MULTIPLE LAYERS”, filed Jun. 30, 2022, Attorney Docket no. 13574-20072.00.

Claims

1. A system for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, the system comprising one or more processors configured to cause the system to:

receive data representing an ERP item;
generate hypothesis data based on the received data representing an ERP item;
receive an electronic document;
extract ERP information from the document;
apply a first set of one or more models to the hypothesis data and to the extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item;
apply a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and
generate combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

2. The system of claim 1, wherein extracting the ERP information comprises generating first data representing information content of the ERP information and second data representing a document location for the ERP information.

3. The system of claim 1, wherein the ERP information comprises one or more of: a purchase order number, a customer name, a date, a delivery term, a shipping term, a unit price, and a quantity.

4. The system of claim 1, wherein applying the first set of one or more models to generate output data is based on preexisting information regarding spatial relationships amongst instances of ERP information in documents.

5. The system of claim 4, wherein the preexisting information comprises a graph representing spatial relationships amongst instances of ERP information in documents.

6. The system of claim 1, wherein the one or more processors are configured to cause the system to augment the hypothesis data based on one or more models representing contextual data.

7. The system of claim 6, wherein the contextual data comprises information regarding one or more synonyms for the information content of the ERP information.

8. The system of claim 1, wherein the ERP information comprises a single word in the document.

9. The system of claim 1, wherein the ERP information comprises a plurality of words in the document.

10. The system of claim 1, wherein the second output data comprises one or more of:

a confidence score indicating a confidence level as to whether the extracted ERP information constitutes vouching evidence for the ERP item;
a binary indication as to whether the extracted ERP information constitutes vouching evidence for the ERP item; and
a location within the electronic document corresponding to the determination as to whether the extracted ERP information constitutes vouching evidence for the ERP item.

11. The system of claim 1, wherein generating the second output data comprises generating a similarity score representing a comparison of the ERP information and the ERP item.

12. The system of claim 11, wherein the similarity score is generated based on an entity graph representing contextual data.

13. The system of claim 1, wherein extracting the ERP information from the document comprises applying a fingerprinting operation to determine, based on the received data representing an ERP item, a characteristic of a data extraction operation to be applied to the electronic document.

14. The system of claim 1, wherein applying the second set of one or more models is based at least in part on contextual data.

15. The system of claim 1, wherein applying the second set of one or more models comprises:

applying a set of document processing pipelines in parallel to generate a plurality of processing pipeline output data;
applying one or more data normalization operations to the plurality of processing pipeline output data to generate normalized data; and
generating the second output data based on the normalized data.

16. A non-transitory computer-readable storage medium storing instructions for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, the instructions configured to be executed by a system comprising one or more processors to cause the system to:

receive data representing an ERP item;
generate hypothesis data based on the received data representing an ERP item;
receive an electronic document;
extract ERP information from the document;
apply a first set of one or more models to the hypothesis data and to the extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item;
apply a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and
generate combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.

17. A method for determining whether data within an electronic document constitutes vouching evidence for an enterprise resource planning (ERP) item, wherein the method is performed by a system comprising one or more processors, the method comprising:

receiving data representing an ERP item;
generating hypothesis data based on the received data representing an ERP item;
receiving an electronic document;
extracting ERP information from the document;
applying a first set of one or more models to the hypothesis data and to the extracted ERP information in order to generate first output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item;
applying a second set of one or more models to the extracted ERP information in order to generate second output data indicating whether the extracted ERP information constitutes vouching evidence for the ERP item; and
generating combined determination data, based on the first output data and the second output data, indicating whether the extracted ERP information constitutes vouching evidence for the ERP item.
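The overall flow recited in claims 1, 16, and 17 (generate a hypothesis from the ERP item, extract ERP information from the document, apply two sets of models, and combine their outputs) can be sketched as below. Every function body here is an illustrative stub under stated assumptions: the key=value extraction, the match-fraction scoring, the `required` field set, and the 0.5 combination threshold are invented for the sketch and are not the claimed implementation.

```python
# Hedged sketch of the claimed determination flow (claims 1/16/17).
# All stubs and thresholds are illustrative assumptions.

def generate_hypothesis(erp_item: dict) -> dict:
    """Hypothesize the field values a supporting document should contain."""
    return dict(erp_item)


def extract_erp_information(document: str) -> dict:
    """Extract candidate ERP fields from document text (stub: key=value lines)."""
    info = {}
    for line in document.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            info[key.strip()] = value.strip()
    return info


def first_model_set(hypothesis: dict, extracted: dict) -> float:
    """Compare hypothesis fields against extracted fields; return match fraction."""
    if not hypothesis:
        return 0.0
    hits = sum(1 for k, v in hypothesis.items() if extracted.get(k) == str(v))
    return hits / len(hypothesis)


def second_model_set(extracted: dict) -> float:
    """Independent assessment of the extracted fields (stub: field completeness)."""
    required = {"po_number", "date", "amount"}  # assumed required fields
    return len(required & set(extracted)) / len(required)


def combined_determination(erp_item: dict, document: str) -> bool:
    """Combine both model-set outputs into a vouching-evidence determination."""
    extracted = extract_erp_information(document)
    first = first_model_set(generate_hypothesis(erp_item), extracted)
    second = second_model_set(extracted)
    return (first + second) / 2 >= 0.5
```

Note the structural point the claims emphasize: the first model set consumes both the hypothesis data and the extracted information, while the second model set assesses the extracted information independently, and only the combination step produces the final determination.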
Patent History
Publication number: 20230005075
Type: Application
Filed: Jun 30, 2022
Publication Date: Jan 5, 2023
Applicant: PricewaterhouseCoopers LLP (New York, NY)
Inventors: Chung-Sheng LI (Scarsdale, NY), Winnie CHENG (West New York, NJ), Mark John FLAVELL (Madison, NJ), Lori Marie HALLMARK (Xenia, OH), Nancy Alayne LIZOTTE (Saline, MI), Kevin Ma LEONG (Randolph, NJ), Di ZHU (Jersey City, NJ), Kevin Michael O'ROURKE (New York, NY), Eun Kyung KWON (New York, NY), Vandit NARULA (Monroe Township, NJ), Weichao CHEN (Secaucus, NJ), Maria Jesus Perez RAMIREZ (New York, NY)
Application Number: 17/854,329
Classifications
International Classification: G06Q 40/00 (20060101); G06V 30/412 (20060101); G06V 30/416 (20060101); G06N 5/04 (20060101);