Abstract: The Smart Optical Character Recognition (SOCR) Trainer is software that automates Quality Control (QC), using unsupervised machine-learning techniques to analyze, classify, and optimize textual data extracted from an image or PDF document. SOCR Trainer serves as a ‘data treatment’ utility service that can be embedded into data processing workflows (e.g., data pipelines, ETL processes, data versioning repositories). It runs a series of automated tests on the quality of images and their extracted text to determine whether the extraction is trustworthy. If deficiencies are detected, SOCR Trainer analyzes certain parameters of the document, performs conditional optimizations, re-extracts the text, and repeats the QC tests until the output meets the desired specifications. SOCR Trainer also produces audit files recording the provenance of, and the differences between, the original documents and the optimized document text.
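The test, optimize, re-extract, and repeat cycle described in the abstract can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not SOCR Trainer's actual implementation: `quality_score`, `run_qc_loop`, `AuditRecord`, the 0.95 threshold, the pass limit, and the stand-in `fake_extract` / `fake_optimize` functions are all hypothetical names and values chosen for the example. A real deployment would plug in an OCR engine and image-enhancement steps, and the abstract's unsupervised machine-learning tests would replace the naive character-ratio metric used here.

```python
import difflib
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class AuditRecord:
    """Hypothetical audit entry: provenance plus original-vs-optimized diff."""
    source: str          # where the document came from
    passes: int          # number of optimization passes performed
    scores: List[float]  # quality score after each extraction
    diff: List[str]      # unified diff of original vs. final text


def quality_score(text: str) -> float:
    """Assumed QC metric: share of non-space characters that look like
    ordinary text (alphanumerics or common punctuation)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    ok = sum(c.isalnum() or c in ".,;:'\"!?()-" for c in chars)
    return ok / len(chars)


def run_qc_loop(
    source: str,
    image: Any,
    extract: Callable[[Any], str],
    optimize: Callable[[Any], Any],
    threshold: float = 0.95,  # assumed "desired specification"
    max_passes: int = 5,      # assumed safety limit to guarantee termination
) -> Tuple[str, AuditRecord]:
    """Extract text, test it, and optimize/re-extract until it passes."""
    original = extract(image)
    text = original
    scores = [quality_score(text)]
    passes = 0
    while scores[-1] < threshold and passes < max_passes:
        image = optimize(image)   # conditional optimization step
        text = extract(image)     # re-perform text extraction
        scores.append(quality_score(text))
        passes += 1
    diff = list(difflib.unified_diff(
        original.splitlines(), text.splitlines(),
        fromfile="original", tofile="optimized", lineterm="",
    ))
    return text, AuditRecord(source, passes, scores, diff)


# Stand-ins for a real OCR engine and image enhancer (illustration only).
def fake_extract(image: dict) -> str:
    # Each unit of "noise" shows up as one junk glyph in the extracted text.
    return "Invoice total: 42" + "~" * image["noise"]


def fake_optimize(image: dict) -> dict:
    # Pretend each optimization pass removes two units of noise.
    return {"noise": max(0, image["noise"] - 2)}


if __name__ == "__main__":
    text, audit = run_qc_loop("scan_001.pdf", {"noise": 4},
                              fake_extract, fake_optimize)
    print(text)
    print(audit.passes, audit.scores)
    print("\n".join(audit.diff))
```

Bounding the loop with `max_passes` is a design choice made for this sketch so that a document that never reaches the threshold still terminates and produces an audit record.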