Abstract: A computer-implemented method for extracting meaningful text from a document of unknown or unspecified format. In a particular embodiment, the method includes reading the document, thereby to extract raw encoded text; analysing the raw encoded text, thereby to identify one or more text chunks; and, for a given chunk, performing compression identification analysis to determine whether compression is likely. In the event that compression is likely, the method can further include performing a decompression process. The method can also include performing an encoding identification process, thereby to identify a likely character encoding protocol, and converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
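The per-chunk steps named in the abstract (compression identification, optional decompression, encoding identification, conversion to readable text) can be illustrated with a minimal Python sketch. This is not the claimed method: the function names are hypothetical, zlib stands in for whatever compression schemes the method actually considers, and the encoding guess is a simple BOM check with a UTF-8 attempt and a Latin-1 fallback. Chunk identification is assumed to have already happened, so the sketch takes a list of byte chunks as input.

```python
import zlib


def looks_compressed(chunk: bytes) -> bool:
    """Heuristic check (assumption: zlib only) based on the two-byte zlib header."""
    if len(chunk) < 2:
        return False
    cmf, flg = chunk[0], chunk[1]
    # Low nibble 8 means "deflate"; the 16-bit header must be a multiple of 31.
    return (cmf & 0x0F) == 8 and (cmf >> 4) <= 7 and ((cmf << 8) | flg) % 31 == 0


def maybe_decompress(chunk: bytes) -> bytes:
    """Decompress only if compression looks likely; otherwise return the chunk as-is."""
    if not looks_compressed(chunk):
        return chunk
    try:
        return zlib.decompress(chunk)
    except zlib.error:
        # The header check was a false positive; treat the chunk as uncompressed.
        return chunk


def guess_encoding(chunk: bytes) -> str:
    """Pick a likely character encoding (illustrative heuristic, not exhaustive)."""
    if chunk.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if chunk.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    try:
        chunk.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so it never fails to decode.
        return "latin-1"


def chunks_to_text(chunks: list[bytes]) -> list[str]:
    """Run each chunk through decompression, encoding detection, and conversion."""
    texts = []
    for chunk in chunks:
        data = maybe_decompress(chunk)
        texts.append(data.decode(guess_encoding(data)))
    return texts
```

A production system would likely need statistical encoding detection and support for more compression formats; the sketch only shows how the steps chain together for a single chunk.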
Type: Application
Filed: March 22, 2012
Publication date: March 28, 2013
Applicant: ISYS Search Software Pty Ltd.
Inventors: Scott Coles, Derek Murphy, Ben Truscott, Ian Davies