Abstract: Data files are associated with categories by processing said data files in combination with outline files. Large files (221) are divided into a plurality of file sections (223) each having a size substantially consistent with a preferred size. Each of the file sections is categorised (224) and the sets of associations are processed (225) to produce a set of category associations for the original undivided file (221).
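The divide, categorise and combine flow described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `split_into_sections`, `categorise_file` and `categorise_section` are hypothetical, a file is modelled as a string, the outline files are hidden inside the caller-supplied `categorise_section` callable, and the length-weighted mean used to combine the per-section associations is an assumed rule that the abstract does not specify.

```python
from collections import defaultdict


def split_into_sections(text, preferred_size):
    """Divide text into sections whose sizes are close to preferred_size."""
    if preferred_size <= 0:
        raise ValueError("preferred_size must be positive")
    if len(text) <= preferred_size:
        return [text]  # small files are left undivided
    # Pick the section count that brings each section closest to the preferred size.
    count = max(1, round(len(text) / preferred_size))
    base, extra = divmod(len(text), count)
    sections, start = [], 0
    for i in range(count):
        # The first `extra` sections absorb one leftover character each.
        end = start + base + (1 if i < extra else 0)
        sections.append(text[start:end])
        start = end
    return sections


def categorise_file(text, preferred_size, categorise_section):
    """Categorise each section, then merge the results for the whole file.

    categorise_section is a hypothetical callable returning {category: score}.
    Merging by length-weighted mean is an assumption, not the source's rule.
    """
    if not text:
        return {}
    totals = defaultdict(float)
    for section in split_into_sections(text, preferred_size):
        for category, score in categorise_section(section).items():
            totals[category] += score * len(section)
    return {category: total / len(text) for category, total in totals.items()}
```

Any other combination rule (for example, taking the maximum score per category) could be substituted in `categorise_file` without changing the splitting step.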
Abstract: Documents are classified into one or more clusters corresponding to predefined classification categories by building a knowledge base comprising matrices of vectors which indicate the significance of terms within a corpus of text formed by the documents and classified in the knowledge base to each cluster. The significance of terms is determined assuming a standard normal probability distribution, and terms are determined to be significant to a cluster if their probability of occurrence being due to chance is low. For each cluster, statistical signatures comprising sums of weighted products and intersections of cluster terms to corpus terms are generated and used as discriminators for classifying documents. The knowledge base is built using prefix and suffix lexical rules which are context-sensitive and applied selectively to improve the accuracy and precision of classification.
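One possible reading of the significance test in this abstract is a z-score on term counts, sketched below. The abstract states only that a standard normal distribution is assumed and that a term is significant when its occurrence is unlikely to be due to chance. The normal approximation to the binomial, the one-sided test, the `alpha` threshold and the function name `significant_terms` are all assumptions made for illustration. The statistical signatures and the lexical rules are not modelled.

```python
import math
from collections import Counter


def significant_terms(cluster_tokens, corpus_tokens, alpha=0.01):
    """Return {term: z} for terms over-represented in the cluster.

    Assumption: a term's cluster count is compared with its expected count
    under the corpus-wide rate, using a normal approximation to the binomial.
    """
    n, total = len(cluster_tokens), len(corpus_tokens)
    if n == 0 or total == 0:
        return {}
    cluster_counts = Counter(cluster_tokens)
    corpus_counts = Counter(corpus_tokens)
    result = {}
    for term, k in cluster_counts.items():
        p = corpus_counts[term] / total  # corpus-wide rate of the term
        if p <= 0.0 or p >= 1.0:
            continue  # term absent from the corpus, or zero variance
        z = (k - n * p) / math.sqrt(n * p * (1.0 - p))
        # Upper-tail probability of a standard normal variable exceeding z.
        p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
        if p_value < alpha:
            result[term] = z
    return result
```

A lower `alpha` keeps fewer terms per cluster, trading recall for precision.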
Abstract: Structured graphical data is reorganised. The data, which may be defined in accordance with portable document format (PDF), includes graphical object definitions and references to said definitions. The data is reorganised so that the graphical object references are preceded by their respective object definitions.
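The reordering in this abstract resembles a dependency ordering, which could be sketched as a depth-first traversal. The `(object_id, referenced_ids)` tuple model and the function name `definitions_first` are hypothetical. Real PDF data would first need to be parsed into such a structure, and the abstract does not say how circular references are treated, so this sketch simply rejects them.

```python
def definitions_first(objects):
    """Reorder (object_id, referenced_ids) pairs so definitions precede uses.

    The tuple model is hypothetical; it stands in for parsed object data.
    Returns the object ids in an order where every referenced object that is
    defined in the input appears before the object that refers to it.
    """
    refs_by_id = dict(objects)
    ordered, done, in_progress = [], set(), set()

    def visit(obj_id):
        if obj_id in done or obj_id not in refs_by_id:
            return  # already emitted, or referenced but not defined here
        if obj_id in in_progress:
            raise ValueError(f"circular reference involving {obj_id!r}")
        in_progress.add(obj_id)
        for ref in refs_by_id[obj_id]:
            visit(ref)  # emit everything this object refers to first
        in_progress.discard(obj_id)
        done.add(obj_id)
        ordered.append(obj_id)

    for obj_id, _ in objects:
        visit(obj_id)
    return ordered
```

With this ordering a consumer can process the data in a single forward pass, because each definition has been seen by the time a reference to it is reached.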