Abstract: The invention provides a method, apparatus and system for classifying and clustering electronic data streams, such as email, images and sound files, for identification, sorting and efficient storage. The method further utilizes learning machines in combination with hashing schemes to cluster and classify documents. In one embodiment, hash apparatuses and methods taxonomize clusters. In another embodiment, clusters of documents utilize a geometric hash to contain the documents in a data corpus without the overhead of search and storage.
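As a rough illustration of how a hashing scheme can group similar documents, the following is a minimal sketch only. It swaps in SimHash (a well-known locality-sensitive fingerprint) with a greedy grouping step; the abstract does not say which hash or learning machine the invention uses, and the names `token_hash`, `simhash`, `hamming` and `cluster`, as well as the `radius` threshold, are assumptions made for this example.

```python
import hashlib


def token_hash(token: str, bits: int = 64) -> int:
    """Stable hash of one token, truncated to `bits` bits (assumed helper)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint: similar token sets tend to give nearby bit strings."""
    weights = [0] * bits
    for token in text.lower().split():
        h = token_hash(token, bits)
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cluster(docs: dict, bits: int = 64, radius: int = 8) -> list:
    """Greedy grouping: a document joins the first cluster whose
    representative fingerprint lies within `radius` bits, else starts a new one."""
    clusters = []  # list of (representative_fingerprint, [doc_ids])
    for doc_id, text in docs.items():
        fp = simhash(text, bits)
        for rep, members in clusters:
            if hamming(rep, fp) <= radius:
                members.append(doc_id)
                break
        else:
            clusters.append((fp, [doc_id]))
    return [members for _, members in clusters]
```

Calling `cluster({"m1": "meeting at noon", "m2": "meeting at noon"}, radius=0)` places both identifiers in one group, since identical text always yields the same fingerprint. Larger `radius` values trade precision for recall, and a suitable value would have to be tuned on real data.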
Abstract: A method of optimizing regular expressions includes determining an optimized form for the regular expressions and presenting the optimized forms to a user in a source-level representation. A system for authoring regular expressions is provided, including a user interface that enables a user to author a regular expression defining a particular text pattern. The user interface also enables the user to specify a target data set and a matching algorithm to be used with the regular expression. An optimizer implements transformation rules, and processes for applying those rules to an authored regular expression, to generate an optimized regular expression presented in a source-level representation. The optimizer may select an alternate preferred pattern matching algorithm and an alternate preferred data source, adjusting the pattern accordingly.
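To make the idea of source-level transformation rules concrete, here is a minimal sketch under stated assumptions. The two rewrite rules, the names `RULES`, `same_matches` and `optimize`, and the choice to validate each rewrite against sample strings from a target data set are all hypothetical; the abstract does not disclose the actual rules. The rules operate on the pattern text and ignore some edge cases (for example, content inside character classes), which is why every rewrite is checked before it is accepted.

```python
import re

# Hypothetical rewrite rules: (name, regex over the pattern source, replacement).
RULES = [
    # (?:a|b|c) -> [abc], for single alphanumeric alternatives only.
    ("alternation-to-class",
     re.compile(r"(?<!\\)\(\?:([A-Za-z0-9](?:\|[A-Za-z0-9])+)\)"),
     lambda m: "[" + m.group(1).replace("|", "") + "]"),
    # xx* -> x+, for a single unescaped alphanumeric character.
    ("repeat-to-plus",
     re.compile(r"(?<!\\)([A-Za-z0-9])\1\*"),
     lambda m: m.group(1) + "+"),
]


def same_matches(a: str, b: str, samples: list) -> bool:
    """True if patterns a and b find identical match spans in every sample."""
    ra, rb = re.compile(a), re.compile(b)
    return all(
        [m.span() for m in ra.finditer(s)] == [m.span() for m in rb.finditer(s)]
        for s in samples
    )


def optimize(pattern: str, samples: list):
    """Apply each rule once, keeping a rewrite only if it behaves the same
    as the previous form on the sample data. Returns the rewritten pattern
    (still ordinary regex source) and the names of the rules applied."""
    applied = []
    current = pattern
    for name, rule, repl in RULES:
        candidate = rule.sub(repl, current)
        if candidate != current and same_matches(current, candidate, samples):
            current = candidate
            applied.append(name)
    return current, applied
```

For example, `optimize(r"(?:a|b|c)xx*", ["axx", "bx", "zzz", "cxxxx"])` returns `("[abc]x+", ["alternation-to-class", "repeat-to-plus"])`. Because the result is still regex source text, it can be shown to the author directly, which is the point of a source-level representation. Agreement on samples is evidence of equivalence, not proof.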