SYSTEM AND METHODS FOR COMPUTERIZED INFORMATION GOVERNANCE OF ELECTRONIC DOCUMENTS

- EQUIVIO LTD.

An information governance system comprising a plurality of classifiers which employ cutoffs for classifying at least a portion of a population of incoming documents as documents to be retained and documents to be discarded in accordance with a corresponding plurality of pre-defined retention schedules; training apparatus for training said classifiers based on relevance inputs provided by a human information governance expert regarding a training set of documents within a universe of documents to be governed; and apparatus operative to automatically cause any classified document to be retained and subsequently discarded in accordance with its pre-defined retention schedule including discarding only documents that (a) have been classified as documents to be discarded and (b) have not been classified as documents to be retained, and to automatically cause any document which could not be classified, to be retained as gray area data until further notice.

Description
FIELD OF THIS DISCLOSURE

The present invention relates generally to computerized processing of electronic documents and more particularly to implementing document retention policies on electronic documents.

BACKGROUND FOR THIS DISCLOSURE

US20070283410, assigned to IBM, describes a system for managing data located on networked devices, comprising an information manager for replicating objects residing on the devices, collecting information about at least one of the objects or the devices, and receiving input on desired information governance policies and outcomes; an information analyzer for analyzing the replicated objects; and an action module for determining an information governance action based on the collected information, the received input, and the analysis of the replicated objects.

USSN 20120215749 describes prior art policy development software applications which “focus only on record retention and disposition policies in a single jurisdiction. Entities seeking to manage their records in complex and multijurisdictional environments may need to limit their IG programs to a single, main operating jurisdiction and to only retention and disposition policies. Such limitations may result in some aspects of IG other than retention and disposition, and other jurisdictions, to remain outside the scope of governance programs.”

RSD GLASS® Policy Manager claims to be the first information governance policy engine to break through various constraints and, regardless of whether “you have a very complex organizational structure, broadly distributed content repositories, or a highly sophisticated information classification plan”, to provide “the policy creation and management collaboration you need to ensure consistent compliance with laws, regulations, and business requirements”.

Digital Thread technology claims to use “patented algorithms” to allow users to track and manage enterprise documents moving through disparate systems including email, hard drives, and shared drives. Also, in 2012, Proofpoint announced its acquisition of NextPage and an information governance solution based on the NextPage technology.

Nuix claims to provide “proactive management of data as part of an information governance strategy, aimed at uncovering business value as well as for the purely reactive business of giving discovery” and has indicated in statements that “the US patent office has recognised that Nuix has core functions which are unique and is due to issue a patent in respect of those functions.”

The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference. Materiality of such publications and patent documents to patentability is not conceded.

SUMMARY OF CERTAIN EMBODIMENTS

Certain embodiments of the present invention seek to provide apparatus for computerized information governance of electronic documents operative for computerized determination of document relevance to one or more document retention categories.

Certain embodiments of the present invention seek to manage the trade-off between junk (a.k.a. deletable, to-be-deleted, and similar) data and categories (such as financial reports, loan requests, regulations, HR) which need to be retained for a predefined period. Documents may fall into one or more of: categories to be retained, gray area, and junk. For the categories to be retained and for junk, the precision may be computed/estimated, thereby to compute and hence control false-positive rates, i.e. the proportion of documents wrongly classified as falling within a Junk category.

Certain embodiments of the present invention seek to provide Quality Assurance (also termed herein QC or quality control) including use of logarithmic stratified sampling to compute statistical metrics (such as recall and precision) in cases of low or unknown richness. Specifically, rather than assessing by measuring recall and precision based on a random sample from the input documents, the QA process may order the documents by their ranks, then partition the ranks into: [0,p], [p, 2p], [2p, 4p], . . . (where p is some small percentile, for example 0.01%), randomly selecting documents from each slice. An advantage is the ability to compute precision and estimate richness even when richness is very small. For example, for a richness of 0.4%, only 4 documents are expected to be relevant when 1000 documents are sampled at random.
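By way of illustration, the following minimal Python sketch expresses such logarithmic stratified sampling, assuming a list of document identifiers already sorted from highest to lowest classifier rank; the function names, the per-slice sample size, and the default p=0.01% are illustrative assumptions rather than features of any particular embodiment:

    import random

    def logarithmic_strata(num_docs, p=0.0001):
        # Slice boundaries at 0, p, 2p, 4p, 8p, ... of the collection,
        # expressed as document counts.
        bounds, lo, hi_frac = [], 0, p
        while lo < num_docs:
            hi = min(num_docs, max(lo + 1, int(num_docs * hi_frac)))
            bounds.append((lo, hi))
            lo, hi_frac = hi, hi_frac * 2
        return bounds

    def sample_for_qc(ranked_doc_ids, per_slice=20, p=0.0001, seed=7):
        # Draw per_slice random documents from every logarithmic slice,
        # so that even very low-richness strata are represented.
        rng = random.Random(seed)
        sample = []
        for lo, hi in logarithmic_strata(len(ranked_doc_ids), p):
            slice_ids = ranked_doc_ids[lo:hi]
            sample.extend(rng.sample(slice_ids, min(per_slice, len(slice_ids))))
        return sample

In this way, a low-richness stratum near the top of the ranking contributes as many QC documents as any other stratum, whereas a uniform random sample of 1000 documents at 0.4% richness would be expected to contain only 4 relevant ones.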

Certain embodiments of the present invention seek to provide Stability criteria for low richness scenarios. Stability measures cannot be computed based on assessment since assessment typically is not practical in information governance due to low richness. Stability may however be based on Cross validation and the percentage of R (relevant) documents in the training set.

Certain embodiments of the present invention seek to provide a Cutoff based on Precision rather than, say, f-measure which is hard to compute accurately since Recall is hard to estimate when the Richness is small. In a quality control step, precision of each of several cutoffs may be estimated.

Certain embodiments of the present invention seek to use human-trained computerized classifiers (trained on the basis of training set/s of documents whose relevance to each of various information governance categories is evaluated by a human expert) to determine each document's information governance category.

The present invention typically includes at least the following embodiments:

Embodiment 1

An information governance system comprising:

A plurality of classifiers which employ cutoffs for classifying at least a portion of a population of incoming documents as documents to be retained and documents to be discarded in accordance with a retention policy comprising a corresponding plurality of pre-defined retention schedules;

training apparatus for training said classifiers based on relevance inputs provided by a human information governance expert regarding a training set of documents within a universe of documents to be governed; and

retain/discard apparatus operative to automatically cause any classified document to be retained and subsequently discarded in accordance with its pre-defined retention schedule including discarding only documents that (a) have been classified as documents to be discarded and (b) have not been classified as documents to be retained, and to automatically cause any document which could not be classified, to be retained as gray area data until further notice.

Embodiment 2

A system according to Embodiment 1 and also comprising computerized apparatus for identifying at least one cluster of related documents within a set of “gray area” documents which could not be classified.

Embodiment 3

A system according to Embodiment 1 wherein at least one of the plurality of pre-defined retention schedules calls for documents to be discarded immediately.

Embodiment 4

A system according to Embodiment 1 wherein the training apparatus is operative to train each of said classifiers until a predetermined precision measure has been achieved.

Embodiment 5

A system according to Embodiment 1 and also comprising threshold adjustment functionality operative to quantify a false-negative error rate, resulting in premature discarding of documents, and, if excessive, to adjust at least one threshold employed by said classifiers accordingly.

Embodiment 6

A system according to Embodiment 5 wherein a false-negative error rate is deemed excessive if a pre-stored human categorizer's false-negative error rate is lower.

This error rate may be culled from professional literature. For example, Forbes.com has stated that “69 percent of information in most companies has no business, legal or regulatory value.”

Embodiment 7

A system according to Embodiment 1 and also comprising apparatus for identifying and discarding older near-duplicates of at least one retained document.

Embodiment 8

An information governance method comprising:

generating a plurality of classifiers for classifying electronic documents into a corresponding plurality of documentation retention categories;

running training iterations thereby to improve at least one of the plurality of classifiers;

classifying a repository of electronic documents using said plurality of classifiers and running a Logarithmic stratified sampling-based Quality Assurance process to compute precision in cases of low or unknown richness including ordering documents by their ranks then partitioning the ranks into slices: [0,p], [p, 2p], [2p, 4p], . . . , and randomly selecting documents to represent each slice, thereby to generate Quality Assurance results;

if the Quality Assurance results are not deemed good enough, improve the classifier and return to one of said running steps;

if the Quality Assurance results are good enough, use last classifier to implement a plurality of document retention settings corresponding to said plurality of documentation retention categories.

Embodiment 9

A method according to Embodiment 8 and also comprising:

using a processor for identifying at least one cluster of related documents within a set of “gray area” documents which could not be classified;

generating at least one additional classifier for classifying electronic documents into each of at least one cluster of related documents;

repeating said classifying step using both the plurality of classifiers and the at least one additional classifier thereby to reduce percentage of “gray area” documents; and

implementing document retention settings including:

    • settings corresponding to said plurality of documentation retention categories and
    • settings corresponding to each of said at least one cluster of related documents.

Embodiment 10

A method according to Embodiment 8 wherein for each individual classifier, stability measures are computed based on cross validation and percentage of relevant documents in the individual classifier's training set.

Embodiment 11

A method according to Embodiment 9 wherein said using a processor for identifying comprises using Equivio Themes functionality for identifying at least one cluster of related documents within a set of “gray area” documents which could not be classified.

Embodiment 12

A system according to Embodiment 1 and also comprising a rule repository operative to map said retention policy to retention time, and wherein rules in said repository are accessed by said retain/discard apparatus.

Embodiment 13

A method according to Embodiment 8 wherein if the Quality Assurance results are not deemed good enough for an individual classifier, the individual classifier is improved by adding documents used in said quality assurance process to the individual classifier's training set thereby to generate an expanded training set and re-training the individual classifier using the expanded training set.

Embodiment 14

A method according to Embodiment 8 wherein said plurality of documentation retention categories includes at least one category of documents to be retained and at least one category of documents to be immediately discarded.

Embodiment 15

A method according to Embodiment 14 wherein for each classifier from among said plurality of classifiers which corresponds to a category of documents to be retained, high and low cutoff points are set.

Embodiment 16

A method according to Embodiment 15 wherein for each classifier from among said plurality of classifiers which corresponds to a category of documents to be immediately discarded, just one cutoff is set.

Embodiment 17

A method according to Embodiment 15 wherein an individual document is:

retained if the individual document's relevance exceeds the high cutoff of a retention category to which said individual document belongs, and

discarded if the document's relevance both falls above a cutoff of a category of documents to be immediately discarded to which the individual document belongs and falls below all low cutoff points of all retention categories.

Embodiment 18

A method according to Embodiment 17 wherein the individual document is retained as a gray area document if the individual document is not discarded and if the individual document falls below the high cutoff of the retention category to which said individual document belongs.

Embodiment 19

A computer program product, comprising a non-transitory tangible computer readable medium having computer readable program code embodied therein, said computer readable program code adapted to be executed to implement an information governance method comprising:

generating a plurality of classifiers for classifying electronic documents into a corresponding plurality of documentation retention categories;

running training iterations thereby to improve at least one of the plurality of classifiers;

classifying a repository of electronic documents using said plurality of classifiers and running a Logarithmic stratified sampling-based Quality Assurance process to compute precision in cases of low or unknown richness including ordering documents by their ranks then partitioning the ranks into slices: [0,p], [p, 2p], [2p, 4p], . . . , and randomly selecting documents to represent each slice, thereby to generate Quality Assurance results;

if the Quality Assurance results are not deemed good enough, improve the classifier and return to one of said running steps;

if the Quality Assurance results are good enough, use last classifier to implement a plurality of document retention settings corresponding to said plurality of documentation retention categories.

Also provided, excluding signals, is a computer program comprising computer program code means for performing any of the methods shown and described herein when said program is run on a computer; and a computer program product, comprising a typically non-transitory computer-usable or -readable medium e.g. non-transitory computer-usable or -readable storage medium, typically tangible, having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement any or all of the methods shown and described herein. It is appreciated that any or all of the computational steps shown and described herein may be computer-implemented. The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general purpose computer specially configured for the desired purpose by a computer program stored in a typically non-transitory computer readable storage medium. The term “non-transitory” is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.

Any suitable processor, display and input means may be used to process, display e.g. on a computer screen or other computer output device, store, and accept information such as information used by or generated by any of the methods and apparatus shown and described herein; the above processor, display and input means including computer programs, in accordance with some or all of the embodiments of the present invention. Any or all functionalities of the invention shown and described herein, such as but not limited to steps of flowcharts, may be performed by a conventional personal computer processor, workstation or other programmable device or computer or electronic computing device or processor, either general-purpose or specifically constructed, used for processing; a computer display screen and/or printer and/or speaker for displaying; machine-readable memory such as optical disks, CDROMs, DVDs, BluRays, magnetic-optical discs or other discs; RAMs, ROMs, EPROMs, EEPROMs, magnetic or optical or other cards, for storing, and keyboard or mouse for accepting. The term “process” as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g. electronic, phenomena which may occur or reside e.g. within registers and/or memories of a computer or processor. The term processor includes a single processing unit or a plurality of distributed or remote such units.

The above devices may communicate via any conventional wired or wireless digital communication means, e.g. via a wired or cellular telephone network or a computer network such as the Internet.

The apparatus of the present invention may include, according to certain embodiments of the invention, machine readable memory containing or otherwise storing a program of instructions which, when executed by the machine, implements some or all of the apparatus, methods, features and functionalities of the invention shown and described herein. Alternatively or in addition, the apparatus of the present invention may include, according to certain embodiments of the invention, a program as above which may be written in any conventional programming language, and optionally a machine for executing the program such as but not limited to a general purpose computer which may optionally be configured or activated in accordance with the teachings of the present invention. Any of the teachings incorporated herein may, wherever suitable, operate on signals representative of physical objects or substances.

The embodiments referred to above, and other embodiments, are described in detail in the next section.

Any trademark occurring in the text or drawings is the property of its owner and occurs herein merely to explain or illustrate one example of how an embodiment of the invention may be implemented.

Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions, utilizing terms such as, “processing”, “computing”, “estimating”, “selecting”, “ranking”, “grading”, “calculating”, “determining”, “generating”, “reassessing”, “classifying”, “producing”, “stereo-matching”, “registering”, “detecting”, “associating”, “superimposing”, “obtaining” or the like, refer to the action and/or processes of a computer or computing system, or processor or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories, into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices. The term “computer” should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, computing systems, communication devices, processors (e.g. digital signal processor (DSP), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.) and other electronic computing devices.

The present invention may be described, merely for clarity, in terms of terminology specific to particular programming languages, operating systems, browsers, system versions, individual products, and the like. It will be appreciated that this terminology is intended to convey general principles of operation clearly and briefly, by way of example, and is not intended to limit the scope of the invention to any particular programming language, operating system, browser, system version, or individual product.

Elements separately listed herein need not be distinct components and alternatively may be the same structure.

Any suitable input device, such as but not limited to a sensor, may be used to generate or otherwise provide information received by the apparatus and methods shown and described herein. Any suitable output device or display may be used to display or output information generated by the apparatus and methods shown and described herein. Any suitable processor may be employed to compute or generate information as described herein e.g. by providing one or more modules in the processor to perform functionalities described herein. Any suitable computerized data storage e.g. computer memory may be used to store information received by or generated by the systems shown and described herein. Functionalities shown and described herein may be divided between a server computer and a plurality of client computers. These or any other computerized components shown and described herein may communicate between themselves via a suitable computer network.

The term cutoff refers to a rank suitable for partitioning a set of documents ranked for relevance into two subsets: relevant (above the cutoff) and non-relevant (below the cutoff). More than one cutoff may be provided if it is desired to have more and less lenient criteria for relevance.

Richness refers to a percentage of relevant documents in a collection of electronic documents to be classified.

Stability refers to a state of training of a classifier, in which adding more documents to the training set does not yield substantial improvement of the classifier's quality measure (e.g. f-measure, precision, accuracy).

BRIEF DESCRIPTION OF THE DRAWINGS

Certain embodiments of the present invention are illustrated in the following drawings:

FIG. 1 is a computerized information governance method provided in accordance with one embodiment of the present invention.

FIG. 2 is a graph of cutoff points suitable for executing computerized information governance methods described herein, in accordance with certain embodiments of the present invention.

FIGS. 3a-3c, taken together, form a simplified flowchart illustration of a computerized information governance method provided in accordance with another embodiment of the present invention which may be a particular instance of the method of FIG. 1.

Computational components described and illustrated herein can be implemented in various forms, for example, as hardware circuits such as but not limited to custom VLSI circuits or gate arrays or programmable hardware devices such as but not limited to FPGAs, or as software program code stored on at least one tangible or intangible computer readable medium and executable by at least one processor, or any suitable combination thereof. A specific functional component may be formed by one particular sequence of software code, or by a plurality of such, which collectively act or behave or act as described herein with reference to the functional component in question. For example, the component may be distributed over several code sequences such as but not limited to objects, procedures, functions, routines and programs and may originate from several computer files which typically operate synergistically.

Data can be stored on one or more tangible or intangible computer readable media stored at one or more different locations, different network nodes or different storage devices at a single node or location.

It is appreciated that any computer data storage technology, including any type of storage or memory and any type of computer components and recording media that retain digital data used for computing for an interval of time, and any type of information retention technology, may be used to store the various data provided and employed herein. Suitable computer data storage or information retention apparatus may include apparatus which is primary, secondary, tertiary or off-line; which is of any type or level or amount or category of volatility, differentiation, mutability, accessibility, addressability, capacity, performance and energy use; and which is based on any suitable technologies such as semiconductor, magnetic, optical, paper and others.

DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS

A computerized information governance system is provided which includes some or all of: a stored plurality of rules that map document retention policy to retention time; a plurality of categories used in the retention policy; a corresponding plurality of classifiers trained for those categories respectively; and a method which includes (a) a quality control process applied to the classifiers until the classifiers satisfy some condition, where, if the condition is not satisfied for an individual classifier, the individual classifier may be re-trained e.g. using some of the QC documents; (b) for each classifier corresponding to a category of documents to be retained, high and low cutoff points are set, as opposed to documents to be immediately discarded, for which just one cutoff may be set; (c) assign a document to an individual retention category if the document's relevance exceeds the category's high cutoff; and (d) assign a document to an individual Junk category if the document's relevance is above the Junk cutoff and below all low cutoff points of all retention categories.
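For concreteness, the following minimal Python sketch, under assumed data structures (per-document relevance ranks keyed by category, a (low, high) cutoff pair per retention category, and a single cutoff per Junk category), expresses rules (c) and (d) together with gray-area retention; it is an illustration, not a definitive implementation:

    def assign_disposition(doc_ranks, retain_cutoffs, junk_cutoffs):
        # doc_ranks: {category: relevance rank in [0, 1]} for one document.
        # retain_cutoffs: {category: (low, high)}; junk_cutoffs: {category: cutoff}.
        # Rule (c): retain if relevance exceeds the high cutoff of any
        # retention category.
        if any(doc_ranks.get(cat, 0.0) > high
               for cat, (low, high) in retain_cutoffs.items()):
            return "RETAIN"
        # Rule (d): discard only if above some Junk cutoff AND below every
        # retention category's low cutoff.
        above_junk = any(doc_ranks.get(cat, 0.0) > cut
                         for cat, cut in junk_cutoffs.items())
        below_all_low = all(doc_ranks.get(cat, 0.0) < low
                            for cat, (low, high) in retain_cutoffs.items())
        if above_junk and below_all_low:
            return "DISCARD"
        # Otherwise the document is retained as gray-area data until further notice.
        return "GRAY_AREA"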

FIG. 1 is a computerized information governance method provided in accordance with one embodiment of the present invention. The method may include some or all of the following steps, suitably ordered e.g. as shown:

10. generate classifier/s e.g. conventionally or as per “method 23” herein for determining relevance of documents to be governed, to respective document retention category/ies

20. Run training iterations, e.g. conventionally or as per “method 23” herein thereby to improve the classifier

30. Run a QC round. A (typically human) expert tags documents that were selected using the following method: sampling documents for Quality Assurance (also termed herein QC or quality control) may include logarithmic stratified sampling to compute precision in cases of low or unknown richness, e.g. order the documents by their ranks, then partition the ranks into: [0,p], [p, 2p], [2p, 4p], . . . (where p is some small percentile, for example 0.01%), randomly selecting documents for each slice.

40. Compute QC results by comparing the expert's tags with the classifier's results; the precision of each of several cutoffs may be estimated.

50. If the QC results computed in step 40 are deemed not good enough, typically due to failure to clear one or more predetermined hurdles, improve (e.g. re-train) the classifier e.g. using the tagged QC documents and return to step 30 (another QC round) or to step 20 (more training)

60. Otherwise, if the QC results are good enough, terminate classifier generation process and use last classifier to enforce document retention regime in accordance with given information governance policy

Typically, relevance in this connection refers to applicability of each of the documents to at least one document retention category from among typically 100, 200 or more such categories.

FIGS. 3a-3c, taken together, form a simplified flowchart illustration of a computerized information governance method provided in accordance with another embodiment of the present invention. The method may include some or all of the following steps, suitably ordered e.g. as shown:

305: Set up: provide a population of documents; define a document retention regime in accordance with a given information governance policy; generate settings: set a retention period for each of multiple document types. For example: CVs from Europe, save 10 years; CVs from the US, save 5 years; audit documents, save forever.

307: Based on the settings, define classification categories for which to build classifiers, e.g. CV and Audit. Country of origin (Europe, US) may be available as metadata; more generally, classification categories are sometimes not identical to the settings in the retention policy, due to the use of available metadata to partially define settings, e.g. as in the example above, in which some settings are a logical combination of metadata values and the classification category of a document.

Example: retention policy states that “international emails that were sent to management must be retained for 5 years”. To enforce this retention policy:

a. create a classifier that identifies “international emails”, and test the classifier e.g. as described herein
b. identify “international emails” using the classifier generated in step a
c. from among the “international emails” identified in step b, use metadata (the “sent” field) to identify emails sent to management.
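A minimal Python sketch of this combination of a trained classifier with metadata follows; the intl_classifier object, its rank method, the sent_to field and the 0.5 cutoff are hypothetical placeholders, not elements of any particular embodiment:

    def emails_to_retain_5_years(docs, intl_classifier, cutoff=0.5):
        # docs: iterable of dicts with 'text' and 'sent_to' metadata.
        # intl_classifier.rank(text) -> relevance score in [0, 1].
        hits = []
        for doc in docs:
            if intl_classifier.rank(doc["text"]) > cutoff:       # steps a-b
                if "management" in doc.get("sent_to", []):       # step c
                    hits.append(doc)
        return hits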

310. generate N+M classifiers e.g. conventionally or as per “method 23” below, for computerized classification of the population of documents into N category/ies of documents to be retained and M category/ies of Junk documents (i.e. each of the classification categories identified in step 307).

320. For each of the N+M classifiers, run training iterations, e.g. conventionally or as per the Equivio Relevance functionality, commercially available from Equivio, and improve each classifier accordingly.

325. Stop training and proceed to a QC round (step 330) if classifier stability is observed. Criteria for stability, a combined sketch of which follows criterion c below, may include some or all of:

    • a. richness (e.g. % of relevant documents) in the training set, in excess of a predetermined threshold.
    • b. At each iteration, when the classifier is built, a list of “important features” is identified, including keywords or phrases whose appearance or prevalence in a document is indicative of document relevance/irrelevance. Each classifier, in each iteration, may generate a classifier-specific, iteration-specific list. When this list of keywords stops changing (e.g., when a difference between a classifier's list for the current iteration and the classifier's list for the previous iteration falls below a certain threshold), this is indicative of stability. Similarity may be determined using any suitable metric, such as but not limited to the % of words common to both lists. For example, if list 1 is {“red”, “yellow”, “blue”} and list 2 is {“red”, “green”, “blue”}, then the lists are 2/3=66% similar.

c. At each iteration, the classifier may be built using a possibly different (iteration-specific) internal configuration (e.g. SVM kernel; kernel methods, including support vector machine (SVM) kernel methods, are known for pattern analysis purposes). The classifier generation method typically automatically selects the configuration that works best. When this decision stops changing (e.g. for predetermined parameters X, Y, the current configuration was also selected in Y out of the last X previous iterations), this is indicative of stability. X and Y are any suitable parameters, for example: X=10 and Y=8.
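The following Python sketch combines the three criteria above; the similarity metric (% of words common to both lists), the 0.9 similarity threshold, and the default X=10, Y=8 are illustrative assumptions:

    def list_similarity(curr, prev):
        # E.g. {"red","yellow","blue"} vs {"red","green","blue"} -> 2/3.
        common = len(set(curr) & set(prev))
        return common / max(len(curr), len(prev), 1)

    def is_stable(richness, min_richness, feature_history, config_history,
                  sim_threshold=0.9, x=10, y=8):
        # (a) richness of the training set exceeds a predetermined threshold;
        # (b) the important-feature list has (nearly) stopped changing;
        # (c) the auto-selected internal configuration was chosen in at
        #     least y of the last x iterations.
        if richness < min_richness or len(feature_history) < 2:
            return False
        stable_features = list_similarity(feature_history[-1],
                                          feature_history[-2]) >= sim_threshold
        stable_config = config_history[-x:].count(config_history[-1]) >= y
        return stable_features and stable_config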

330. Run a QC (quality control) round for each classifier. Typically, the system generates a QC sample whose size may vary from a few hundred to a few thousand documents. A human expert tags the sample e.g. indicates which documents belong to which categories. The system compares the tags with the classifier results and presents QC results (e.g. precision value/s) and, optionally, recommended next steps e.g. either continue the process or stop, as described in step 350.

Wikipedia defines the precision for a class in a classification task as “the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. High recall means that an algorithm returned most of the relevant results, while high precision means that an algorithm returned substantially more relevant results than irrelevant. Precision is thus a measure of the amount of noise in document-retrieval.”

Example criteria to decide what to recommend, some or all of which may be employed, include:

    • i. Evaluation of overall performance including how successful the classifier was at identifying relevant documents. The evaluation may for example identify that the current precision/recall values fail to exceed a predetermined threshold.
    • ii. Monitor quality control results over time e.g. by comparing current QC results to results of previous QC/training rounds. If there is little or no improvement over time, the system may recommend that there is no point continuing. If there is significant improvement over time, the system may recommend continuing to improve e.g. train the classifier.
    • iii. Search for a good cutoff e.g. a cutoff point where the precision (or recall) drops dramatically indicating a “separation point” between relevant and non-relevant documents. If no such cutoff point is found, more training might be needed.

350. for each of the N+M classifiers, if the QC results are not good enough, perform step 360. Else, i.e. if QC results are good enough, do steps 365, 370a, 370b and 370c. Any suitable computerized method may be employed for determining whether or not QC results are good enough, e.g. using precision, as computed from the QC process, as a criterion. For example, there may be a threshold or cutoff score at which the classifier's precision measure begins to decrease dramatically. It is appreciated that the quality measure may be based on Recall rather than, or in addition to, precision. An example QC procedure takes a sample from a top (as ranked by classifier) 0.01% (say) of the documents, then samples from the next (say) 0.01%, then 0.02%, 0.04%, 0.08%, . . . , 10%. Typically due to small richness there will be a dramatic fall in precision.

360. If the QC results for any of the N+M classifiers are not deemed good enough, improve (e.g. retrain) the classifier using the QC results (e.g. add tagged documents from the QC round to the classifier's training set and/or use seeds) and return to step 330 (another QC round) or to step 320 (more training). Typically, the documents from the QC process are added to the training set and the classifier is re-trained using the thereby expanded training set. Typically although not necessarily, the classifier needs to be improved (e.g. trained or re-trained) either because there are not enough samples or because a topic is “hard to classify”.

365. Set different cutoffs for N>=1 category/ies of documents to be retained and for M>=1 category/ies of Junk documents to be discarded. Typically, for each classifier from among said plurality of classifiers which corresponds to a category of documents to be retained, high and low cutoff points are set. Cutoffs are based, typically, on QC results. For example, for each point sampled logarithmically, the precision may be known, and the cutoff may be selected from among the sampling points or by interpolating, e.g. linearly, therebetween, to yield a desired precision value.
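By way of example, the following Python sketch selects a cutoff yielding a desired precision by linear interpolation between logarithmically sampled points; the (rank cutoff, measured precision) pair layout is an assumption:

    def cutoff_for_precision(points, target):
        # points: [(rank_cutoff, precision)] measured at the sampled
        # points, sorted by rank_cutoff. Returns an interpolated cutoff
        # whose precision equals the target, or None if never reached.
        for (r1, p1), (r2, p2) in zip(points, points[1:]):
            if (p1 - target) * (p2 - target) <= 0:  # target crossed here
                if p1 == p2:
                    return r1
                t = (target - p1) / (p2 - p1)
                return r1 + t * (r2 - r1)
        return None

    # e.g. cutoff_for_precision([(0.99, 0.97), (0.95, 0.91), (0.90, 0.60)], 0.90)
    # interpolates between the second and third sampling points.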

370A. terminate classifier generation process and use last classifier generated by the classifier generation process to enforce the document retention regime by deleting each individual document classified as belonging to an individual category of junk only if the individual document does not also belong to any one of the category/ies of documents to be retained.

A particular advantage of this step, which ensures that documents classified as discardable junk but also classified as belonging to a non-junk category to be retained are not discarded, is that the precision of the junk determination is increased and the number of improper deletion errors is decreased.

370B. may identify and discard older near-duplicates of at least one retained document, using any suitable method for identification of “families” of near-duplicates. For example, methods of near duplicate identification may be conventional e.g. may employ Equivio near-duplicate functionality commercially available from Equivio and/or may be as described in Published PCT Application WO 2006/008733 and corresponding U.S. Pat. No. 8,015,124, both entitled “A Method for Determining Near Duplicate Data Objects”; and in Published PCT Application WO 2007/086059 and corresponding U.S. Pat. No. 8,391,614, both entitled “Determining Near Duplicate “Noisy” Data Objects”. Documents typically are associated with metadata indicating a time at which the document was most recently modified (“last modified time”). A pre-programmed policy for step 370B may be “always retain the latest file”, in which case the most recently modified file is retained and all or some older files are removed.
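A minimal Python sketch of the “always retain the latest file” policy over near-duplicate families follows; the equiset_id and last_modified fields are assumed to be supplied by the near-duplicate identification and by the document metadata, respectively:

    from collections import defaultdict

    def older_near_duplicates_to_discard(docs):
        # docs: dicts with 'doc_id', 'equiset_id' (near-duplicate family)
        # and 'last_modified'. Keep the newest member of each family and
        # return the ids of the older members slated for removal.
        families = defaultdict(list)
        for doc in docs:
            families[doc["equiset_id"]].append(doc)
        to_discard = []
        for members in families.values():
            newest = max(members, key=lambda d: d["last_modified"])
            to_discard += [d["doc_id"] for d in members if d is not newest]
        return to_discard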

370c. may use computerized “themes” method to explore a gray area of documents assigned to none of the above categories and to suggest candidate categories within the gray area. A user interface is typically provided enabling an information Governor user to view the candidate categories, identify a subset of these as useful document retention categories, and set a retention period for each category in the subset. New classifiers may be generated for each category in the subset. This is advantageous for diminishing the proportion of documents assigned to the undesirable gray area. Information Governor may alternatively or in addition create new categories or rules on the fly.

Any suitable process for building a classifier may be used herein e.g. for step 10 of FIG. 1. One suitable process may utilize commercially available Equivio Relevance technology and/or may employ the following “method 23”, which is an electronic document analysis method receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue, and establishing relevance of at least the N documents to at least one individual issue in the set of issues. Method 23 typically includes some or all of the following steps 3010, 3020, etc., suitably ordered e.g. as follows, for at least one individual issue from among the set of issues:

Step 3010: receive an output of a categorization process applied to each document in training and control subsets of the N documents, the process optionally having been performed by a human operator, the output including, for each document in the subsets, one of a relevant-to-the-individual issue indication and a non-relevant-to-the-individual issue indication.

Step 3020: build a text classifier simulating the categorization process using the output for all documents in the training subset of documents.

Step 3030: evaluate the quality of the text classifier based on the output for all documents in the control subset of documents.

Step 3040: run the text classifier on the N documents thereby to obtain a ranking of the extent of relevance of each of the N documents to the individual issue.

Step 3050: partition the N documents into uniformly ranked subsets of documents, the uniformly ranked subsets differing in ranking of their member documents by the text classifier and adding more documents from each of the uniformly ranked subsets to the training subset.

Step 3060: order the documents in the control subset in an order determined by the rankings obtained by running the text classifier.

Step 3070: select a rank e.g. document in the control subset which, when used as a cutoff point for binarizing the rankings in the control subset, maximizes a quality criterion.

Step 3080: using the cutoff point, compute and store at least one quality criterion e.g. F-measure characterizing the binarizing of the rankings of the documents in the control subset, thereby to define a quality of performance indication of a current iteration I.

Step 3090: display a comparison of the quality of performance indication of the current iteration I to quality of performance indications of previous iterations e.g. by generating at least one graph of at least one quality criterion vs. iteration serial number.

Step 3100: seek an input (e.g. a user input received from a human user and/or a computerized input including a computerized indication of flatness of the graph of at least one quality criterion vs. iteration serial number) as to whether or not to return to the receiving step thereby to initiate a new iteration I+1 which comprises the receiving, building, running, partitioning, ordering, selecting, and computing/storing steps and initiate the new iteration I+1 if and only if so indicated by the input, wherein the iteration I+1 may use a control subset larger than the control subset of iteration I and may include the control subset of iteration I merged with an additional group of documents of pre-determined size randomly selected from the collection of documents.

Step 3110: run the text classifier most recently built on at least the collection of documents thereby to generate a final output and generating a computer display of the final output e.g. a histogram of ranks for each issue and/or a function of an indication of a quality measure (e.g. F-measure; precision; recall), such as a graph of the quality measure as a function of cutoff point, for each of a plurality of cutoff points and optionally a culling percentage including an integral of the graph.

A suitable data structure for implementing “method 23” may be stored in a relational database and on the file system. The tables in the database may include some or all of:

a. a Document table storing for each document a “Document key”, and an internal docID.
b. an Issue Table storing, for each issue, the issue name, issue ID, and Stability(issue) computed as described herein.
c. DocumentIssue table: a table with three columns: docID, issueID, and rank. Each row represents an individual docID which belongs to a certain issue, issueID, and has a particular rank as indicated by the corresponding classifier.
d. Classifier table: Each classifier has a unique ID, associated by the table with the issue the classifier was built for.
e. a ClassifierDocuments table having 3 columns: classifierID, docID, docType. DocType can be “train as positive example”, “train as negative example”, or “control”.
f. a Parameter table that holds all the parameters in the system.
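By way of illustration only, the following Python/SQLite sketch creates tables along the lines of items a-f above; the column names and types are assumptions, not a definitive schema:

    import sqlite3

    conn = sqlite3.connect("method23.db")
    # docType below is one of: "train as positive example",
    # "train as negative example", or "control".
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS Document (docID INTEGER PRIMARY KEY,
                                         documentKey TEXT);
    CREATE TABLE IF NOT EXISTS Issue (issueID INTEGER PRIMARY KEY,
                                      name TEXT, stability REAL);
    CREATE TABLE IF NOT EXISTS DocumentIssue (docID INTEGER, issueID INTEGER,
                                              rank REAL,
                                              PRIMARY KEY (docID, issueID));
    CREATE TABLE IF NOT EXISTS Classifier (classifierID INTEGER PRIMARY KEY,
                                           issueID INTEGER);
    CREATE TABLE IF NOT EXISTS ClassifierDocuments (classifierID INTEGER,
                                                    docID INTEGER,
                                                    docType TEXT);
    CREATE TABLE IF NOT EXISTS Parameter (name TEXT PRIMARY KEY, value TEXT);
    """)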

The “themes” method employed in step 370c of FIGS. 3a-3c may for example be performed as follows:

When enhancing expert-based computerized analysis of a set of digital documents, a system for computerized derivation of leads from a huge body of data may be provided, the system comprising:

an electronic repository including a multiplicity of accesses to a respective multiplicity of electronic documents and metadata including metadata parameters having metadata values characterizing each of said multiplicity of electronic documents;

a relevance rater using a processor to run a first computer algorithm on said multiplicity of electronic documents which yields a relevance score which rates relevance of each of said multiplicity of electronic documents to an issue; and

a metadata-based relevant-irrelevant document discriminator using a processor to rapidly run a second computer algorithm on at least some of said metadata which yields leads, each lead comprising at least one metadata value for at least one metadata parameter, which value correlates with relevance of said electronic documents to the issue.

The application is operative to find outliers of a given metadata and relevancy score (i.e. relevant, not relevant). When theme-exploring is used, the system can identify themes with a high relevancy score based on the given application. The above system, without theme-exploring, may compute the outlier for a given metadata, and each document appears once in each metadata. In the theme-exploring setting, for a given set of themes, the same document might fall in several of the metadata.

Themes in e-discovery: METHOD
step 1. Input: a set of electronic documents. The documents could be in:
Text format, native files (PDF, Word, PPT, etc.), ZIP files, PST, Lotus Notes, MSG, etc.
Step 2: Extract text from the data collection. Text extraction can be done by third-party software such as: Oracle Outside In, iSys, DTSearch, iFilter, etc.
Step 3: Compute Near-duplicate (ND) on the dataset.
The following teachings may be used: U.S. Pat. No. 8,015,124, entitled “A Method for Determining Near Duplicate Data Objects”; and/or WO 2007/086059, entitled “Determining Near Duplicate “Noisy” Data Objects” and/or suitable functionalities in commercially available e-discovery systems such as those of Equivio.
The output of this phase is, for each document, the following:
Step 3a: DuplicateSubsetID: all documents having the same DuplicateSubsetID have identical text.
Step 3b: EquiSetID: all documents having the same EquiSetID are similar (for each document x in the set there is another document y such that the difference between the two is less than some threshold).
Step 3c: Pivot: 1 if the document is a representative of the set (and 0 otherwise). The pivot document can be selected by a policy, for example: maximum number of words, minimum number of words, median number of words, minimum docID, etc. When using theme networking (TN) we recommend using maximum number of words as the pivot policy, because we would like the largest documents to be in the model.
Step 4. Compute Email threads (ET) on the dataset. The following teachings may be used: WO 2009/004324, entitled “A Method for Organizing Large Numbers of Documents” and/or suitable functionalities in commercially available e-discovery systems such as those of Equivio.
The output of this phase is a collection of trees, and all leaves of the trees are marked as inclusive. Note that family information is acceptable (to group e-mails with their attachments).
Step 5. Run a topic modeling algorithm (such as LDA) on a subset of the dataset, including feature extraction. Resulting topics are defined as themes. The subset includes the following documents:

    • Inclusive from Email threads (ET)
    • Pivots from all documents that are not e-mails. I.e. pivots from documents and attachments.

The data collection includes fewer files (usually about 50% of the total size), and the data do not include similar documents; therefore, if a document appears many times in the original data collection, it will have the same weight as if it appeared once.

In building the model, only documents with more than 25 (parameter) words and fewer than 20,000 words are used. The idea behind this limitation is to improve performance, and not to be influenced by high word frequency when the document has few features.

If the dataset is huge, at most 100,000 (parameter) documents are selected at random to build the model; after building the model, the model is applied to all other documents.
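The following is a minimal sketch of such a capped topic-modeling run using scikit-learn's LDA implementation (one possible choice; LDA is named above only as an example algorithm); the 0.01/0.2 document-frequency bounds echo the feature rules detailed below, and the theme count, cap, and seed are illustrative parameters:

    import random
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def build_theme_model(model_texts, other_texts, n_themes=50,
                          cap=100_000, seed=7):
        # model_texts: texts of inclusives and pivots used to build the
        # model; other_texts: remaining documents the model is applied to.
        if len(model_texts) > cap:
            model_texts = random.Random(seed).sample(model_texts, cap)
        vec = CountVectorizer(min_df=0.01, max_df=0.2, stop_words="english")
        x_model = vec.fit_transform(model_texts)
        lda = LatentDirichletAllocation(n_components=n_themes,
                                        random_state=seed)
        lda.fit(x_model)
        # Apply the fitted model to every other document.
        theme_weights = lda.transform(vec.transform(other_texts))
        return vec, lda, theme_weights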

The first step in the topic modeling algorithm is to extract features from each document.

A method suitable for the Feature extraction of step 5 may include getting features as follows:

A topic modeling algorithm uses features to create the model for the topic-modeling step 5 above. The features are words; to generate a list of words from each document, one may do the following:

If the document is an e-mail, remove all e-mail headers in the document, but keep the subject line. One may multiply the subject line to set some weight on the subject words. Tokenize the text using separators such as spaces, semicolons, colons, tabs, new lines, etc.
Ignore the following features (a sketch applying these rules appears after this list):

    • Words with length less than 3 (parameter)
    • Words with length greater than 20 (parameter)
    • Words that do not start with an alpha character
    • Words that are stop words
    • Words that appear more than 0.2 times the number of words in the document (parameter)
    • Words that appear in less than 0.01 times the number of documents (parameter)
    • Words that appear in more than 0.2 times the number of documents (parameter)

Stemming and part-of-speech tags may also be used as features.
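A minimal Python sketch applying the ignore rules above might read as follows; the stop-word list is an illustrative stub, and doc_freq is assumed to have been computed over the collection in a prior pass:

    import re

    STOP_WORDS = {"the", "and", "for", "with"}  # illustrative stub

    def extract_features(text, doc_freq, n_docs, min_len=3, max_len=20,
                         max_in_doc=0.2, min_df=0.01, max_df=0.2):
        # doc_freq: word -> number of documents containing it;
        # n_docs: collection size. All thresholds are parameters.
        tokens = re.split(r"[ \t\n;:,]+", text.lower())
        n_words = max(len(tokens), 1)
        counts = {}
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        feats = []
        for w, c in counts.items():
            if not (min_len <= len(w) <= max_len):
                continue  # too short or too long (also skips empty tokens)
            if not w[0].isalpha() or w in STOP_WORDS:
                continue
            if c > max_in_doc * n_words:
                continue  # too frequent within this document
            df = doc_freq.get(w, 0)
            if df < min_df * n_docs or df > max_df * n_docs:
                continue  # too rare or too common across the collection
            feats.append(w)
        return feats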

Step 8. Theme names. The output of step 5 includes an assignment of documents to themes, and an assignment of words (features) to themes. Each feature x has some probability P_xy of being in theme y. Using the P matrix, construct names for the themes.

In e-discovery one may use the following scenarios: Early Case Assessment, Post Case Assessment, and provision of helpful User Interfaces.

Early Case Assessment (some or all of steps a-h):

a. Select at random 100000 documents
b. Run near-duplicates (ND)
c. Run Email threads (ET)
d. Select pivot and inclusive
e. Run topic modeling using the above feature selection. The input of the topic modeling is a set of documents. The first phase of the topic modeling is to construct a set of features for each document. The feature getting method described above may be used to construct the set of features.
f. Run the model on all other documents (optional).
g. Generate theme names e.g. using step 8 above.
h. Explore the data by browsing themes; one may open a list of documents belonging to a certain theme; from a document one may see all themes connected to that document, and go to another theme.
The list of documents might be filtered by a condition set by the user, for example filtering all documents by date, relevancy, file size, etc.

The above procedure assists users in early case assessment when the data is known and the users wish to determine what is in the data, and to get an idea about the data collection.

In early case assessment one may randomly sample the dataset to obtain results more quickly.

Post Case Assessment

This process uses some or all of steps 1-5 above, but in this setting an entire dataset is not used, but rather, only the documents that are relevant to the case. If near-duplicates (ND) and Email threads (ET) have already run, there is no need to re-run them.

1st-pass review is a quick review of the documents that can be handled manually or by automatic predictive coding software; the user wishes to review the results and get an idea of the themes of the documents that passed that review. This phase is essential because the number of such documents might be very great indeed. Also, there are cases in which some sub-issues only contain a few documents.

The above building block can generate a procedure for such cases. Here, only documents that passed the 1st review phase are taken, and themes are computed for them.

User Interface: using the output of steps 1-5 and displaying results thereof.
Upon running the topic modeling, each resulting topic is defined as a theme, and the system displays, for each theme, the list of documents that are related to that theme. The user has an option to select a metadata field (for example, whether the document has relevancy to an issue, custodian, date range, file type, etc.) and the system will display, for each theme, the percentage of that metadata in that theme. Such a presentation assists the user while evaluating the theme.

An LDA model might have themes that can be classified as CAT_related and DOG_related. A theme has probabilities of generating various words, such as milk, meow, and kitten, which can be classified and interpreted by the viewer as “CAT_related”. The word cat itself will have high probability given this theme. The DOG_related theme likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as the (see function word), will have roughly even probability between classes (or can be placed into a separate category). A theme is not strongly defined, neither semantically nor epistemologically. It is identified on the basis of supervised labeling and (manual) pruning on the basis of likelihood of co-occurrence. A lexical word may occur in several themes with a different probability, however, with a different typical set of neighboring words in each theme.

Each document is assumed to be characterized by a particular set of themes. This is akin to the standard bag of words model assumption, and makes the individual words exchangeable.

Processing a large data set requires time and space; in the context of the current invention, N documents are selected to create the model, and then the model is applied to the remaining documents.

When selecting the documents to build the model there may be a few options:

    • 1. Take all documents.
    • 2. Take one document for each set of exact duplicate documents.
    • 3. Take one document from each EquiSet (e.g. as per U.S. Pat. No. 8,015,124, entitled “A Method for Determining Near Duplicate Data Objects”; and/or WO 2007/086059, entitled “Determining Near Duplicate “Noisy” Data Objects”).
    • 4. Take the inclusives from the data collection. (“Inclusive” may refer to an email that belongs to and also typically culminates an email thread, e.g. includes an entire email conversation (e.g. the original email, the “reply”, the next “reply”, the “forward”, etc.).)

The idea behind 2, 3, 4 is to try and create themes that are known to the user, and also not to give weight to documents that already appear in a known set.

The input for the algorithm is a text document that can be parsed into a bag-of-words.
When processing an e-mail, one may notice that the e-mail contains a header (From, To, CC, Subject) and a body. The body of an e-mail can be formed by a series of e-mails.
For example:

From: A
To: B
Subject: CCCCC
Body1 Body1

From: B
To: A
Subject: CCCCC
Body2 Body2

While processing e-mails for topic modeling, one may consider removing all e-mail headers within the body, and setting a weight on the subject by using multiple subject lines. In the above example the processed text would be:

CCCCC CCCCC CCCCC Body1 Body1 Body2 Body2
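A minimal Python sketch of this header removal and subject weighting follows; the header pattern, the de-duplication of repeated subject lines, and the weight of 3 are assumptions chosen to reproduce the example output above:

    import re

    HEADER_RE = re.compile(r"^(from|to|cc|bcc|date|sent):", re.IGNORECASE)

    def preprocess_email(raw, subject_weight=3):
        # Drop From:/To:/CC: header lines wherever they occur in the body
        # (quoted replies included); keep Subject lines and repeat the
        # distinct subject text subject_weight times for extra weight.
        subjects, body = [], []
        for line in raw.splitlines():
            s = line.strip()
            if not s:
                continue
            if s.lower().startswith("subject:"):
                subjects.append(s[len("subject:"):].strip())
            elif HEADER_RE.match(s):
                continue
            else:
                body.append(s)
        distinct = list(dict.fromkeys(subjects))  # dedupe, keep order
        return " ".join(distinct * subject_weight + body)

Applied to the two-e-mail example above, this returns “CCCCC CCCCC CCCCC Body1 Body1 Body2 Body2”.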

Step 8 (Theme names) is now described in detail:

Let P(w_i, t_j) be the probability that feature w_i belongs to theme t_j. In known implementations the theme name is a list of the words with the highest probability. This solution is good when the dataset is sparse, i.e. the vocabularies of the themes differ from one another. In e-discovery the issues are highly connected and therefore there are cases in which the same “top” words appear in two or more themes. In this setting, a “stable marriage”-style algorithm may be used to pair words to themes. The algorithm may include:

Order the themes by some criteria (size, quality, # of relevant documents, etc.); i.e. theme3 is better than theme4.

(1) Create an empty set S

(2) Sort themes by some criteria

(3) For j=0; j<maximum words in theme name; j++

(4) For i=0; i<number of themes; i++ do

(5) Take theme_i and assign it the word with the highest score that is not in S, adding that word to S

After X words are assigned for each theme, the number of words may be reduced by, for example, taking only those words in each theme whose rank is greater than the maximum word rank in that theme divided by some constant.
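The following Python sketch expresses steps (1)-(5) as a greedy, round-robin assignment; the layout of the P matrix (a per-theme dict of word probabilities) and the default name length are assumptions:

    def name_themes(P, themes, words, max_name_words=3):
        # P[t][w]: probability that word w belongs to theme t.
        # themes: pre-sorted best-first (step 2). Each round (step 3)
        # hands every theme, in order (step 4), its highest-scoring
        # word not already taken (step 5); 'taken' is the set S (step 1).
        taken = set()
        names = {t: [] for t in themes}
        for _ in range(max_name_words):
            for t in themes:
                candidates = [w for w in words if w not in taken]
                if not candidates:
                    break
                best = max(candidates, key=lambda w: P[t].get(w, 0.0))
                taken.add(best)
                names[t].append(best)
        return {t: " ".join(ws) for t, ws in names.items()}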

According to certain embodiments of the present invention, cutoffs or thresholds for converting per-document relevance ranks into binary relevance data (each document being deemed either relevant to an information governance category or not relevant) may be set to ascertain that documents required to be retained will in fact be retained at least as reliably as they would have been retained under a scheme pre-known to be compliant with an information governance regime, e.g. a completely manual-document-categorization scheme. For example, a safety factor (scalar) may be selected to ensure that the probability that a document required to be retained is not in fact retained, due to use of computerized classifiers as described herein, is less (by the safety factor) than the probability that the same document required to be retained would not in fact have been retained, if a scheme pre-known to be compliant with an information governance regime, e.g. a manual-document-categorization scheme, had been used.

U.S. Pat. No. 8,346,685, entitled “A Computerized System For Enhancing Expert-Based Processes And Methods Useful In Conjunction Therewith” describes a method for receiving input from a plurality of experts, performing a computerized comparison of input received from the plurality of experts thereby to identify points of discrepancy, and selecting a subset of “better” computerized expert-based processes. This technology may be used to ensure, e.g. by suitable cutoff selection, that a new set of classifiers is at least as successful as a scheme pre-known to be compliant with an information governance regime.

The system may, if desired, be implemented as a web-based system employing software, computers, routers and telecommunications equipment as appropriate. Some or all functionalities e.g. software functionalities shown and described herein may be deployed in a cloud environment. Clients (who seek to manage their document repositories given information governance requirements) may be operatively associated with, but external to, the cloud.

The methods shown and described herein are particularly useful in governing and implementing document retention requirements for bodies of knowledge including hundreds, thousands, tens of thousands, hundreds of thousands, millions, or hundreds of millions of electronic documents or other computerized information repositories, some or many of which are themselves at least tens or hundreds, or even thousands of pages long. This is because, practically speaking, such large bodies of knowledge can only be processed, analyzed, sorted, or searched using computerized technology.

It is appreciated that terminology such as “mandatory”, “required”, “need” and “must” refer to implementation choices made within the context of a particular implementation or application described herewithin for clarity and are not intended to be limiting since in an alternative implementation, the same elements might be defined as not mandatory and not required or might even be eliminated altogether.

It is appreciated that software components of the present invention including programs and data may, if desired, be implemented in ROM (read only memory) form including CD-ROMs, EPROMs and EEPROMs, or may be stored in any other suitable typically non-transitory computer-readable medium such as but not limited to disks of various kinds, cards of various kinds and RAMs. Components described herein as software may, alternatively, be implemented wholly or partly in hardware and/or firmware, if desired, using conventional techniques. Conversely, components described herein as hardware may, alternatively, be implemented wholly or partly in software, if desired, using conventional techniques.

Included in the scope of the present invention, inter alia, are electromagnetic signals carrying computer-readable instructions for performing any or all of the steps or operations of any of the methods shown and described herein, in any suitable order, including simultaneous performance of suitable groups of steps as appropriate; machine-readable instructions for performing any or all of the steps of any of the methods shown and described herein, in any suitable order; program storage devices readable by machine, tangibly embodying a program of instructions executable by the machine to perform any or all of the steps of any of the methods shown and described herein, in any suitable order; a computer program product comprising a computer useable medium having computer readable program code, such as executable code, embodied therein, and/or including computer readable program code for performing, any or all of the steps of any of the methods shown and described herein, in any suitable order; any technical effects brought about by any or all of the steps of any of the methods shown and described herein, when performed in any suitable order; any suitable apparatus or device or combination of such, programmed to perform, alone or in combination, any or all of the steps of any of the methods shown and described herein, in any suitable order; electronic devices each including a processor and a cooperating input device and/or output device and operative to perform in software any steps shown and described herein; information storage devices or physical records, such as disks or hard drives, causing a computer or other device to be configured so as to carry out any or all of the steps of any of the methods shown and described herein, in any suitable order; a program pre-stored e.g. in memory or on an information network such as the Internet, before or after being downloaded, which embodies any or all of the steps of any of the methods shown and described herein, in any suitable order, and the method of uploading or downloading such, and a system including server/s and/or client/s for using such; a processor configured to perform any combination of the described steps or to execute any combination of the described modules; and hardware which performs any or all of the steps of any of the methods shown and described herein, in any suitable order, either alone or in conjunction with software. Any computer-readable or machine-readable media described herein is intended to include non-transitory computer- or machine-readable media.

Any computations or other forms of analysis described herein may be performed by a suitable computerized method. Any step described herein may be computer-implemented. The invention shown and described herein may include (a) using a computerized method to identify a solution to any of the problems or for any of the objectives described herein, the solution optionally including at least one of a decision, an action, a product, a service or any other information described herein that impacts, in a positive manner, a problem or objective described herein; and (b) outputting the solution.

The scope of the present invention is not limited to structures and functions specifically described herein and is also intended to include devices which have the capacity to yield a structure, or perform a function, described herein, such that even though users of the device may not use the capacity, they are, if they so desire, able to modify the device to obtain the structure or function.

Features of the present invention which are described in the context of separate embodiments may also be provided in combination in a single embodiment.

For example, a system embodiment is intended to include a corresponding process embodiment. Also, each system embodiment is intended to include a server-centered “view” or client-centered “view”, or “view” from any other node of the system, of the entire functionality of the system, computer-readable medium, or apparatus, including only those functionalities performed at that server or client or node.

Conversely, features of the invention, including method steps, which are described for brevity in the context of a single embodiment or in a certain order, may be provided separately or in any suitable subcombination or in a different order. “e.g.” is used herein in the sense of a specific example which is not intended to be limiting. Devices, apparatus or systems shown coupled in any of the drawings may in fact be integrated into a single platform in certain embodiments or may be coupled via any appropriate wired or wireless coupling such as but not limited to optical fiber, Ethernet, Wireless LAN, HomePNA, power line communication, cell phone, PDA, Blackberry, GPRS, Satellite including GPS, or other mobile delivery. It is appreciated that in the description and drawings shown and described herein, functionalities described or illustrated as systems and sub-units thereof can also be provided as methods and steps therewithin, and functionalities described or illustrated as methods and steps therewithin can also be provided as systems and sub-units thereof. The scale used to illustrate various elements in the drawings is merely exemplary and/or appropriate for clarity of presentation and is not intended to be limiting.

Claims

1. An information governance system comprising:

a plurality of classifiers which employ cutoffs for classifying at least a portion of a population of incoming documents as documents to be retained and documents to be discarded in accordance with a retention policy comprising a corresponding plurality of pre-defined retention schedules;
training apparatus for training said classifiers based on relevance inputs provided by a human information governance expert regarding a training set of documents within a universe of documents to be governed; and
retain/discard apparatus operative to automatically cause any classified document to be retained and subsequently discarded in accordance with its pre-defined retention schedule including discarding only documents that (a) have been classified as documents to be discarded and (b) have not been classified as documents to be retained, and to automatically cause any document which could not be classified, to be retained as gray area data until further notice.

2. A system according to claim 1 and also comprising computerized apparatus for identifying at least one cluster of related documents within a set of “gray area” documents which could not be classified.

3. A system according to claim 1 wherein at least one of the plurality of pre-defined retention schedules calls for documents to be discarded immediately.

4. A system according to claim 1 wherein the training apparatus is operative to train each of said classifiers until a predetermined precision measure has been achieved.

5. A system according to claim 1 and also comprising threshold adjustment functionality operative to quantify a false-negative error rate, resulting in premature discarding of documents, and, if excessive, to adjust at least one threshold employed by said classifiers accordingly.

6. A system according to claim 5 wherein a false-negative error rate is deemed excessive if a pre-stored human categorizer's false-negative error rate is lower.

7. A system according to claim 1 and also comprising identifying and discarding older near-duplicates of at least one retained document.

8. An information governance method comprising:

generating a plurality of classifiers for classifying electronic documents into a corresponding plurality of documentation retention categories;
running training iterations thereby to improve at least one of the plurality of classifiers;
classifying a repository of electronic documents using said plurality of classifiers and running a Logarithmic stratified sampling-based Quality Assurance process to compute precision in cases of low or unknown richness including ordering documents by their ranks then partitioning the ranks into slices: [0,p], [p,2p], [2p,4p], ..., and randomly selecting documents to represent each slice, thereby to generate Quality Assurance results;
if the Quality Assurance results are not deemed good enough, improve the classifier and return to one of said running steps;
if the Quality Assurance results are good enough, use the last classifier to implement a plurality of document retention settings corresponding to said plurality of documentation retention categories.

9. A method according to claim 8 and also comprising:

using a processor for identifying at least one cluster of related documents within a set of “gray area” documents which could not be classified;
generating at least one additional classifier for classifying electronic documents into each of at least one cluster of related documents;
repeating said classifying step using both the plurality of classifiers and the at least one additional classifier thereby to reduce the percentage of “gray area” documents; and
implementing document retention settings including: settings corresponding to said plurality of documentation retention categories and settings corresponding to each of said at least one cluster of related documents.

10. A method according to claim 8 wherein for each individual classifier, stability measures are computed based on cross validation and percentage of relevant documents in the individual classifier's training set.

11. A method according to claim 9 wherein said using a processor for identifying comprises using Equivio Themes functionality for identifying at least one cluster of related documents within a set of “gray area” documents which could not be classified.

12. A system according to claim 1 and also comprising a rule repository operative to map said retention policy to retention time and wherein said rules are accessed by said retain/discard apparatus.

13. A method according to claim 8 wherein if the Quality Assurance results are not deemed good enough for an individual classifier, the individual classifier is improved by adding documents used in said quality assurance process to the individual classifier's training set thereby to generate an expanded training set and re-training the individual classifier using the expanded training set.

14. A method according to claim 8 wherein said plurality of documentation retention categories includes at least one category of documents to be retained and at least one category of documents to be immediately discarded.

15. A method according to claim 14 wherein for each classifier from among said plurality of classifiers which corresponds to a category of documents to be retained, high and low cutoff points are set.

16. A method according to claim 15 wherein for each classifier from among said plurality of classifiers which corresponds to a category of documents to be immediately discarded, just one cutoff is set.

17. A method according to claim 15 wherein an individual document is:

retained if the individual document's relevance exceeds the high cutoff of a retention category to which said individual document belongs, and
discarded if the document's relevance both falls above a cutoff of a category of documents to be immediately discarded to which the individual document belongs and falls below all low cutoff points of all retention categories.

18. A method according to claim 17 wherein the individual document is retained as a gray area document if the individual document is not discarded and if the individual document falls below the high cutoff of the retention category to which said individual document belongs.

19. A computer program product, comprising a non-transitory tangible computer readable medium having computer readable program code embodied therein, said computer readable program code adapted to be executed to implement an information governance method comprising:

generating a plurality of classifiers for classifying electronic documents into a corresponding plurality of documentation retention categories;
running training iterations thereby to improve at least one of the plurality of classifiers;
classifying a repository of electronic documents using said plurality of classifiers and running a Logarithmic stratified sampling-based Quality Assurance process to compute precision in cases of low or unknown richness including ordering documents by their ranks then partitioning the ranks into slices: [0,p], [p,2p], [2p,4p], ..., and randomly selecting documents to represent each slice, thereby to generate Quality Assurance results;
if the Quality Assurance results are not deemed good enough, improve the classifier and return to one of said running steps;
if the Quality Assurance results are good enough, use the last classifier to implement a plurality of document retention settings corresponding to said plurality of documentation retention categories.
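
By way of non-limiting illustration of the Logarithmic stratified sampling-based Quality Assurance process recited in claims 8 and 19 above, a minimal sketch in Python follows; the slice width p, the per-slice sample size k and the [0, 1] rank scale are assumed values.

    import random

    def qa_sample(docs, p=0.05, k=3, seed=0):
        # docs: list of (doc_id, rank) pairs with ranks in [0, 1].
        rng = random.Random(seed)
        # Build the logarithmic slice boundaries 0, p, 2p, 4p, 8p, ...
        bounds = [0.0, p]
        while bounds[-1] < 1.0:
            bounds.append(min(bounds[-1] * 2, 1.0))
        samples = {}
        for low, high in zip(bounds, bounds[1:]):
            # Documents whose rank falls in this slice (top slice
            # inclusive of rank 1.0).
            in_slice = [d for d, r in docs
                        if low <= r < high or (high == 1.0 and r == 1.0)]
            if in_slice:
                # Randomly select up to k documents to represent the slice.
                samples[(low, high)] = rng.sample(
                    in_slice, min(k, len(in_slice)))
        return samples

    docs = [(i, i / 99.0) for i in range(100)]
    for slice_bounds, chosen in qa_sample(docs).items():
        print(slice_bounds, chosen)

The sampled documents from each slice would then be reviewed by the human expert, and the per-slice results combined to estimate precision even when richness is low or unknown.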
Patent History
Publication number: 20140207786
Type: Application
Filed: Oct 24, 2013
Publication Date: Jul 24, 2014
Applicant: EQUIVIO LTD. (Rosh HaAyin)
Inventors: Liad TAL-ROTHSCHILD (Givataim), Yiftach RAVID (Rosh HaAyin), Amir MILO (Kfar Saba), Warwick SHARP (Raanana), Theresa Beaumont (San Francisco, CA)
Application Number: 14/062,233
Classifications
Current U.S. Class: Cataloging (707/740)
International Classification: G06F 17/30 (20060101);