Inside the Box

In describing predictive coding systems, it’s important to distinguish document-based systems from corpus-based systems. Document-based systems make their predictions based on the similarity of each document to a single, previously categorized document. Corpus-based systems can, in addition, use higher-order properties of groups of previously categorized documents to make their predictions. Because of this advantage, corpus-based systems are less affected by errors in coding individual documents.

We already have terms to describe the analytic engines under the hood: k-nearest-neighbor, support vector machines, naive Bayes, etc. We also distinguish sampling strategies: random, relevance, uncertainty, etc. See, e.g., The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 Fed. Cts. L. Rev. 7 (January 2013), available at

Maura R. Grossman of Wachtell, Lipton, Rosen & Katz and Gordon V. Cormack of the University of Waterloo have articulated a taxonomy of ediscovery review methods leading up to, and including, predictive coding systems. Grossman, Maura R., and Cormack, Gordon V. (May 2015). A Tour of Technology-Assisted Review. Draft submitted for publication in Perspectives on Predictive Coding and Other Advanced Search and Review Technologies for the Legal Practitioner (Chicago, Ill.: American Bar Association). Retrieved from

The Grossman-Cormack taxonomy identifies three core characteristics of predictive coding systems: whether they are “passive” or “active,” whether they use “simple” or “continuous” learning strategies, and whether the initial training documents are selected randomly, judgmentally, or by a hybrid method. Active systems provide the reviewers with potential training documents selected using machine intelligence, whereas passive systems do not. In simple systems, there is a training phase followed by a review phase. In continuous systems, the training phase and the review phase are merged. See also Cormack and Grossman, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR ’14, July 6–11, 2014 (finding that Continuous Active Learning (“CAL”) systems are superior to Simple Passive Learning and Simple Active Learning systems).
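The simple/continuous distinction can be sketched in miniature. This is a toy illustration, not any vendor’s algorithm: the word-overlap scorer, the function names, and the one-document-per-round review loop are all stand-ins for a real analytic engine.

```python
# Toy contrast between a "simple" protocol (train once, then review in a
# fixed ranked order) and a "continuous" protocol (every judgment retrains
# the model). The scorer is crude word overlap, standing in for a real
# engine such as an SVM or naive Bayes classifier.

def score(doc, responsive_examples):
    """Relevance score: best word overlap with any known responsive doc."""
    words = set(doc.split())
    return max((len(words & set(r.split())) for r in responsive_examples),
               default=0)

def simple_learning(docs, labels, seed_ids, budget):
    """Train on the seeds once, rank the rest, review the top `budget`.
    Returns the responsive documents found within the budget."""
    responsive = [docs[i] for i in seed_ids if labels[i]]
    ranked = sorted((i for i in range(len(docs)) if i not in seed_ids),
                    key=lambda i: -score(docs[i], responsive))
    return [i for i in ranked[:budget] if labels[i]]

def continuous_active_learning(docs, labels, seed_ids, budget):
    """Re-rank after every review decision; each judgment updates the model.
    Returns the responsive documents found within the budget."""
    responsive = [docs[i] for i in seed_ids if labels[i]]
    reviewed = set(seed_ids)
    found = []
    for _ in range(budget):
        candidates = [i for i in range(len(docs)) if i not in reviewed]
        if not candidates:
            break
        best = max(candidates, key=lambda i: score(docs[i], responsive))
        reviewed.add(best)
        if labels[best]:                    # reviewer judges it responsive
            found.append(best)
            responsive.append(docs[best])   # the model learns immediately
    return found
```

In the continuous version, every reviewer judgment immediately updates the model that selects the next document; in the simple version, the ranking is fixed once the seed training is done.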

I propose an additional way of categorizing predictive coding systems: as either document-based or corpus-based. As discussed below, I believe that document-based systems are more affected by errors in coding individual documents.

The following descriptions of these two distinct types are based on commercially successful systems in use today.

Document-Based Systems

In some systems, a designated senior attorney begins by carefully identifying examples of what she deems to be representative responsive and unresponsive documents (“seed documents”). The analytic engine breaks down each seed document into discrete features. It then finds which of the remaining unreviewed documents have features in common with each individual seed document. Reviewers then review and categorize these similar documents. The engine then performs the same analysis on these first-round documents, and so on.

And so, in this method, an initial seed document becomes the trunk of a tree. The first-round documents are like branches on this tree. The second round creates sub-branches on those branches, and so on.
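That branching process can be sketched with an invented word-overlap similarity standing in for a real feature engine (the function names and threshold are illustrative):

```python
# Sketch of one round of a document-based review: each previously coded
# responsive document becomes the root of a branch, pulling in the
# unreviewed documents most similar to it individually.

def similarity(a, b):
    """Jaccard word overlap -- a stand-in for a real feature comparison."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def expand(docs, coded_responsive, unreviewed, threshold=0.2):
    """Return, for each coded responsive document, the unreviewed documents
    similar enough to it to be routed to reviewers (its new branch)."""
    branches = {}
    for seed_id in coded_responsive:
        branches[seed_id] = [i for i in unreviewed
                             if similarity(docs[seed_id], docs[i]) >= threshold]
    return branches
```

Note the fragility: a responsive document miscoded as unresponsive never enters `coded_responsive`, so the branch it would have rooted is never explored.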

In such systems, a misjudgment about a single document can foreclose examination of an entire branch and its sub-branches. In the language of chaos theory, document reviews in these systems have “a very high degree of sensitivity to initial conditions and to the way in which they are set in motion.”

(That is not to say that such a coding error would necessarily preclude review of all documents on that branch. Some of those documents could fortuitously be found on other branches, or on other trees.)

Corpus-Based Systems

R. Buckminster Fuller holds up a tensegrity sphere. 18 April 1979.

Corpus-based predictive coding systems have a fundamental advantage over document-based systems. Instead of predicting based only on the features of an individual document, corpus-based systems can use any feature that a document-based system can use, and many others as well. That’s because sets of documents have measurable emergent properties that individual documents do not, such as the predominance of a particular feature within a set or subset, the degree of overall similarity among the documents in a set or subset, and richness (the proportion of responsive documents in a collection).
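Those emergent properties are simple to state concretely. The following is a sketch assuming set-of-words features; the function names and the centroid measure are illustrative, not drawn from any particular product:

```python
# Corpus-level properties that exist only for sets of documents:
# predominance of a feature, similarity of a document to a set as a
# whole, and richness.
from collections import Counter

def feature_counts(doc_set):
    """How many documents in the set contain each word feature."""
    counts = Counter()
    for doc in doc_set:
        counts.update(set(doc.split()))
    return counts

def predominance(feature, doc_set):
    """Share of documents in the set containing the feature."""
    return feature_counts(doc_set)[feature] / len(doc_set)

def centroid_similarity(doc, doc_set):
    """How typical a document is of the whole set (not of any one member):
    average document-frequency of its features across the set."""
    counts = feature_counts(doc_set)
    words = set(doc.split())
    return sum(counts[w] for w in words) / (len(doc_set) * max(len(words), 1))

def richness(labels):
    """Proportion of responsive documents in a collection."""
    return sum(labels) / len(labels)
```

No single document has a predominance or a richness; these are properties of the set, which is exactly what a document-by-document comparison cannot see.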

Because corpus-based systems can base their predictions on many features of document sets that document-based systems cannot use, they are less affected by individual coding errors.

For example, suppose that, in a document-based responsiveness review, a coding error precludes review of a responsive document. A corpus-based review could find that document by other means, such as its similarity to the overall set of responsive documents or some subset.

Also, using corpus-wide analytics, corpus-based systems can locate early errors by finding which early judgments appear to be statistical outliers in light of the weight of all later judgments about documents in that category. These outliers, which can include seed documents or any other training documents, can then be re-evaluated.
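One way such outlier-hunting could work is sketched below with a similarity-weighted vote; the voting rule and word-overlap similarity are assumptions for illustration, not a description of any actual system:

```python
# Sketch of corpus-based error detection: flag coded documents whose label
# disagrees with the similarity-weighted vote of all other coded documents.

def similarity(a, b):
    """Jaccard word overlap -- a stand-in for a real feature comparison."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def outlier_judgments(docs, labels):
    """Return indices of documents whose coding looks like a statistical
    outlier against the weight of all the other judgments."""
    flagged = []
    for i in range(len(docs)):
        w_resp = sum(similarity(docs[i], docs[j])
                     for j in range(len(docs)) if j != i and labels[j])
        w_nonresp = sum(similarity(docs[i], docs[j])
                        for j in range(len(docs)) if j != i and not labels[j])
        predicted = w_resp > w_nonresp
        if predicted != labels[i]:
            flagged.append(i)
    return flagged
```

A flagged document is not necessarily miscoded; it is simply a judgment that the weight of the other judgments disagrees with, and so a candidate for re-evaluation.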

As a consequence, corpus-based systems can be more resilient and robust than document-based systems: they can tolerate individual errors far better.

To Be Continued

A later post will take up the ramifications for the use of seed sets, training sets, and control sets.