Understanding Legal Tech Transformation in Document Review
Recent LinkedIn data shared with TIME magazine showed that references to “generative AI” appeared in posts 36 times more often in 2023 than in 2022. 2023 was undoubtedly the year that generative AI became mainstream for the general business world.
For the legal sector, firms and corporate legal departments alike are beginning to embrace this new technology to streamline legal processes and optimise outputs, particularly in the areas of document review, compliance and eDiscovery. But lawyers rushing to adopt these cutting-edge capabilities, keen to get ahead of the competition, must remember that the human component is just as vital to get right as the technology itself.
Investment of money and time is required in both, to make sure the people working alongside the technology understand the need to engage with it, how it affects their roles and how their daily working practices and routines must change. Without this emphasis on the people element, law firms and in-house teams risk wasting money on expensive, flashy tech that never delivers the benefits such an investment promises.
The evolution of technologies around eDiscovery illustrates the point well – and the better lawyers understand these technologies, the more effectively they can work with them, to the benefit of their firms and organisations. So here are some practical points about how the technology works, to help lawyers involved in the eDiscovery process develop this understanding.
How the Technology Works
As the volume of data generated by corporate activities continues to increase at an accelerating pace, and as new data preservation tools are developed, the volume of data involved in the eDiscovery process is multiplying. Consequently, many organisations are concerned about the cost of document review. For more than a decade, companies and legal professionals have been using Technology Assisted Review (TAR) and other advanced technologies, i.e. reviewing documents using machine-learning algorithms that classify documents based on human inputs.
Outlined below is a summary of how technology has evolved from Sample Based Learning to Continuous Active Learning, and how the use of Artificial Intelligence (AI) in document review has become increasingly streamlined.
While TAR has been part of the eDiscovery industry for almost twenty years, the operational method behind TAR is not standardised due to the following challenges:
- Development of new tools. eDiscovery tools, including forensic tools for collecting data, analytic and TAR tools, and platforms for reviewing documents, are regularly being developed and improved upon. As these tools get adopted, even newer advanced tools continue to be further developed to improve efficiency and accuracy. What may be considered best practice at one time may quickly be superseded as tools are enhanced and new tools are invented.
- Standards can differ by region. As the use of TAR and analytic technology varies worldwide, how TAR is utilised depends on the country and the case. For example, in Japan, lawyers who have not worked on foreign litigation or regulatory investigations are often unfamiliar with TAR. While there is room to utilise TAR for document review in internal investigation matters, not all lawyers and in-house counsel are willing to adopt new technology.
Due to rapid changes in advanced eDiscovery technologies (including the use of AI), experts in the field can provide valuable insight on today’s best practices and how to tailor them to a project.
What Technology is Currently Used?
Due to the complexity of the use of AI technologies in eDiscovery, it is beneficial for lawyers and in-house counsel to understand the concepts. With such knowledge, they can direct external eDiscovery experts in achieving the overall objectives, focusing on designing and troubleshooting aspects of a workflow. Here are some general TAR concepts:
- Sample Based Learning. An early form of TAR is Sample Based Learning, sometimes referred to as TAR 1.0, simple passive learning (SPL), or simple active learning (SAL). This process starts with a training round called the “control set,” where a reviewer knowledgeable about the case reviews a few thousand randomly sampled documents. The control set is used as a representative sample of the review pool to benchmark the model at each step. Additional random samples called the “training sets” are then reviewed. The training sets are used to train the model, which classifies all the documents. The model’s predictions are then checked in a “QC round,” where metrics such as precision, recall, and overturns are assessed; the model is either validated or additional review rounds are completed until it is.
- Continuous Active Learning. Continuous Active Learning is sometimes called TAR 2.0. In this method, there is no separation between training and review because the AI actively learns. An active learning review usually requires a certain number of documents to be binary-coded before the AI propagates the relevance scores. Once a suitable number of documents are manually coded as relevant and a suitable number of documents are manually coded as not relevant, the model will begin training and continuously update the relevance scores based on documents coded by reviewers while serving up documents with the highest relevance scores to the front of the review queue. Each coded document can help improve the accuracy of the model’s predictions. Like Sample Based Learning, Continuous Active Learning typically ends with a validation test where a sample set of unreviewed documents is manually reviewed to estimate how many, if any, relevant documents would be missed.
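To make the Continuous Active Learning loop described above concrete, here is a minimal, dependency-free sketch in Python. The "model" is a naive word-overlap scorer standing in for a real machine-learning classifier, and the documents, seed codings and relevance judgments are all invented for illustration:

```python
# Sketch of a Continuous Active Learning review queue. A naive word-overlap
# scorer stands in for the real classifier; all data below is invented.

def score(doc, relevant_docs):
    """Score a document by word overlap with documents coded relevant."""
    vocab = set()
    for d in relevant_docs:
        vocab |= set(d.split())
    words = doc.split()
    return sum(w in vocab for w in words) / len(words) if words else 0.0

docs = [
    "merger agreement draft between parties",
    "lunch menu for the office cafeteria",
    "board minutes discussing the merger",
    "holiday rota for the review team",
    "due diligence checklist for the merger agreement",
]
truly_relevant = {0, 2, 4}      # the human reviewer's judgment (invented)
coded = {0: True, 1: False}     # seed set: one relevant, one not relevant

while len(coded) < len(docs):
    rel = [docs[i] for i, r in coded.items() if r]
    # rank unreviewed documents by relevance score, highest first
    queue = sorted((i for i in range(len(docs)) if i not in coded),
                   key=lambda i: -score(docs[i], rel))
    nxt = queue[0]                       # serve top of queue to the reviewer
    coded[nxt] = nxt in truly_relevant   # reviewer codes it; model re-ranks
```

Even with this toy scorer, the two relevant documents (4 and 2) are served before the non-relevant one (3): each coding decision updates the model, which pushes likely relevant material to the front of the queue.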
Each method has its pros and cons – for example, Sample Based Learning can effectively predict the whole review scope at an earlier stage, while Continuous Active Learning may generate smaller review populations because its model updates continuously. One of the advantages of Continuous Active Learning is that it can be set up relatively easily, and it may be helpful for the following purposes:
- Reducing the Scope. One way of using Continuous Active Learning is to reduce the review scope by eliminating non-relevant documents. As set out above, the AI model assigns a relevance score to each document and serves the documents to the human review team for manual review. After the model has stabilised, the review team may decide to review only documents above a certain relevance score and conduct a sample review of the non-reviewed documents to assess the accuracy of the prediction. In this way, review cost and time can be reduced by removing most presumably non-relevant documents from manual review.
- Prioritisation. Once the AI model has enough documents to stabilise its judgment, it will prioritise documents it deems more likely relevant among the remaining documents for human review. In other words, as the model is being updated, documents with a high relevance score are prioritised for review. This prioritisation is especially useful in detecting hot documents at an early stage and is often utilised in internal investigations where the overall picture of the case may not be clear at the beginning.
- Quality Control. The Continuous Active Learning tool can be used for quality control during and after the linear review. One way this can be done is to train the AI with the coding of a smaller group of reviewers (typically the team leads or more experienced reviewers) within the team, and instead of having the AI continuously automatically update the relevance scores, a review manager will instruct the AI to update the model from time to time or at the end of the linear review. A document with a high relevance score but marked as not relevant during the linear review or a document with a low relevance score but tagged as relevant during the linear review would constitute a part of the quality control scope.
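The quality-control use just described amounts to flagging documents where the model's relevance score and the human coding strongly disagree. Here is a short Python sketch of that idea; the document IDs, scores, codings and thresholds are all invented for illustration:

```python
# Sketch of score-versus-coding quality control after a linear review:
# documents where the model and the reviewer disagree strongly are
# flagged for a second look. All values below are invented.

HIGH, LOW = 0.8, 0.2  # example thresholds; in practice set per project

review = [
    # (doc_id, model relevance score, human coding from linear review)
    ("DOC-001", 0.95, "relevant"),
    ("DOC-002", 0.91, "not relevant"),  # high score, coded not relevant
    ("DOC-003", 0.10, "relevant"),      # low score, coded relevant
    ("DOC-004", 0.05, "not relevant"),
]

qc_scope = [doc_id for doc_id, score, coding in review
            if (score >= HIGH and coding == "not relevant")
            or (score <= LOW and coding == "relevant")]
```

Here only DOC-002 and DOC-003 land in the quality-control scope; the documents where model and reviewer agree are left alone.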
Accordingly, the main benefits of utilising Continuous Active Learning are 1) reducing cost and 2) preserving accuracy. One can either continue to review documents until the machine stops serving up likely relevant documents, or manually stop reviewing once satisfactory results are obtained. Either way, time and cost are reduced by eliminating a large part of the review pool from manual review. Alternatively, one can manually review the entire pool of documents and use Continuous Active Learning to prioritise likely relevant documents.
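Stopping early is usually justified with the validation test mentioned earlier: a random sample of the unreviewed, presumed non-relevant documents is manually reviewed, and the rate of relevant documents found in the sample (sometimes called the elusion rate) is extrapolated to the whole unreviewed pool. The Python sketch below illustrates the arithmetic only; the pool size, sample size and relevance rate are invented:

```python
# Sketch of an elusion-style validation test. A random sample of the
# unreviewed pool is "manually reviewed", and the hit rate is scaled up
# to estimate how many relevant documents the cutoff would miss.
# All numbers are invented for illustration.
import random

random.seed(7)
unreviewed = 50_000                 # docs below the relevance cutoff
sample_size = 500
# pretend 1% of the unreviewed pool is actually relevant
relevant_ids = set(random.sample(range(unreviewed), unreviewed // 100))

sample = random.sample(range(unreviewed), sample_size)
hits = sum(doc in relevant_ids for doc in sample)  # found in manual review
elusion_rate = hits / sample_size
estimated_missed = round(elusion_rate * unreviewed)
```

If the estimated number of missed relevant documents is acceptably low, the team can defensibly stop reviewing; if not, the cutoff is lowered and review continues.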
Continuous Active Learning also has some limitations:
- Base Volume. TAR methodologies in general, and Continuous Active Learning in particular, may be less effective for small volumes of documents. While using TAR for a small document population is possible, the resulting cost and time savings would likely be insignificant compared to a manual review of the entire scope.
- Document Type. Certain types of documents may be unsuitable for the active learning process. For example, as Continuous Active Learning is built on the textual information of each document, documents without textual content, such as images, audio files and videos, are not suitable for the active learning process. Another example of a document type that may often be excluded from the active learning process is an Excel spreadsheet containing mostly numbers. Documents excluded from the Continuous Active Learning process must be manually reviewed separately.
- Document Family. A “document family” is a group of associated files (e.g., an email and its attachments belong to the same document family). In lawsuits and regulatory investigations, a party may often be requested to produce documents on a family basis. Accordingly, linear document review is often conducted on a family basis. However, while using Continuous Active Learning, the relevance score is assigned on an individual document basis. While a reviewer may access and review the family members of a document on the review platform, the efficiency of the review may be compromised in this way. While some Continuous Active Learning tools can present entire families in the review queue, the AI’s relevance prediction is still being made on a document-by-document basis.
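One common way to reconcile document-level scores with family-based review is to treat a family as potentially relevant if any of its members clears the cutoff. The Python sketch below illustrates this rule; the family structure, scores and cutoff are invented for illustration:

```python
# Sketch of rolling document-level relevance scores up to family level:
# a family is queued for review if any member's score clears the cutoff,
# even though the model scores each document individually.
# All data below is invented.

CUTOFF = 0.6

families = {
    "email-123": {"email-123": 0.85, "attachment-123a": 0.10},
    "email-456": {"email-456": 0.30, "attachment-456a": 0.20},
    "email-789": {"email-789": 0.15, "attachment-789a": 0.70},
}

to_review = [fam for fam, members in families.items()
             if max(members.values()) >= CUTOFF]
```

Note that "email-789" is pulled in because its attachment scores highly even though the email itself does not, which is exactly the family-level behaviour a document-by-document queue would miss.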
TAR and other advanced technologies offer possible solutions when dealing with the ever-increasing volume of electronic data. The decision of whether to use Continuous Active Learning depends on many factors, such as whether the data involved are suitable and how Continuous Active Learning should be utilised to meet review needs. Such a decision requires insight into the technology and experience with actual cases. Trained professionals can assist law firms and in-house departments in designing a predictive coding workflow strategy and utilising the technology – and, crucially, retraining the human lawyers who work alongside it – to achieve best practice.