An Empirical Study on Transfer Learning for Privilege Review

Haozhen Zhao, Shi Ye, Jingchao Yang
DOI: https://doi.org/10.48550/arXiv.2112.08606
2021-12-16
Abstract: Protecting privileged communications and data from inadvertent disclosure is a paramount task in US legal practice. Traditionally, counsel rely on keyword searching and manual review to identify privileged documents in cases. As data volumes increase, this approach becomes less and less defensible in terms of cost. Machine learning methods have been used to identify privileged documents. Given the generalizable nature of privilege in legal cases, we hypothesize that transfer learning can capitalize on knowledge learned from existing labeled data to identify privileged documents without requiring new labeled training data. In this paper, we study both traditional machine learning models and deep learning models based on BERT for privileged document classification tasks in legal document review, and we examine the effectiveness of transfer learning for privilege models on three real-world datasets with privilege labels. Our results show that the BERT model outperforms the industry-standard logistic regression algorithm and that transfer learning models can achieve decent performance on datasets in the same or close domains.
Information Retrieval
What problem does this paper attempt to address?
The paper attempts to address the issue of how to utilize transfer learning to improve the efficiency and accuracy of identifying privileged documents during the legal document review process. Traditionally, these documents are identified through keyword searches and manual review, but as the volume of data increases, this method becomes increasingly impractical in terms of cost. Machine learning methods have been used to identify privileged documents, but they typically require a large amount of labeled data. This paper hypothesizes that transfer learning can extract knowledge from existing labeled data and apply it to new datasets, thereby reducing the need for new training data.

Specifically, the research objectives of the paper include:

1. Comparing the performance of traditional machine learning methods (such as logistic regression) with BERT-based deep learning models in identifying privileged documents.
2. Investigating the effectiveness of pre-trained machine learning models in predicting privileged documents in a zero-shot setting (i.e., the model is trained on a completely different dataset).

Through these studies, the paper aims to explore the potential of transfer learning in legal document review, particularly in terms of reducing the need for labeled data and improving model generalization capabilities.
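
The zero-shot transfer setup described above can be illustrated with a minimal sketch: a classifier is fit on labeled documents from one matter (the source) and then scored, without any further training, on documents from a different matter (the target). This is not the authors' pipeline. The bag-of-words features, the hand-rolled logistic regression, the hyperparameters, and all example documents and labels below are hypothetical and chosen only to keep the example dependency-free; a real study would likely use a library implementation and a BERT model for comparison.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words term counts over lowercase whitespace tokens (an assumed feature choice)."""
    return Counter(text.lower().split())


def sigmoid(z):
    """Logistic function, with z clamped to avoid overflow in math.exp."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, feats):
    """Linear score: bias plus the weighted sum of term counts."""
    return bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())


def train_logreg(docs, labels, epochs=200, lr=0.5):
    """Fit a binary logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    feats = [featurize(d) for d in docs]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            err = sigmoid(score(weights, bias, x)) - y  # gradient of log loss w.r.t. the score
            bias -= lr * err
            for tok, cnt in x.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * cnt
    return weights, bias


def predict(model, text):
    """Return 1 (privileged) if the predicted probability exceeds 0.5, else 0."""
    weights, bias = model
    return 1 if score(weights, bias, featurize(text)) > 0 else 0


# Hypothetical source matter: labeled documents used for training (1 = privileged).
source_docs = [
    "attorney client privileged legal advice regarding the merger",
    "counsel advises on litigation risk please keep confidential",
    "lunch menu for the team offsite on friday",
    "quarterly sales figures attached for review",
]
source_labels = [1, 1, 0, 0]

# Hypothetical target matter: never seen during training (the zero-shot setting).
target_docs = [
    "privileged legal advice from our attorney about the contract",
    "team lunch menu for friday",
]
target_labels = [1, 0]

model = train_logreg(source_docs, source_labels)
predictions = [predict(model, d) for d in target_docs]
accuracy = sum(p == y for p, y in zip(predictions, target_labels)) / len(target_labels)
print("zero-shot predictions:", predictions, "accuracy:", accuracy)
```

The point of the sketch is the evaluation protocol rather than the model: nothing from the target matter is used for fitting, so the score on `target_docs` reflects only what carries over from the source matter. Tokens that never appear in the source data contribute nothing to the score, which is one reason transfer would be expected to degrade as the domains drift apart.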