Contextualization with SPLADE for High Recall Retrieval

Eugene Yang
DOI: https://doi.org/10.1145/3626772.3657919
2024-05-07
Abstract: High Recall Retrieval (HRR), such as eDiscovery and medical systematic review, is a search problem that optimizes the cost of retrieving most relevant documents in a given collection. Iterative approaches, such as iterative relevance feedback and uncertainty sampling, have been shown to be effective under various operational scenarios. Although neural models have succeeded in other text-related tasks, linear models such as logistic regression are, in general, still more effective and efficient in HRR, since the model is trained on and retrieves documents from the same fixed collection. In this work, we leverage SPLADE, an efficient retrieval model that transforms documents into contextualized sparse vectors, for HRR. Our approach combines the best of both worlds, leveraging both the contextualization from pretrained language models and the efficiency of linear models. It reduces the review cost by 10% and 18% in two HRR evaluation collections under a one-phase review workflow with a target recall of 80%. The experiment is implemented with TARexp and is available at
Information Retrieval
What problem does this paper attempt to address?
The paper explores how to bring the contextualization of Pretrained Language Models (PLMs) to High Recall Retrieval (HRR) tasks without giving up the efficiency of linear models. Specifically, the authors use SPLADE, an efficient sparse retrieval model, to convert documents into contextualized sparse vectors, and then feed these vectors as features to a linear classifier.

The core contributions of the paper are:
1. A sparse classification model for HRR tasks that combines the contextual understanding of pretrained language models with the efficiency of linear models.
2. Comprehensive experiments under two review workflows (one-phase and two-phase) that test the effectiveness of the proposed method.
3. Ablation studies that analyze how the choice of pretrained language model affects the final results.

Experimental results show that on two HRR evaluation collections (RCV1-v2 and Jeb Bush), combining the contextual features generated by SPLADE with traditional BM25 features can significantly reduce the total review cost compared to the BM25 baseline, with a maximum reduction of about 27%. The combined method also performs well across categories of varying difficulty and generality.

Overall, the paper aims to improve existing HRR techniques by combining the expressive power of pretrained language models with the efficiency of linear models, particularly in applications that require high recall, such as legal document review.
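
The idea summarized above (sparse SPLADE and BM25 features fed to a linear model, judged by how many documents must be reviewed to reach a recall target) can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the term weights are hand-written toy values rather than real SPLADE or BM25 output, the features are combined by simple concatenation with namespaced keys, and the hypothetical helpers `combine`, `train_logreg`, `score`, and `review_cost` are not the paper's or TARexp's API. The logistic regression is a plain gradient-descent version, not the implementation used in the experiments.

```python
import math

def combine(bm25_vec, splade_vec):
    """Concatenate two sparse vectors by namespacing their keys (an assumed scheme)."""
    combined = {f"bm25:{t}": w for t, w in bm25_vec.items()}
    combined.update({f"splade:{t}": w for t, w in splade_vec.items()})
    return combined

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, x):
    """Linear score of a sparse vector stored as {feature: value}."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in x.items())

def train_logreg(examples, labels, epochs=200, lr=0.5, l2=0.01):
    """Stochastic gradient descent for logistic regression on sparse dict vectors.

    The L2 penalty is applied only to the features active in each example.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            g = sigmoid(score(weights, bias, x)) - y  # gradient of the log loss w.r.t. the score
            for f, v in x.items():
                w = weights.get(f, 0.0)
                weights[f] = w - lr * (g * v + l2 * w)
            bias -= lr * g
    return weights, bias

def review_cost(ranked_labels, target_recall=0.8):
    """Documents reviewed, in ranked order, until the recall target is met."""
    total = sum(ranked_labels)
    if total == 0:
        return 0
    needed = math.ceil(target_recall * total)
    found = 0
    for n, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return n
    return len(ranked_labels)

# Toy labeled documents: (BM25 vector, SPLADE-style vector, relevance label).
labeled = [
    ({"merger": 1.2}, {"merger": 0.9, "acquisition": 0.7}, 1),
    ({"acquisition": 1.1}, {"acquisition": 0.8, "merger": 0.5}, 1),
    ({"weather": 1.3}, {"weather": 1.0, "rain": 0.6}, 0),
    ({"football": 1.0}, {"football": 0.9, "sport": 0.5}, 0),
]
features = [combine(b, s) for b, s, _ in labeled]
labels = [y for _, _, y in labeled]
weights, bias = train_logreg(features, labels)

# "takeover" never appears in training, but the assumed SPLADE expansion
# adds "acquisition", which the classifier has learned to weight positively.
doc_a = combine({"takeover": 1.0}, {"takeover": 0.8, "acquisition": 0.6})
doc_b = combine({"rain": 1.0}, {"rain": 0.9, "weather": 0.4})
score_a = score(weights, bias, doc_a)
score_b = score(weights, bias, doc_b)

# Review cost of a hypothetical ranking (1 = relevant) at 80% recall.
cost = review_cost([1, 1, 0, 1, 0, 1, 1, 0, 0, 0], target_recall=0.8)
```

In this toy setup, `doc_a` shares no BM25 term with the training documents, so only the SPLADE-style expansion terms let the linear model separate it from `doc_b`. `review_cost` counts reviewed documents only, a simplification of the cost measures used in HRR evaluation.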