ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs

Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, Rodrigo Nogueira
2024-02-10
Abstract: ExaRanker recently introduced an approach to training information retrieval (IR) models, incorporating natural language explanations as additional labels. The method addresses the challenge of limited labeled examples, leading to improvements in the effectiveness of IR models. However, the initial results were based on proprietary language models such as GPT-3.5, which posed constraints on dataset size due to their cost and data privacy concerns. In this paper, we introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations. The method has been tested using different LLMs and dataset sizes to better comprehend the effective contribution of data augmentation. Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases. Notably, the data augmentation method proves advantageous even with large datasets, as evidenced by ExaRanker surpassing the target baseline by 0.6 nDCG@10 points in our study. To encourage further advancements by the research community, we have open-sourced both the code and datasets at https://github.com/unicamp-dl/ExaRanker.
Artificial Intelligence, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
The paper addresses the scarcity of labeled training data in Information Retrieval (IR). Specifically, the authors augment training datasets with natural language explanations generated by open-source large language models (LLMs). This removes the cost and data privacy limitations of the earlier ExaRanker work, which relied on proprietary language models such as GPT-3.5, and lets the authors test whether adding explanations keeps improving neural rankers across different dataset sizes.

The main contributions of the paper include:

1. **Introduction of ExaRanker-Open**: An adaptation of ExaRanker that uses open-source language models to generate the natural language explanations added to the training data of IR models.
2. **Validation of data augmentation effectiveness**: Experiments show that adding explanations consistently improves neural rankers, with larger gains as the LLM size increases. The benefit holds even with large datasets, where ExaRanker surpasses the target baseline by 0.6 nDCG@10 points.
3. **Public release of code and datasets**: To encourage further research by the community, the authors have made the code and datasets publicly available.

Through these efforts, the paper demonstrates the potential of using open-source language models for data augmentation in information retrieval, providing new directions and tools for future research.
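
The idea of using explanations as additional labels can be illustrated with a minimal sketch. It assumes a sequence-to-sequence ranker trained on text targets; the `RankingExample` class, the `build_input` and `build_target` helpers, and the exact template strings are hypothetical names chosen for illustration and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankingExample:
    """One query-passage training pair (hypothetical structure)."""
    query: str
    passage: str
    relevant: bool
    # Explanation generated by an LLM; None when no explanation is available.
    explanation: Optional[str] = None


def build_input(ex: RankingExample) -> str:
    # Assumed input template; the real prompt format may differ.
    return f"Query: {ex.query} Passage: {ex.passage} Relevant:"


def build_target(ex: RankingExample) -> str:
    # The relevance label comes first, so a ranker can score a passage
    # from the first generated token alone.
    label = "true" if ex.relevant else "false"
    if ex.explanation:
        # Augmented target: label followed by the explanation text.
        return f"{label}. Explanation: {ex.explanation}"
    # Plain target: label only.
    return label


example = RankingExample(
    query="what causes tides",
    passage="Tides are caused by the gravitational pull of the moon.",
    relevant=True,
    explanation="The passage states the cause of tides directly.",
)
print(build_input(example))
print(build_target(example))
```

Placing the label before the explanation is a design assumption in this sketch: it keeps inference cost unchanged, because the explanation only needs to be generated during training.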