ICLERB: In-Context Learning Embedding and Reranker Benchmark

Marie Al Ghossein, Emile Contal, Alexandre Robicquet
2024-11-28
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new tasks by conditioning on prompts with relevant information. Retrieval-Augmented Generation (RAG) enhances ICL by incorporating retrieved documents into the LLM's context at query time. However, traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem. In this paper, we propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks. We introduce the In-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel evaluation framework that compares retrievers based on their ability to enhance LLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement Learning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune retrieval models using minimal feedback from the LLM. Our experimental results reveal notable differences between ICLERB and existing benchmarks, and demonstrate that small models fine-tuned with our RLRAIF algorithm outperform large state-of-the-art retrieval models. These findings highlight the limitations of existing evaluation methods and the need for specialized benchmarks and training strategies adapted to ICL.
Machine Learning, Information Retrieval
### What problem does this paper attempt to address?

This paper addresses the limitations of existing retrieval methods in In-Context Learning (ICL). The authors argue that traditional retrieval methods focus on semantic similarity and overlook how much the retrieved documents actually improve the performance of large language models (LLMs). The paper identifies the following key issues:

1. **Limitations of traditional retrieval methods**:
   - Existing retrieval methods treat retrieval as a search problem: they look for documents that are semantically similar to the query rather than documents that maximize utility in ICL tasks.
   - These methods are evaluated on traditional benchmarks such as BEIR and MTEB, which mainly measure semantic similarity and do not fully reflect the actual impact of retrieved documents on LLM performance.
2. **Lack of evaluation benchmarks specifically for ICL**:
   - There are currently no benchmarks designed to evaluate retrievers in ICL scenarios, which limits researchers' ability to compare retrieval methods effectively.
3. **Challenges in optimizing retrieval models to improve ICL performance**:
   - Optimizing retrieval models for LLM performance in ICL tasks requires datasets that reflect document utility, and generating such datasets is computationally very expensive.
   - Evaluating every possible query-document pair is infeasible because the number of combinations is huge, and simple subsampling is not sufficient to solve the problem.

To solve these problems, the paper proposes a new evaluation framework, the **In-Context Learning Embedding and Reranker Benchmark (ICLERB)**, and a reinforcement-learning-based ranking optimization algorithm, **Reinforcement Learning-to-Rank from AI Feedback (RLRAIF)**. The specific contributions are as follows:

1. **Novel evaluation method**:
   - Retrievers are evaluated by how much the retrieved documents improve LLM performance, instead of by traditional semantic similarity.
2. **ICLERB benchmark**:
   - A new benchmark framework, ICLERB, evaluates embedding models and rerankers in ICL tasks, covering multiple datasets and LLMs.
3. **RLRAIF method**:
   - A new algorithm, RLRAIF, directly optimizes retrieval models using feedback from LLMs, achieving better retrieval performance with a minimal LLM query budget.

Through these contributions, the paper addresses the shortcomings of existing retrieval methods in ICL and provides a standardized evaluation platform to support further research in this field.
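
To make the difference between similarity-based and utility-based evaluation concrete, here is a minimal sketch. It assumes that each document has a utility score (for example, the LLM's measured accuracy when that document is placed in the context) and that a retriever's ranking is scored with nDCG using those utilities as gains. The choice of nDCG, the document IDs, and all numeric values are illustrative assumptions, not details confirmed by the text above.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a list of gains, in rank order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, utility, k):
    """nDCG@k where the gain of a document is its (assumed) ICL utility.

    ranked_doc_ids: documents in the order the retriever returned them.
    utility: hypothetical mapping doc_id -> utility measured with an LLM.
    """
    gains = [utility.get(doc_id, 0.0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(utility.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical utilities: doc_b is the most similar to the query
# but helps the LLM the least once it is in the prompt.
utility = {"doc_a": 0.9, "doc_b": 0.1, "doc_c": 0.6}

semantic_ranking = ["doc_b", "doc_c", "doc_a"]  # ordered by similarity (assumed)
utility_ranking = ["doc_a", "doc_c", "doc_b"]   # ordered by usefulness to the LLM

semantic_score = ndcg_at_k(semantic_ranking, utility, k=3)
utility_score = ndcg_at_k(utility_ranking, utility, k=3)
print(f"similarity-ordered: {semantic_score:.3f}")
print(f"utility-ordered:    {utility_score:.3f}")
```

In this toy setup the utility-ordered ranking scores higher than the similarity-ordered one, which is the kind of gap a utility-based benchmark is meant to expose and a similarity-based one would miss.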