Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new tasks by conditioning on prompts with relevant information. Retrieval-Augmented Generation (RAG) enhances ICL by incorporating retrieved documents into the LLM's context at query time. However, traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem. In this paper, we propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks. We introduce the In-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel evaluation framework that compares retrievers based on their ability to enhance LLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement Learning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune retrieval models using minimal feedback from the LLM. Our experimental results reveal notable differences between ICLERB and existing benchmarks, and demonstrate that small models fine-tuned with our RLRAIF algorithm outperform large state-of-the-art retrieval models. These findings highlight the limitations of existing evaluation methods and the need for specialized benchmarks and training strategies adapted to ICL.

What problem does this paper attempt to address?

This paper addresses the limitations of existing retrieval methods for In-Context Learning (ICL). The authors argue that traditional retrieval methods focus on semantic similarity and overlook how much the retrieved documents actually improve the performance of large language models (LLMs). The paper identifies three key problems:

1. **Limitations of traditional retrieval methods**:
   - Existing retrieval methods treat retrieval as a search problem: they look for documents that are semantically similar to the query rather than documents that maximize utility in ICL tasks.
   - These methods rely on traditional evaluation benchmarks such as BEIR and MTEB, which mainly measure semantic similarity and do not fully reflect the actual impact of retrieved documents on LLM performance.
2. **Lack of evaluation benchmarks specifically for ICL**:
   - There are currently no benchmarks designed specifically to evaluate retrievers in ICL scenarios, which limits researchers' ability to compare retrieval methods effectively.
3. **Challenges in optimizing retrieval models to improve ICL performance**:
   - Optimizing retrieval models for LLM performance on ICL tasks requires datasets that reflect document utility, and generating such datasets is computationally very expensive.
   - Evaluating every possible query-document pair is infeasible because the number of combinations is huge, and simple subsampling is not sufficient to solve the problem.

To solve these problems, the paper proposes a new evaluation framework, the **In-Context Learning Embedding and Reranker Benchmark (ICLERB)**, and a reinforcement-learning-based ranking optimization algorithm, **Reinforcement Learning-to-Rank from AI Feedback (RLRAIF)**. The specific contributions are as follows:

1. **Novel evaluation method**:
   - Retrievers are evaluated by how much the retrieved documents improve LLM performance, instead of by traditional semantic similarity.
2. **ICLERB benchmark**:
   - A new benchmark framework, ICLERB, evaluates embedding models and rerankers on ICL tasks, covering multiple datasets and LLMs.
3. **RLRAIF method**:
   - A new algorithm, RLRAIF, directly optimizes retrieval models using feedback from LLMs, achieving better retrieval performance with a minimal LLM query budget.

Through these contributions, the paper addresses the shortcomings of existing retrieval methods in ICL and provides a standardized evaluation platform to support further research in this field.
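
To make the contrast between semantic relevance and utility concrete, here is a minimal Python sketch of one way a retriever's ranking could be scored against utility labels. It is an illustration under stated assumptions, not the paper's actual metric or code: the nDCG@k formulation, the function names (`dcg_at_k`, `utility_ndcg_at_k`), the document IDs, and the binary utility values (1.0 if a hypothetical LLM answered correctly with the document in context, 0.0 otherwise) are all invented for this example.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def utility_ndcg_at_k(ranking: List[str], utility: Dict[str, float], k: int) -> float:
    """nDCG@k where each document's gain is its measured ICL utility.

    `utility` maps a document ID to a non-negative gain; documents missing
    from the map are treated as having zero utility.
    """
    gains = [utility.get(doc_id, 0.0) for doc_id in ranking]
    ideal_dcg = dcg_at_k(sorted(utility.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical utility labels for one query (assumption: binary LLM correctness).
utility = {"d1": 0.0, "d2": 1.0, "d3": 0.0, "d4": 1.0}

# Two hypothetical rankings of the same candidate documents.
semantic_ranking = ["d1", "d3", "d2", "d4"]       # ordered by semantic similarity
utility_aware_ranking = ["d2", "d4", "d1", "d3"]  # ordered by usefulness to the LLM

semantic_score = utility_ndcg_at_k(semantic_ranking, utility, k=2)
utility_score = utility_ndcg_at_k(utility_aware_ranking, utility, k=2)
print(f"semantic retriever nDCG@2: {semantic_score:.3f}")
print(f"utility-aware retriever nDCG@2: {utility_score:.3f}")
```

In this toy setup, the retriever that ranks semantically similar but unhelpful documents first scores lower than the utility-aware one, even though a relevance-based benchmark could rank them the other way around. Filling in the `utility` table requires running the LLM for each query-document pair, which is the cost that makes exhaustive labeling impractical.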

ICLERB: In-Context Learning Embedding and Reranker Benchmark

"In-Context Learning" or: How I learned to stop worrying and love "Applied Information Retrieval"

In-Context Learning with Reinforcement Learning for Incomplete Utterance Rewriting

Learning to Retrieve Iteratively for In-Context Learning

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

RAR-b: Reasoning as Retrieval Benchmark

IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios

Adapting Interactional Observation Embedding for Counterfactual Learning to Rank

Improving Bilingual Lexicon Induction with Cross-Encoder Reranking

Improving In-Context Learning with Small Language Model Ensembles

Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction

ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation

Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers

Instructive Code Retriever: Learn from Large Language Model's Feedback for Code Intelligence Tasks

IERL: Interpretable Ensemble Representation Learning -- Combining CrowdSourced Knowledge and Distributed Semantic Representations

RLRF4Rec: Reinforcement Learning from Recsys Feedback for Enhanced Recommendation Reranking

CoRAL: Collaborative Retrieval-Augmented Large Language Models Improve Long-tail Recommendation

In-Context Retrieval-Augmented Language Models

Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning

Retrieval or Global Context Understanding? On Many-Shot In-Context Learning for Long-Context Evaluation

RAGSys: Item-Cold-Start Recommender as RAG System