Abstract: We investigate whether in-context examples, widely used in decoder-only language models (LLMs), can improve embedding model performance in retrieval tasks. Unlike in LLMs, naively prepending in-context examples (query-document pairs) to the target query at inference time does not work out of the box. We introduce a simple approach to enable retrievers to use in-context examples. Our approach, RARe, finetunes a pre-trained model with in-context examples whose query is semantically similar to the target query. This can be applied to adapt various base architectures (i.e., decoder-only language models, retriever models) and consistently achieves performance gains of up to +2.72% nDCG across various open-domain retrieval datasets (BeIR, RAR-b). In particular, we find RARe exhibits stronger out-of-domain generalization compared to models using queries without in-context examples, similar to what is seen for in-context learning in LLMs. We further provide analysis on the design choices of in-context example augmentation and lay the foundation for future work in this space.

What problem does this paper attempt to address?

This paper addresses how to use in-context examples to improve the performance of retrieval models. Specifically, the authors study how in-context examples (query-document pairs) whose queries are semantically similar to the target query can improve embedding models in retrieval tasks. Unlike in decoder-only language models (LLMs), simply prepending in-context examples to the target query at inference time does not improve performance. The authors therefore propose RARe (Retrieval Augmented Retrieval with In-Context Examples), which fine-tunes pre-trained models so that they can make use of these examples.

### Main Contributions

1. **Introduction of RARe**: A method for adapting pre-trained models so that they can use in-context examples in retrieval tasks.
2. **Performance Improvement Across Multiple Base Architectures**: The method can be applied to various base architectures (such as decoder-only language models and existing retriever models) and yields consistent performance gains across multiple tasks.
3. **Detailed Analysis**: An analysis of how the quality, quantity, and selection of in-context examples affect performance, explaining where the experimental gains come from.

### Method Overview

- **Query Augmentation**: A sparse retriever such as BM25 finds in-context examples whose queries are similar to the target query, and these examples are prepended to the original query.
- **Fine-Tuning**: The model is fine-tuned with a contrastive loss so that it learns to make use of the in-context examples.

### Experimental Setup

- **Benchmark Datasets**: Widely used retrieval benchmarks, BeIR and RAR-b, are used for evaluation.
- **Baseline Models**: Strong baselines are included, among them SFR-Embedding-2-R, LLM2Vec-Llama-3-8B-Supervised, and E5-Mistral-7B-Instruct.
- **Evaluation Metrics**: nDCG@10 is the main metric for measuring the quality of retrieval results.

### Experimental Results

- **Inference-Time Modification**: Adding in-context examples only at inference time, without fine-tuning, leads to a performance decline.
- **Training from LLM Checkpoints**: When training starts from LLM checkpoints, RARe improves performance on multiple benchmark datasets. On the RAR-b benchmark in particular, the absolute gain reaches +2.72%.
- **Continuing Training from Retriever Checkpoints**: When training continues from existing retriever models, RARe also performs well on most tasks. On out-of-domain tasks in particular, performance improves by 1.95% over the instruction-only baseline.

### Discussion and Analysis

- **Selection of In-Context Examples**: Retrieved in-context examples are more effective than randomly selected ones.
- **Relevance of In-Context Examples**: The gains from RARe are largest when the in-context examples are highly relevant to the target query.
- **Quantity of In-Context Examples**: Increasing the number of in-context examples usually improves performance, but the optimal number varies by dataset.
- **Efficiency Analysis**: Adding in-context examples increases the latency of the retrieval pipeline, but on large-scale datasets the impact of this added latency is relatively small.

In conclusion, by proposing RARe, the paper shows how in-context examples can be used in retrieval tasks to improve model performance and lays the foundation for future research.
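The query augmentation step described in the Method Overview can be illustrated with a small sketch. It assumes a pool of (query, document) training pairs, scores the pool queries against the target query with a minimal hand-written Okapi BM25, and prepends the top-k pairs to the target query. The prompt layout (`Instruct:`, `Query:`, `Document:`), the function names, and the toy data are hypothetical choices for illustration, not the authors' exact format or code.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over a small list of texts."""

    def __init__(self, texts, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(t)) for t in texts]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in d)
        # Non-negative IDF variant, so common terms never get a negative weight.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, idx):
        doc, length = self.docs[idx], self.lengths[idx]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def augment_query(query, pool, k=2, instruction="Retrieve a relevant passage."):
    """Prepend the k pool examples whose queries best match `query`.

    `pool` is a list of (example_query, example_document) pairs.
    The prompt layout below is an assumed format, not the paper's.
    """
    bm25 = BM25([q for q, _ in pool])
    parts = [f"Instruct: {instruction}"]
    for i in bm25.top_k(query, k):
        ex_query, ex_doc = pool[i]
        parts.append(f"Query: {ex_query}\nDocument: {ex_doc}")
    parts.append(f"Query: {query}")
    return "\n\n".join(parts)


# Toy pool of in-context candidates (made-up data).
pool = [
    ("what causes ocean tides", "Tides are caused by the gravitational pull of the moon and sun."),
    ("how do vaccines work", "Vaccines train the immune system to recognize pathogens."),
    ("why is the sky blue", "Rayleigh scattering makes shorter wavelengths scatter more."),
]
augmented = augment_query("what causes high tides", pool, k=1)
print(augmented)
```

In this sketch the augmented string is what would be passed to the embedding model in place of the bare query, both during fine-tuning and at inference time; documents are embedded unchanged.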
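For readers unfamiliar with the nDCG@10 metric used in the evaluation, the following is a minimal per-query sketch. It assumes the linear-gain formulation (gain equal to the relevance grade, discounted by log2 of the rank plus one); some toolkits use an exponential gain instead, so absolute values may differ from those a given benchmark harness reports. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for a single query.

    ranked_doc_ids: doc ids in the order returned by the retriever.
    relevance: dict mapping doc id -> graded relevance (missing ids count as 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A benchmark score is then the mean of `ndcg_at_k` over all queries in a dataset, which is how a figure such as the +2.72% gain would be read: a difference between two such averages.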

RARe: Retrieval Augmented Retrieval with In-Context Examples
