The Power of Noise: Redefining Retrieval for RAG Systems

Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, Fabrizio Silvestri
DOI: https://doi.org/10.48550/arXiv.2401.14887
2024-01-26
Information Retrieval
Abstract: Retrieval-Augmented Generation (RAG) systems represent a significant advancement over traditional Large Language Models (LLMs). RAG systems enhance their generation ability by incorporating external data retrieved through an Information Retrieval (IR) phase, overcoming the limitations of standard LLMs, which are restricted to their pre-trained knowledge and limited context window. Most research in this area has predominantly concentrated on the generative aspect of LLMs within RAG systems. Our study fills this gap by thoroughly and critically analyzing the influence of IR components on RAG systems. This paper analyzes which characteristics a retriever should possess for an effective RAG's prompt formulation, focusing on the type of documents that should be retrieved. We evaluate various elements, such as the relevance of the documents to the prompt, their position, and the number included in the context. Our findings reveal, among other insights, that including irrelevant documents can unexpectedly enhance performance by more than 30% in accuracy, contradicting our initial assumption of diminished quality. These results underscore the need for developing specialized strategies to integrate retrieval with language generation models, thereby laying the groundwork for future research in this field.
What problem does this paper attempt to address?
The paper addresses how the information retrieval (IR) component affects the performance of retrieval-augmented generation (RAG) systems. Specifically, it asks which characteristics a retriever should have in order to build effective RAG prompts. The authors compare relevant, related, and irrelevant documents, and also vary where documents are placed in the prompt and how many are included in the context. Their most surprising finding is that including irrelevant documents can improve accuracy by more than 30%, contradicting their initial assumption that such documents would degrade output quality. The paper therefore argues for retrieval strategies designed specifically to work with language generation models, rather than strategies borrowed unchanged from conventional retrieval.
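The factors named above (document relevance, position, and count) can be made concrete with a small prompt-assembly sketch. This is an illustration under stated assumptions, not the authors' code: the function name `build_prompt`, the `"far"`/`"mid"`/`"near"` position labels, and the `Document [i]:` prompt layout are all hypothetical choices made for this example.

```python
import random
from typing import List


def build_prompt(query: str, gold: str, noise_pool: List[str],
                 n_noise: int, gold_position: str = "near",
                 seed: int = 0) -> str:
    """Assemble a RAG-style prompt from one relevant document plus noise.

    Hypothetical helper for illustration only.
    gold_position: "far" places the relevant document first (farthest from
    the question), "near" places it last (adjacent to the question), and
    "mid" places it in the middle of the noise documents.
    """
    rng = random.Random(seed)               # seeded for reproducible sampling
    noise = rng.sample(noise_pool, n_noise)  # irrelevant documents, no repeats

    if gold_position == "far":
        docs = [gold] + noise
    elif gold_position == "near":
        docs = noise + [gold]
    elif gold_position == "mid":
        half = n_noise // 2
        docs = noise[:half] + [gold] + noise[half:]
    else:
        raise ValueError(f"unknown gold_position: {gold_position!r}")

    # Number the documents in prompt order, then append the question.
    context = "\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return f"{context}\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    pool = ["Unrelated text A.", "Unrelated text B.", "Unrelated text C."]
    print(build_prompt("Who wrote Hamlet?",
                       "Hamlet is a tragedy by William Shakespeare.",
                       pool, n_noise=2, gold_position="near"))
```

Sweeping `n_noise` and `gold_position` over a question-answering dataset, and scoring the generated answers, would give the kind of comparison the summary describes; the sampling scheme and prompt template here are placeholders and would need to match whatever setup is actually being evaluated.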