Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Bowen Jin, Jinsung Yoon, Jiawei Han, Sercan O. Arik
2024-10-08
Abstract: Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external knowledge sources. The increasing capacity of LLMs to process longer input sequences opens up avenues for providing more retrieved information, to potentially enhance the quality of generated outputs. It is plausible to assume that a larger retrieval set would contain more relevant information (higher recall), which might result in improved performance. However, our empirical findings demonstrate that for many long-context LLMs, the quality of generated output initially improves but then declines as the number of retrieved passages increases. This paper investigates this phenomenon, identifying the detrimental impact of retrieved "hard negatives" as a key contributor. To mitigate this and enhance the robustness of long-context LLM-based RAG, we propose both training-free and training-based approaches. We first showcase the effectiveness of retrieval reordering as a simple yet powerful training-free optimization. Furthermore, we explore training-based methods, specifically RAG-specific implicit LLM fine-tuning and RAG-oriented fine-tuning with intermediate reasoning, demonstrating their capacity for substantial performance gains. Finally, we conduct a systematic analysis of design choices for these training-based methods, including data distribution, retriever selection, and training context length.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses the challenges that long-context large language models (LLMs) face in retrieval-augmented generation (RAG) systems. Specifically, the study finds that as the number of retrieved passages increases, the generation performance of long-context LLMs initially improves but then declines. The main cause is that retrieved "hard negatives" degrade the model's output. The paper therefore proposes several methods to mitigate this issue and improve the robustness and performance of long-context LLMs in RAG applications.

### Main Research Questions

1. **Impact of Increasing Retrieval Context Size on RAG Performance**
   - Research Question: Can long-context LLMs in RAG systems keep improving as the number of retrieved passages increases?
   - Experimental Setup: Systematically vary the number of retrieved passages using different retrievers (e.g., e5 and BM25) and four long-context LLMs (Gemma-7B-Chat, Gemma-2-9B-Chat, Mistral-Nemo-12B-Instruct, and Gemini-1.5-Pro).
   - Observations: Performance initially improves as more passages are retrieved, then declines. The drop is more pronounced with high-recall retrievers (e.g., e5).
2. **Interaction Between Retrieval Quality and LLM Capability**
   - Research Question: Is the performance bottleneck due to the retriever's limited ability to identify relevant information, or to the long-context LLM's limited ability to process the retrieved information?
   - Experimental Setup: Analyze the relationship between RAG performance and retrieval quality (recall and precision) using the Gemma-2-9B-Chat LLM and two retrievers (e5 and BM25).
   - Observations: Increasing the number of retrieved passages raises recall but lowers precision, and the overall accuracy of the RAG system stays below the recall rate. In other words, the LLM sometimes fails even when a relevant passage is in its context, which indicates that irrelevant passages hurt its performance.
3. **Importance of Hard Negatives in Evaluating Long-Context LLMs**
   - Research Question: How robust are current long-context LLMs to hard negatives? How do hard negatives from different retrievers affect performance?
   - Experimental Setup: In controlled experiments, use three LLMs (Gemma2-7B-Chat, Mistral-Nemo-12B-Instruct, and Gemini-1.5-Pro) to answer queries based on a golden passage containing the correct answer plus a varying number of hard negative passages (retrieved by different methods).
   - Observations: Increasing the number of hard negative passages generally lowers the accuracy of RAG answers. Hard negatives from stronger retrievers (e.g., e5) hurt LLM accuracy more.

### Proposed Methods

1. **Retrieval Reordering**
   - Exploit the "lost-in-the-middle" phenomenon of LLMs: reorder retrieved passages by relevance score so that high-scoring passages sit at the beginning and end of the input sequence, leaving hard negatives in the middle positions, where they affect performance least.
2. **Implicit Robustness Fine-Tuning**
   - Enhance LLMs' robustness to hard negatives by exposing them to diverse retrieval contexts during fine-tuning, enabling them to identify and use relevant information effectively.
3. **Explicit Relevance Fine-Tuning**
   - Introduce intermediate reasoning steps during fine-tuning to train LLMs to analyze retrieved passages and explicitly identify relevant information, improving their ability to distinguish relevant information from noise.

### Main Contributions

- **Systematic Analysis of Long-Context RAG**: Systematically analyzed the use of long-context LLMs in RAG systems, particularly the impact of retrieved "hard negatives" on performance.
- **Novel Robust RAG Methods**: Proposed three methods to enhance the robustness of long-context LLMs in RAG: retrieval reordering, implicit robustness fine-tuning, and explicit relevance fine-tuning.
- **Comprehensive Study of RAG-Specific LLMs**: Conducted a comprehensive study of LLMs specific to RAG applications, providing insights into their performance and robustness.
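
The retrieval reordering idea from the Proposed Methods section can be illustrated with a short sketch. The summary states only that high-scoring passages go at the beginning and end of the input. The alternating split used here is one way to achieve that and is an assumption, as are the function name `reorder_for_long_context` and the `(text, score)` input format.

```python
from typing import List, Sequence, Tuple


def reorder_for_long_context(
    scored_passages: Sequence[Tuple[str, float]],
) -> List[str]:
    """Hypothetical helper: put the most relevant passages at both ends.

    Assumption: passages alternate between the front and the back of the
    context by rank, so the lowest-scoring ones end up in the middle.
    """
    # Rank passages from most to least relevant.
    ranked = sorted(scored_passages, key=lambda pair: pair[1], reverse=True)

    front: List[str] = []
    back: List[str] = []
    for rank, (text, _score) in enumerate(ranked):
        # Ranks 0, 2, 4, ... fill the front; ranks 1, 3, 5, ... fill the back.
        (front if rank % 2 == 0 else back).append(text)

    # Reverse the back half so the second-best passage is placed last.
    return front + back[::-1]


if __name__ == "__main__":
    passages = [
        ("p_low", 0.1),
        ("p_best", 0.9),
        ("p_mid", 0.5),
        ("p_second", 0.8),
        ("p_weak", 0.3),
    ]
    print(reorder_for_long_context(passages))
```

With the five example passages, the best-scoring passage comes first, the second-best comes last, and the weakest ones fall in the middle of the prompt. The step is training-free: it only changes the order in which passages are concatenated before they are passed to the LLM.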