Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu
2024-09-23
Abstract: Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the challenges that arise when large language models (LLMs) are integrated with external data in practical applications. Specifically, it highlights four benefits of integrating external data:

1. **Enhancing Expertise and Timeliness**: LLM training data often lags behind current information and may not cover every field in detail, especially proprietary data owned by users. With external data, LLM applications can give more detailed and accurate answers and can support data updates and customization.
2. **Alignment with Domain Experts**: By learning from and using data from specific fields, LLM applications can exhibit capabilities closer to those of domain experts, such as the professional knowledge of doctors and lawyers.
3. **Reducing Model Hallucinations**: Generating responses from real data grounds the model's output in facts, significantly reducing hallucinations.
4. **Improving Controllability and Interpretability**: External data can serve as a reference for model predictions, making them easier to control and interpret.

Despite these advantages, developers face numerous challenges in practice, including:

- **Building Data Pipelines**: Constructing efficient data pipelines, from data processing to indexing, is a complex task.
- **Utilizing LLMs' Capabilities for Complex Reasoning**: Different fields place different demands on LLMs. For example, the financial sector needs models that understand high-dimensional time series data, while the medical field needs models that handle medical images or time-series medical records.
- **Handling Long-Distance Dependencies**: In legal and mathematical applications, LLMs struggle to grasp long-distance dependencies between different structures.
- **Interpretability and Consistency**: Some fields, such as healthcare and law, place higher requirements on the interpretability and consistency of LLM responses.
To address these challenges, the paper proposes a RAG task classification method that divides user queries into four levels, each corresponding to a different type of external data and task focus:

1. **Explicit Fact Queries**: Directly extracting explicit facts from the given data without additional reasoning.
2. **Implicit Fact Queries**: Inferring facts that are implicit in the data, which requires some common-sense reasoning or basic logical inference.
3. **Interpretable Rationale Queries**: Not only understanding the factual content but also applying domain-specific reasoning logic.
4. **Hidden Rationale Queries**: Inferring unrecorded reasoning chains and logical relationships from external data.

The paper also discusses three forms of integrating external data into LLMs: context, small model, and fine-tuning. Each has its own strengths, limitations, and applicable scenarios.

Ultimately, the paper aims to help readers fully understand the data requirements and key bottlenecks in building LLM applications, provide solutions, and guide the systematic development of these applications.
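As a rough illustration of the taxonomy and of the "context" form of integration, the sketch below represents the four query levels (using the names from the abstract) as an enum and places retrieved passages in the prompt ahead of the question. Everything here is an assumption made for illustration: the word-overlap `retrieve` function stands in for a real retriever, and the names `QueryLevel`, `retrieve`, and `build_context_prompt`, along with the prompt wording, are hypothetical rather than taken from the survey.

```python
from __future__ import annotations

from enum import Enum


class QueryLevel(Enum):
    """The four query levels named in the abstract, numbered in the order listed."""

    EXPLICIT_FACT = 1
    IMPLICIT_FACT = 2
    INTERPRETABLE_RATIONALE = 3
    HIDDEN_RATIONALE = 4


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Toy retriever (an assumption, not the survey's method).

    Ranks passages by how many lowercase, whitespace-separated words
    they share with the query and returns the top k.
    """
    query_words = set(query.lower().split())

    def overlap(passage: str) -> int:
        return len(query_words & set(passage.lower().split()))

    return sorted(passages, key=overlap, reverse=True)[:k]


def build_context_prompt(query: str, level: QueryLevel, passages: list[str]) -> str:
    """Place the retrieved passages in the prompt ahead of the question."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"Query level: {level.name}\n"
        f"Context:\n{numbered}\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )


if __name__ == "__main__":
    # Hypothetical corpus and question, invented for this example.
    corpus = [
        "The warranty period for model X is two years.",
        "Model X ships with a USB-C cable.",
        "Our office is closed on public holidays.",
    ]
    question = "What is the warranty period for model X?"
    top = retrieve(question, corpus, k=1)
    print(build_context_prompt(question, QueryLevel.EXPLICIT_FACT, top))
```

In this sketch the level is supplied by the caller rather than predicted. An explicit fact query like the one in the usage example needs only retrieval plus context, while the higher levels would call for additional reasoning steps that this sketch does not attempt.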