Abstract: Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, effectively deploying data-augmented LLMs across specialized fields presents substantial challenges. These range from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises either from a failure to correctly identify the core focus of a task or from the fact that the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method that classifies user queries into four levels based on the type of external data required and the primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and the most effective techniques for addressing them. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
