Abstract: Retrieval Augmented Generation (RAG) is a technique used to augment Large Language Models (LLMs) with contextually relevant, time-critical, or domain-specific information without altering the underlying model parameters. However, constructing RAG systems that can effectively synthesize information from a large and diverse set of documents remains a significant challenge. We introduce a novel data-centric RAG workflow for LLMs, transforming the traditional retrieve-then-read system into a more advanced prepare-then-rewrite-then-retrieve-then-read framework, to achieve a higher, domain expert-level understanding of the knowledge base. Our methodology relies on generating metadata and synthetic Questions and Answers (QA) for each document, as well as introducing the new concept of Meta Knowledge Summary (MK Summary) for metadata-based clusters of documents. The proposed innovations enable personalized user-query augmentation and in-depth information retrieval across the knowledge base. Our research makes two significant contributions: using LLMs as evaluators and employing new comparative performance metrics, we demonstrate that (1) using augmented queries with synthetic question matching significantly outperforms traditional RAG pipelines that rely on document chunking (p < 0.01), and (2) meta knowledge-augmented queries additionally significantly improve retrieval precision and recall, as well as the final answers' breadth, depth, relevancy, and specificity. Our methodology is cost-effective, costing less than $20 per 2000 research papers using Claude 3 Haiku, and can be adapted with any fine-tuning of either the language or embedding models to further enhance the performance of end-to-end RAG pipelines.

What problem does this paper attempt to address?

The paper addresses the challenges that Retrieval-Augmented Generation (RAG) systems face when working with large and diverse document collections. Specifically, it proposes a new data-centric RAG workflow to improve the performance of Large Language Models (LLMs) on knowledge-intensive tasks. The main issues are:

1. **Document Noise**: Documents in the knowledge base may be noisy, whether because of inconsistent formats or because the content itself is complex.
2. **Lack of Annotated Information**: There is usually too little manually annotated information to guide document chunking, embedding, and retrieval, which makes retrieval hard to personalize.
3. **Difficulty in Handling Long Documents**: When long documents are split into chunks that are encoded individually, extracting the relevant information is difficult, and the choice of splitting strategy strongly affects the quality of every later step.
4. **User Query Issues**: User queries are often short and ambiguous, may suffer from vocabulary mismatch, or may need several documents to answer, so it is hard to capture user intent and find the most suitable documents.
5. **Information Distribution Issues**: Relevant information may be scattered across multiple documents rather than concentrated in one place, so an automated system struggles to use the knowledge base the way a domain expert would.

To address these issues, the paper proposes a "Prepare-Rewrite-Retrieve-Read" (PR3) workflow. It generates metadata and synthetic Question-Answer (QA) pairs for each document and introduces a Meta Knowledge Summary (MK Summary) for metadata-based clusters of documents; together these support query augmentation and improve the relevance and depth of retrieval results. The approach does not require modifying the underlying model parameters and, in the paper's experiments, significantly outperforms traditional RAG pipelines based on document chunking.
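
The four stages named above can be illustrated with a toy sketch. This is not the paper's implementation. It is a minimal illustration under loud assumptions: the synthetic QA pairs and topic tags are hard-coded rather than generated by an LLM, the "embedding" is a bag-of-words count vector rather than a real embedding model, the MK Summary is just the concatenated questions of a metadata cluster, and the rewrite step appends that summary to the query instead of asking an LLM to produce augmented queries. All names (`SyntheticQA`, `QA_INDEX`, `build_mk_summaries`, `rewrite`, `retrieve`, `read`) are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class SyntheticQA:
    doc_id: str
    topic: str      # metadata tag (assumed; would come from an LLM in practice)
    question: str   # synthetic question (assumed; would come from an LLM)
    answer: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Prepare (part 1): per-document metadata and synthetic QA, hard-coded here.
QA_INDEX = [
    SyntheticQA("doc1", "retrieval",
                "How does query rewriting improve retrieval recall?",
                "It adds vocabulary the short user query was missing."),
    SyntheticQA("doc2", "retrieval",
                "What chunking strategy works for long documents?",
                "Splitting along section boundaries."),
    SyntheticQA("doc3", "evaluation",
                "How can an LLM be used as an evaluator of answers?",
                "By scoring answers against a rubric."),
]


def build_mk_summaries(index):
    """Prepare (part 2): one summary per metadata cluster.

    Assumption: the 'summary' is the cluster's questions joined together.
    """
    clusters = defaultdict(list)
    for qa in index:
        clusters[qa.topic].append(qa.question)
    return {topic: " ".join(qs) for topic, qs in clusters.items()}


def rewrite(query, topic, mk_summaries):
    """Rewrite: augment the user query with the MK Summary of its topic."""
    return f"{query} {mk_summaries.get(topic, '')}".strip()


def retrieve(query, index, k=2):
    """Retrieve: match the query against synthetic questions, not chunks."""
    q_vec = embed(query)
    ranked = sorted(index, key=lambda qa: cosine(q_vec, embed(qa.question)),
                    reverse=True)
    doc_ids = []
    for qa in ranked:  # keep the best-ranked entry per document
        if qa.doc_id not in doc_ids:
            doc_ids.append(qa.doc_id)
    return doc_ids[:k]


def read(query, doc_ids, index):
    """Read: build the prompt a reader LLM would receive (no model call)."""
    context = "\n".join(f"Q: {qa.question}\nA: {qa.answer}"
                        for qa in index if qa.doc_id in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}"


mk_summaries = build_mk_summaries(QA_INDEX)
user_query = "improve recall with query rewriting"
augmented = rewrite(user_query, "retrieval", mk_summaries)
top_docs = retrieve(augmented, QA_INDEX, k=2)
prompt = read(user_query, top_docs, QA_INDEX)
```

Matching against synthetic questions means the index stores question-shaped text, which is closer in form to a user query than a raw document chunk is. In a real pipeline, `embed` would call an embedding model, and `build_mk_summaries` and `rewrite` would each be an LLM call.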

Meta Knowledge for Retrieval Augmented Large Language Models

Retrieval-Augmented Generation for Large Language Models: A Survey

Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs

MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering

On the Role of Long-tail Knowledge in Retrieval Augmented Large Language Models

Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts

ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

Metacognitive Retrieval-Augmented Large Language Models

Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation

KG-RAG: Bridging the Gap Between Knowledge and Creativity

Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization

A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models

Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Deploying Large Language Models With Retrieval Augmented Generation

Improving Retrieval for RAG based Question Answering Models on Financial Documents

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine