Meta Knowledge for Retrieval Augmented Large Language Models

Laurent Mombaerts, Terry Ding, Adi Banerjee, Florian Felice, Jonathan Taws, Tarik Borogovac
2024-08-17
Abstract: Retrieval Augmented Generation (RAG) is a technique used to augment Large Language Models (LLMs) with contextually relevant, time-critical, or domain-specific information without altering the underlying model parameters. However, constructing RAG systems that can effectively synthesize information from a large and diverse set of documents remains a significant challenge. We introduce a novel data-centric RAG workflow for LLMs, transforming the traditional retrieve-then-read system into a more advanced prepare-then-rewrite-then-retrieve-then-read framework, to achieve higher domain expert-level understanding of the knowledge base. Our methodology relies on generating metadata and synthetic Questions and Answers (QA) for each document, as well as introducing the new concept of Meta Knowledge Summary (MK Summary) for metadata-based clusters of documents. The proposed innovations enable personalized user-query augmentation and in-depth information retrieval across the knowledge base. Our research makes two significant contributions: using LLMs as evaluators and employing new comparative performance metrics, we demonstrate that (1) using augmented queries with synthetic question matching significantly outperforms traditional RAG pipelines that rely on document chunking (p < 0.01), and (2) meta knowledge-augmented queries additionally significantly improve retrieval precision and recall, as well as the final answers' breadth, depth, relevancy, and specificity. Our methodology is cost-effective, costing less than $20 per 2000 research papers using Claude 3 Haiku, and can be adapted with any fine-tuning of either the language or embedding models to further enhance the performance of end-to-end RAG pipelines.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses the challenges that Retrieval-Augmented Generation (RAG) systems face when working with large and diverse document collections. It proposes a new data-centric RAG workflow to improve the performance of Large Language Models (LLMs) on knowledge-intensive tasks. The main issues are:

1. **Document Noise**: Documents in the knowledge base may contain a lot of noise, caused by inconsistent document formats or by the complexity of the content itself.
2. **Lack of Annotated Information**: There is usually not enough manually annotated information to support document chunking, embedding, and retrieval, which makes the retrieval problem difficult to personalize.
3. **Difficulty in Handling Long Documents**: When long documents are split and each chunk is encoded individually, it is hard to extract the relevant information, and the choice of splitting strategy strongly affects the quality of subsequent steps.
4. **User Query Issues**: User queries are often short and ambiguous, may suffer from vocabulary mismatch, or may require multiple documents to answer, making it difficult to capture user intent and find the most suitable documents.
5. **Information Distribution Issues**: Relevant information may be scattered across multiple documents rather than concentrated in one place, making it difficult for automated systems to use the knowledge base at an expert level.

To address these issues, the paper proposes a "Prepare-Rewrite-Retrieve-Read" (PR3) workflow. It generates metadata and synthetic Question-Answer (QA) pairs for each document and introduces a Meta Knowledge Summary (MK Summary) for metadata-based clusters of documents; both are used to augment user queries, improving the relevance and depth of retrieval results. The approach does not require modifying the underlying model parameters and, in the paper's experiments, significantly outperforms traditional RAG pipelines that rely on document chunking.
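The four stages of the workflow summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name in it (`SyntheticQA`, `embed`, `build_mk_summaries`, `rewrite_query`, `retrieve`, `read`) is hypothetical, the bag-of-words vector stands in for a real embedding model, and the steps that the paper delegates to an LLM (QA generation, MK Summary writing, query rewriting, final answering) are replaced by simple string operations. The QA pairs are toy placeholders, not data from the paper.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SyntheticQA:
    doc_id: str
    topic: str      # metadata tag; assumed to be produced by an LLM in practice
    question: str
    answer: str


# Toy placeholder data; in the workflow these pairs are generated per document.
QA_PAIRS = [
    SyntheticQA("doc1", "retrieval", "How does chunk size affect retrieval recall?",
                "placeholder answer from doc1"),
    SyntheticQA("doc2", "retrieval", "What embedding model is used for retrieval?",
                "placeholder answer from doc2"),
    SyntheticQA("doc3", "training", "How is fine-tuning cost estimated?",
                "placeholder answer from doc3"),
]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_mk_summaries(qa_pairs):
    """Prepare: cluster synthetic questions by metadata tag.
    Assumption: an LLM would summarize each cluster; here we just join the questions."""
    clusters = {}
    for qa in qa_pairs:
        clusters.setdefault(qa.topic, []).append(qa.question)
    return {topic: " ".join(questions) for topic, questions in clusters.items()}


def rewrite_query(query, topic, mk_summaries):
    """Rewrite: augment the user query with the MK Summary of the chosen topic.
    Assumption: an LLM would write new queries; here we append the summary text."""
    return f"{query} {mk_summaries.get(topic, '')}".strip()


def retrieve(query, qa_pairs, k=2):
    """Retrieve: rank synthetic questions (not document chunks) against the query."""
    q_vec = embed(query)
    ranked = sorted(qa_pairs, key=lambda qa: cosine(q_vec, embed(qa.question)),
                    reverse=True)
    return ranked[:k]


def read(query, retrieved):
    """Read: assemble the prompt that a downstream LLM would answer from."""
    context = "\n".join(f"Q: {qa.question}\nA: {qa.answer}" for qa in retrieved)
    return f"Context:\n{context}\n\nUser question: {query}"


mk_summaries = build_mk_summaries(QA_PAIRS)
augmented = rewrite_query("chunk size recall", "retrieval", mk_summaries)
top = retrieve(augmented, QA_PAIRS, k=2)
print(read("chunk size recall", top))
```

The point the sketch tries to make concrete is that retrieval matches the (augmented) query against synthetic questions rather than against raw document chunks, and that the MK Summary enters only at the rewrite step. Swapping `embed` for a real embedding model and the string operations for LLM calls would leave the control flow unchanged.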