RAFT: Adapting Language Model to Domain Specific RAG

Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, Joseph E. Gonzalez
2024-06-06
Abstract: Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based prompting or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in "open-book" in-domain settings. In RAFT, given a question and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This, coupled with RAFT's chain-of-thought-style response, helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to adapt pre-trained LLMs to in-domain RAG. RAFT's code and demo are open-sourced at http://github.com/ShishirPatil/gorilla.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper attempts to address the problem of adapting pre-trained large-scale language models (LLMs) to domain-specific Retrieval-Augmented Generation (RAG). Specifically, the authors propose a new method called **Retrieval-Augmented Fine-Tuning (RAFT)**, aimed at improving the model's performance in an open-book exam setting within a specific domain.

### Background and Motivation

1. **Generality and Limitations of Large-Scale Language Models**:
   - Large-scale language models (LLMs) have made significant progress in a wide range of general knowledge reasoning tasks through pre-training on vast amounts of text data.
   - However, when applying these models to specific domains, the importance of general knowledge reasoning diminishes, and the key objective is to maximize accuracy based on a given set of documents.
2. **Shortcomings of Existing Methods**:
   - **Retrieval-Augmented Generation (RAG)**: Allows the model to reference documents when answering questions but fails to fully utilize the learning opportunities in a fixed domain setting.
   - **Supervised Fine-Tuning**: Can learn more general patterns from documents and better align the final task with user preferences, but cannot leverage retrieved documents during testing.
3. **Analogy to Open-Book Exams**:
   - Existing retrieval-augmented methods are akin to unprepared open-book exams.
   - Existing supervised fine-tuning methods either directly "memorize" the input documents or answer practice questions without referencing the documents, failing to prepare for the open-book exam setting.

### Proposed Method

1. **Core Idea of RAFT**:
   - **Combining Instruction Fine-Tuning and Retrieval-Augmented Generation**: RAFT trains the model to correctly answer questions in the presence of distracting information by introducing questions, relevant documents, and distracting documents during training.
   - **Chain-of-Thought**: Generates detailed reasoning processes, referencing relevant content from the original text, enhancing the model's reasoning ability.
2. **Construction of Training Data**:
   - Each data point includes a question (Q), a set of documents (Dk), and a chain-of-thought style answer (A*).
   - Distinguishes between two types of documents: **Gold Standard Documents (D*)** (containing relevant information for the answer) and **Distractor Documents (Di)** (not containing relevant information for the answer).
   - Retains gold standard documents in some data points, while others only include distractor documents to enhance the model's memory and contextual understanding.
3. **Experimental Design**:
   - Evaluates RAFT's performance using multiple datasets (e.g., PubMed, HotpotQA, HuggingFace Hub, Torch Hub, TensorFlow Hub).
   - Compares the performance of different baseline models (e.g., LlaMA2-7B, LlaMA2-7B + RAG, domain-specific fine-tuned models, etc.).

### Experimental Results

1. **Performance Improvement**:
   - RAFT significantly outperforms existing domain-specific fine-tuning methods on all domain-specific datasets, whether combined with RAG or not.
   - For example, on the HotpotQA dataset, RAFT's performance improved by 35.25%, and on the Torch Hub dataset, it improved by 76.35%.
2. **Effect of Chain-of-Thought**:
   - Introducing chain-of-thought significantly improved the model's training robustness and accuracy, preventing overfitting to short answers.
3. **Impact of Distractor Documents**:
   - Introducing a certain proportion of distractor documents during training helps improve the model's robustness when handling different numbers of test documents.

### Conclusion

By combining instruction fine-tuning and retrieval-augmented generation, RAFT effectively improves the performance of large-scale language models in an open-book exam setting within specific domains. This method not only enhances the model's contextual understanding and reasoning ability but also improves its robustness in handling distracting information.
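
The training-data construction described under "Proposed Method" can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `RaftExample` and `build_raft_example`, the parameter `p_gold` (the fraction of data points that keep the gold standard document), and the default values are all hypothetical. The summary above does not say how the chain-of-thought answer (A*) is produced, so the sketch takes it as given.

```python
import random
from dataclasses import dataclass


@dataclass
class RaftExample:
    question: str         # Q
    documents: list[str]  # Dk: the context shown to the model
    answer: str           # A*: chain-of-thought style answer
    has_gold: bool        # True if the gold document D* is in the context


def build_raft_example(question, gold_doc, distractor_pool, cot_answer,
                       num_distractors=3, p_gold=0.8, rng=None):
    """Assemble one training data point (hypothetical helper).

    With probability p_gold the context holds the gold document plus
    distractors; otherwise it holds distractors only. p_gold and
    num_distractors are assumed values, not taken from the paper.
    """
    rng = rng or random.Random()
    # Pick distractor documents (Di) without replacement.
    documents = rng.sample(distractor_pool, num_distractors)
    keep_gold = rng.random() < p_gold
    if keep_gold:
        documents.append(gold_doc)
    # Shuffle so the gold document's position carries no signal.
    rng.shuffle(documents)
    return RaftExample(question, documents, cot_answer, keep_gold)
```

Calling `build_raft_example` once per question, with a seeded `random.Random` for reproducibility, yields a mix of data points with and without the gold standard document. Shuffling the context is a design choice of this sketch, intended to keep the model from learning a positional shortcut.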