ChatQA: Surpassing GPT-4 on Conversational QA and RAG

Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro
2024-10-30
Abstract: In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts the performance of RAG. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to the alternative state-of-the-art query rewriting models, while substantially reducing deployment costs. We also present the ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations on RAG, table-related QA, arithmetic calculations, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, can slightly outperform GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on the ChatRAG Bench, without relying on any synthetic data from OpenAI GPT models. Notably, the Llama3-ChatQA-1.5-70B model surpasses the accuracy of GPT-4-Turbo-2024-04-09, achieving a 4.4% improvement. To advance research in this field, we open-sourced the model weights, instruction tuning data, ChatRAG Bench, and retriever for the community: https://chatqa-project.github.io/.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
The paper aims to build models that surpass existing state-of-the-art systems such as GPT-4 on conversational question answering (QA) and retrieval-augmented generation (RAG). Specifically, the authors focus on three goals:

1. **Enhancing conversational QA**: The model should interact with users in a conversational form, handle follow-up questions, and integrate retrieved evidence effectively in both open-domain and long-document settings.
2. **Reducing dependence on specific datasets**: A general-purpose model should handle table-related QA and arithmetic calculations without fine-tuning on each dataset, while staying comparable in accuracy to fine-tuned models.
3. **Reducing deployment costs**: An optimized retriever should match the quality of query-rewriting approaches while substantially lowering deployment cost.

To achieve these goals, the authors propose the following methods:

- **Two-stage instruction tuning**: In the first stage, supervised fine-tuning (SFT) strengthens the model's instruction-following ability. In the second stage, context-enhanced instruction tuning further improves its performance on conversational QA and RAG.
- **Dense retriever for conversational QA**: A dense retriever optimized for multi-turn conversational queries yields results comparable to state-of-the-art query-rewriting models at a lower deployment cost.
- **Comprehensive evaluation benchmark**: ChatRAG Bench comprises ten datasets covering RAG, table-related QA, arithmetic calculations, and scenarios with unanswerable questions.

With these methods, ChatQA-1.0-70B scores 54.14 on ChatRAG Bench, slightly ahead of GPT-4-0613 (53.90) and GPT-4-Turbo-2024-04-09 (54.03), without using any synthetic data from OpenAI GPT models.

The authors also open-source the model weights, instruction-tuning data, ChatRAG Bench, and the retriever to support further research in this field.
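A conversation-aware retriever has to resolve follow-up questions such as "How many datasets does it have?", where "it" only makes sense given earlier turns. One common way to do this without a separate query-rewriting model is to pass the dialogue history together with the current question as the retrieval query. The sketch below illustrates only that idea; whether it matches the paper's exact input format is an assumption. The `embed` function is a toy bag-of-words stand-in for a learned dense encoder, and all names (`build_query`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned dense encoder (assumption): token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_query(history: list[str], question: str) -> str:
    """Concatenate earlier turns with the current question into one query."""
    return " ".join(history + [question])


def retrieve(history: list[str], question: str,
             passages: list[str], top_k: int = 1) -> list[str]:
    """Rank passages by similarity to the history-aware query."""
    query_vec = embed(build_query(history, question))
    ranked = sorted(passages,
                    key=lambda p: cosine(query_vec, embed(p)),
                    reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    passages = [
        "ChatRAG Bench spans ten datasets covering RAG and table QA.",
        "Many retrieval datasets exist for single-turn search.",
    ]
    question = "How many datasets does it have?"
    # With the earlier turn included, the query mentions ChatRAG Bench.
    print(retrieve(["Tell me about ChatRAG Bench."], question, passages))
    # Without history, the pronoun "it" carries no retrievable signal.
    print(retrieve([], question, passages))
```

In this toy setup the follow-up question alone matches the generic passage better, while the history-aware query ranks the ChatRAG Bench passage first. A real system would replace `embed` with a trained encoder and typically precompute passage embeddings.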