Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue

Simone Alghisi, Massimo Rizzoli, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
2024-08-03
Abstract: We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama-2 and Mistral, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both scenarios of Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates how effective different techniques are at adapting large language models (LLMs) to generate responses in human-machine dialogue. Specifically, it focuses on the following points:

1. **Limitations of Adaptation Techniques**: Although LLMs have been applied to various dialogue types (Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering), they can produce toxic, biased, or irrelevant responses. Researchers have therefore proposed adaptation techniques such as in-context learning and fine-tuning, as well as strategies for incorporating external knowledge, such as knowledge bases and retrieval-augmented generation (RAG).
2. **Limitations of Existing Research**: Current research on these adaptation techniques mainly focuses on specific dialogue types and relies on different base models and evaluation methods, which limits the understanding of how effective the techniques are overall.
3. **Evaluation across Dialogue Types**: To gain a more comprehensive understanding, the paper selects two base models (Llama-2 Chat and Mistral Instruct) and four dialogue types (Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering), and systematically analyzes the performance of in-context learning and fine-tuning on each dialogue type.
4. **Impact of External Knowledge**: The paper also evaluates how incorporating external knowledge (either retrieved knowledge or gold knowledge) during generation affects response quality.
5. **Consistency of Evaluation Methods**: To keep the evaluation fair and comparable, the paper adopts unified automatic evaluation metrics and human evaluation protocols, covering dimensions such as contextual consistency, appropriateness, correctness, and effectiveness of the dialogue.

Through this analysis, the paper aims to help researchers and developers better understand and select adaptation techniques suited to specific dialogue tasks.
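
The comparison described above can be pictured as a grid of experimental configurations. The following is a minimal sketch of such a grid, not the authors' code: the dictionary keys, the `build_grid` helper, and the `"none"` knowledge baseline are assumptions introduced for illustration, and the full cross product is also an assumption, since the paper may not run every combination for every dialogue type.

```python
from itertools import product

# Factors named in the summary above.
BASE_MODELS = ["Llama-2 Chat", "Mistral Instruct"]
DIALOGUE_TYPES = [
    "Open-Domain",
    "Knowledge-Grounded",
    "Task-Oriented",
    "Question Answering",
]
TECHNIQUES = ["in-context learning", "fine-tuning"]
# "none" is an assumed no-knowledge baseline; retrieved (RAG) and gold
# knowledge are the two scenarios mentioned in the abstract.
KNOWLEDGE = ["none", "retrieved (RAG)", "gold"]


def build_grid():
    """Return every combination of the four factors (hypothetical helper)."""
    return [
        {"model": m, "dialogue_type": d, "technique": t, "knowledge": k}
        for m, d, t, k in product(BASE_MODELS, DIALOGUE_TYPES, TECHNIQUES, KNOWLEDGE)
    ]


if __name__ == "__main__":
    grid = build_grid()
    print(f"{len(grid)} configurations")
    for config in grid[:3]:
        print(config)
```

Laying the factors out this way makes it easier to see why a single "best technique" is unlikely: each cell pairs a base model with a dialogue type, and the summary reports that the outcome depends on both.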