A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
2024-06-21
Abstract: Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".
Computation and Language, Information Retrieval
What problem does this paper attempt to address?
This paper compares base models with their instruction-tuned ("instruct") counterparts as the generator in Retrieval-Augmented Generation (RAG) systems. Instruct LLMs, which are fine-tuned to follow instructions and aligned with human preferences, are generally assumed to be the better choice for RAG tasks. The study finds the opposite: under its experimental settings, base models outperform their instruct counterparts by 20% on average. This finding challenges the common assumption that instruct LLMs are superior in RAG applications.

The study evaluates the models under two task instructions: one asks the model to extract the answer from the retrieved documents, and the other also requires it to provide evidence supporting the answer. Base models generally outperform the instruct models on both tasks, even when evidence is required.

The paper also examines how instruct models behave with and without their recommended templates. Using the template does not always improve performance: it may lead to overly verbose answers, which reduces accuracy, and may even produce inaccurate answers. More broadly, the paper shows that the situation is more complex than the headline figure suggests, with multiple factors influencing the performance of a RAG system.

As a future research direction, the paper calls for a more in-depth exploration of the methodology and evaluation criteria for RAG, in order to promote the development of more effective and reliable AI systems. Overall, it provides a thorough comparison between base and instruct models in RAG systems, revealing potential issues with current practices and offering new perspectives for understanding the behavior of RAG systems.
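To make the setup described above more concrete, the following is a minimal Python sketch, not the authors' code. Everything in it is an assumption made for illustration: the prompt layout in `build_prompt`, the placeholder chat markup in `wrap_in_template`, the substring-match scoring in `is_correct`, and the `max_words` cap that stands in for a generation length limit. It shows one way a verbose answer could score lower than a terse one; the summary does not state that this is the mechanism at work in the paper.

```python
def build_prompt(instruction, documents, question):
    """Concatenate a task instruction, retrieved documents, and the question."""
    doc_block = "\n".join(
        f"Document [{i + 1}] {doc}" for i, doc in enumerate(documents)
    )
    return f"{instruction}\n{doc_block}\nQuestion: {question}\nAnswer:"


def wrap_in_template(prompt):
    """Wrap a prompt in a hypothetical chat template.

    The markup here is a placeholder; each real instruct model defines its own.
    """
    return f"<user>\n{prompt}\n</user>\n<assistant>\n"


def is_correct(response, gold_answers, max_words=None):
    """Return True if any gold answer appears in the (optionally truncated) response.

    Assumption: accuracy is scored by case-insensitive substring match, and
    max_words imitates a cap on how much text the model may generate.
    """
    words = response.split()
    if max_words is not None:
        words = words[:max_words]
    text = " ".join(words).lower()
    return any(gold.lower() in text for gold in gold_answers)


# Illustrative inputs only, not data from the paper.
prompt = build_prompt(
    "Answer the question using the documents.",
    ["Paris is the capital of France."],
    "What is the capital of France?",
)
templated = wrap_in_template(prompt)

terse = "Paris"
verbose = "Based on the documents provided, the answer is Paris."

terse_ok = is_correct(terse, ["Paris"], max_words=5)
verbose_ok = is_correct(verbose, ["Paris"], max_words=5)
```

With the five-word cap, the terse answer still contains the gold answer while the verbose one is cut off before reaching it. Without the cap, both would count as correct, so the effect in this sketch depends entirely on the assumed length limit.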