Abstract: Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence, combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".

What problem does this paper attempt to address?

This paper compares base models with instruction-tuned large language models (Instruct LLMs) in Retrieval-Augmented Generation (RAG) systems. It is generally believed that Instruct LLMs, which are fine-tuned to follow instructions and aligned with human preferences, outperform base models in RAG tasks. However, the study found that, under its experimental settings, base models performed on average 20% better than their Instruct counterparts. This finding challenges the common assumption that Instruct LLMs are superior in RAG applications.

The study evaluated the models with two task instructions: one asking the model to extract answers from the retrieved documents, and one asking it to provide evidence to support its answers. Base models generally outperformed the fine-tuned models on these tasks, even when evidence was required.

The paper also shows that the situation is more complex than expected, with multiple factors influencing the performance of a RAG system. In particular, it examines Instruct models with and without their recommended templates and finds that using a template does not always improve performance: templates may lead to overly verbose answers, which reduces accuracy.

Finally, the paper proposes directions for future research, namely a more in-depth examination of the methodology and evaluation criteria for RAG, in order to promote the development of more effective and reliable AI systems. Overall, it provides a thorough comparison of base and Instruct models in RAG systems, reveals potential issues with current practice, and offers new perspectives for understanding the behavior of RAG systems.
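The difference between prompting a base model with plain text and prompting an Instruct model through a template can be sketched as follows. This is only an illustration under stated assumptions: the instruction wording, the function names, and the `[USER]` / `[/USER]` markers are hypothetical placeholders, not the prompts or templates used in the paper, and real chat templates differ from model to model.

```python
# Hypothetical sketch: instruction texts and template markers are assumptions,
# not the prompts or templates used in the paper.

# Paraphrases of the two task instructions described in the summary.
EXTRACT_INSTRUCTION = "Extract the answer to the question from the documents below."
EVIDENCE_INSTRUCTION = (
    "Extract the answer to the question from the documents below "
    "and quote the evidence that supports it."
)


def build_context(documents):
    """Number the retrieved documents and join them into one context string."""
    return "\n".join(
        f"Document [{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )


def build_base_prompt(instruction, documents, question):
    """Plain-text prompt, ending with a cue that the model continues."""
    return (
        f"{instruction}\n{build_context(documents)}\n"
        f"Question: {question}\nAnswer:"
    )


def build_instruct_prompt(instruction, documents, question,
                          user_open="[USER]", user_close="[/USER]"):
    """Same content wrapped in placeholder chat-template markers."""
    body = f"{instruction}\n{build_context(documents)}\nQuestion: {question}"
    return f"{user_open} {body} {user_close}"


if __name__ == "__main__":
    docs = ["Paris is the capital of France."]
    question = "What is the capital of France?"
    print(build_base_prompt(EXTRACT_INSTRUCTION, docs, question))
    print()
    print(build_instruct_prompt(EVIDENCE_INSTRUCTION, docs, question))
```

Both builders carry the same instruction, documents, and question; only the wrapping changes. That makes it easy to run the same Instruct model with and without its template and compare the answers, which is the kind of comparison the summary describes.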

A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

Introducing Super RAGs in Mistral 8x7B-v1

Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation

SFR-RAG: Towards Contextually Faithful LLMs

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

The Power of Noise: Redefining Retrieval for RAG Systems

Invar-RAG: Invariant LLM-aligned Retrieval for Better Generation

Faculty Perspectives on the Potential of RAG in Computer Science Higher Education

Retrieval Augmented Generation Systems: Automatic Dataset Creation, Evaluation and Boolean Agent Setup

InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales

Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems