Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems

Robert Lakatos, Peter Pollner, Andras Hajdu, Tamas Joo

2024-03-13

Abstract: The development of generative large language models (G-LLM) opened up new opportunities for the development of new types of knowledge-based systems similar to ChatGPT, Bing, or Gemini. Fine-tuning (FN) and Retrieval-Augmented Generation (RAG) are the techniques that can be used to implement domain adaptation for the development of G-LLM-based knowledge systems. In our study, using ROUGE, BLEU, METEOR scores, and cosine similarity, we compare and examine the performance of RAG and FN for the GPT-J-6B, OPT-6.7B, LlaMA, LlaMA-2 language models. Based on measurements shown on different datasets, we demonstrate that RAG-based constructions are more efficient than models produced with FN. We point out that connecting RAG and FN is not trivial, because connecting FN models with RAG can cause a decrease in performance. Furthermore, we outline a simple RAG-based architecture which, on average, outperforms the FN models by 16% in terms of the ROUGE score, 15% in the case of the BLEU score, and 53% based on the cosine similarity. This shows the significant advantage of RAG over FN in terms of hallucination, which is not offset by the fact that the average 8% better METEOR score of FN models indicates greater creativity compared to RAG.
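To make the metrics named in the abstract more concrete, the following is a minimal sketch of two of them in plain Python: a unigram-overlap score in the spirit of ROUGE-1 and a cosine similarity between two texts. It is an illustration under stated assumptions, not the authors' evaluation code: the whitespace tokenizer, the choice of the F1 variant, and the bag-of-words vectors (standing in for the embeddings a real evaluation would more likely use) are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real toolkits do more.
    return text.lower().split()


def rouge1_f1(candidate, reference):
    # Unigram overlap between candidate and reference, reported as F1.
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    # Cosine of the angle between two equal-length numeric vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bag_of_words_cosine(text_a, text_b):
    # Assumption: word-count vectors stand in for learned sentence embeddings.
    count_a = Counter(tokenize(text_a))
    count_b = Counter(tokenize(text_b))
    vocab = sorted(set(count_a) | set(count_b))
    return cosine_similarity([count_a[w] for w in vocab],
                             [count_b[w] for w in vocab])
```

Both scores grow as a generated answer moves closer to the reference answer; the overlap score rewards shared wording, while the cosine score compares the two texts as vectors.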

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper compares Retrieval-Augmented Generation (RAG) and fine-tuning (FN) as techniques for building domain-specific knowledge systems on top of generative large language models (G-LLM). It applies both techniques to GPT-J-6B, OPT-6.7B, LlaMA, and LlaMA-2 and scores the resulting systems with ROUGE, BLEU, METEOR, and cosine similarity.

The study finds RAG more effective than FN for building knowledge systems. On average, the RAG-based architecture outperforms the FN models by 16% on ROUGE, 15% on BLEU, and 53% on cosine similarity, which indicates that its output stays closer to the reference text both lexically and semantically and is less prone to hallucination. RAG also scales better: the system's knowledge can be extended by adding new information to the database, with no need to retrain the model. FN models score about 8% better on METEOR on average, which the authors read as a sign of greater creativity, but this does not offset RAG's advantage on the other metrics.

In summary, the paper sets out to determine experimentally whether RAG or FN is the more effective way to develop knowledge systems for a specific domain. The results favor RAG. The paper also points out that combining RAG with FN is not straightforward, because connecting fine-tuned models with RAG can decrease performance.
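The scalability point above, that a RAG system gains knowledge when a passage is added to its database and needs no retraining, can be sketched with a toy retriever. This is a hedged illustration only: the class `ToyKnowledgeBase`, the helper `build_prompt`, the word-count similarity, and the prompt layout are all hypothetical and are not taken from the paper, whose architecture would use a real embedding model and a language model to generate the answer.

```python
import math
import re
from collections import Counter


def _vector(text):
    # Assumption: word counts stand in for a learned embedding.
    return Counter(re.findall(r"[a-z0-9\-]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


class ToyKnowledgeBase:
    """Hypothetical in-memory passage store with similarity search."""

    def __init__(self):
        self.passages = []

    def add(self, passage):
        # Extending the system's knowledge is just an append; no model
        # weights are touched.
        self.passages.append(passage)

    def retrieve(self, query, k=1):
        # Rank stored passages by similarity to the query, best first.
        q = _vector(query)
        ranked = sorted(self.passages,
                        key=lambda p: _cosine(q, _vector(p)),
                        reverse=True)
        return ranked[:k]


def build_prompt(kb, question, k=1):
    # The retrieved passages are placed in front of the question; a
    # generative model (not included here) would complete the answer.
    context = "\n".join(kb.retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In this sketch, a question about a fact that was added a moment ago is already answerable from the retrieved context, whereas a fine-tuned model would need another training run to absorb the same fact.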

