Abstract: Retrieval Augmented Generation (RAG) is widely used to enable Large Language Models (LLMs) to perform Question Answering (QA) tasks in various domains. However, evaluating the responses generated by RAG systems built on open-source LLMs for specialized domains remains challenging. A popular framework in the literature is RAG Assessment (RAGAS), a publicly available library that uses LLMs for evaluation. One disadvantage of RAGAS is that it does not detail how the numerical values of its evaluation metrics are derived. One outcome of this work is a modified version of this package for a few metrics (faithfulness, context relevance, answer relevance, answer correctness, answer similarity and factual correctness) that exposes the intermediate outputs of the prompts and works with any LLM. Next, we analyse expert evaluations of the output of the modified RAGAS package and observe the challenges of using it in the telecom domain. We also study the behaviour of the metrics under correct vs. wrong retrieval and observe that a few of the metrics have higher values for correct retrieval. We also study differences in the metrics between base embeddings and those domain-adapted via pre-training and fine-tuning. Finally, we comment on the suitability and challenges of using these metrics for an in-the-wild telecom QA task.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper examines how well RAGAS metrics evaluate Retrieval-Augmented Generation (RAG) question-answering (QA) systems in the telecommunications domain. Specifically, it focuses on the following aspects:

1. **Improvement of the RAGAS Framework**:
   - The researchers modified the existing RAGAS framework to expose the intermediate outputs of the prompts used at each step of metric computation. This makes the evaluation metrics easier to understand and interpret.
2. **Applicability of Evaluation in the Telecommunications Domain**:
   - Using the modified RAGAS framework, the performance of RAG systems in the telecommunications domain is evaluated. The researchers examine whether the specialized terminology of the telecommunications field affects the evaluation results.
3. **Factors Influencing Evaluation Metrics**:
   - The paper explores the impact of retriever performance, domain-adapted embeddings, and instruction-tuned LLMs on the RAGAS evaluation metrics.

The research questions (RQs) of the paper are:

- **RQ1**: How are prompt-based LLM evaluation metrics (such as RAGAS) computed step by step?
- **RQ2**: Are RAGAS metrics applicable to QA tasks in the telecommunications domain?
- **RQ3**: Are RAGAS metrics affected by retriever performance, domain-adapted embeddings, and instruction tuning?

The main contributions of the paper are as follows:

1. Improved the public RAGAS codebase by recording all intermediate outputs of the prompts used to calculate RAGAS metrics.
2. Conducted a manual evaluation of the intermediate outputs and analyzed their applicability to telecommunications domain data.
3. Established two metrics (factual correctness and faithfulness) as good indicators of RAG response correctness and demonstrated how the use of domain-adapted LLMs further improves these metrics.
4. Highlighted that the factual correctness metric improves after generator fine-tuning, regardless of whether the retrieval context is correct. Additionally, correct answers derived from incorrect retrieval contexts were found to yield lower faithfulness scores, indicating that the generator (LLM) may be using information from outside the context.
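
To make the idea of "recording intermediate outputs" concrete, the following is a minimal sketch, not the authors' code and not the RAGAS API. It assumes the commonly described form of the faithfulness metric: the answer is broken into statements, each statement is judged as supported or unsupported by the retrieved context, and the score is the fraction of supported statements. The names `FaithfulnessTrace`, `faithfulness_with_trace`, `toy_extract_statements`, and `toy_is_supported` are hypothetical, and the two toy functions are simple string-based stand-ins for what would be LLM prompts in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FaithfulnessTrace:
    """Final score plus the intermediate outputs that produced it (hypothetical)."""
    statements: List[str]   # statements extracted from the answer
    verdicts: List[bool]    # True if the statement is supported by the context
    score: float            # fraction of supported statements


def faithfulness_with_trace(
    answer: str,
    context: str,
    extract_statements: Callable[[str], List[str]],
    is_supported: Callable[[str, str], bool],
) -> FaithfulnessTrace:
    """Compute a faithfulness-style score and keep every intermediate step."""
    statements = extract_statements(answer)
    verdicts = [is_supported(s, context) for s in statements]
    # Guard against an empty statement list to avoid division by zero.
    score = sum(verdicts) / len(verdicts) if verdicts else float("nan")
    return FaithfulnessTrace(statements, verdicts, score)


# Toy stand-ins for the LLM prompts (assumptions for illustration only).
def toy_extract_statements(answer: str) -> List[str]:
    """Split the answer on periods; a real system would prompt an LLM."""
    return [part.strip() for part in answer.split(".") if part.strip()]


def toy_is_supported(statement: str, context: str) -> bool:
    """Case-insensitive substring check; a real system would prompt an LLM."""
    return statement.lower() in context.lower()


context = "In 5G, the gNB connects to the 5G core over the NG interface."
answer = "The gNB connects to the 5G core. The gNB uses the X2 interface."

trace = faithfulness_with_trace(
    answer, context, toy_extract_statements, toy_is_supported
)
for statement, verdict in zip(trace.statements, trace.verdicts):
    print(f"{'supported' if verdict else 'unsupported'}: {statement}")
print(f"faithfulness = {trace.score}")
```

Keeping the statements and verdicts next to the score is what would let a domain expert check each step by hand, which is the kind of inspection the manual evaluation described above relies on.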

Evaluation of RAG Metrics for Question Answering in the Telecom Domain

Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

RAG based Question-Answering for Contextual Response Prediction System

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

CRAG -- Comprehensive RAG Benchmark

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

TelecomRAG: Taming Telecom Standards with Retrieval Augmented Generation and LLMs

Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

Telco-RAG: Navigating the Challenges of Retrieval-Augmented Language Models for Telecommunications

ERATTA: Extreme RAG for Table To Answers with Large Language Models

DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

RAGProbe: An Automated Approach for Evaluating RAG Applications

Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA

Advancing Question-Answering in Ophthalmology with Retrieval Augmented Generations (RAG): Benchmarking Open-source and Proprietary Large Language Models

R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

RAG4ITOps: A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance