Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation across different language families, making it challenging to evaluate LLM robustness against errors in externally retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain only passages judged as non-relevant, whereas queries in the relevant subset include at least one passage judged relevant. We measure relevance assessment using: (i) hallucination rate, measuring the model's tendency to hallucinate when the answer is not present in the passages of the non-relevant subset, and (ii) error rate, measuring the model's failure to recognize relevant passages in the relevant subset. In our work, we observe that most models struggle to balance the two capacities. Models such as LLAMA-2 and Orca-2 achieve over 88% hallucination rate on the non-relevant subset. Mistral and LLAMA-3 hallucinate less but can achieve up to a 74.9% error rate on the relevant subset. Overall, GPT-4 is observed to provide the best tradeoff on both subsets, highlighting future work necessary to improve LLM robustness. NoMIRACL dataset and evaluation code are available at: https://github.com/project-miracl/nomiracl.
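The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes each model response has already been reduced to a boolean: True if the model claimed the passages contain an answer, False if it abstained (for example, by replying "I don't know"). That reduction, and the function names hallucination_rate and error_rate, are assumptions made for illustration and are not taken from the NoMIRACL evaluation code.

```python
from typing import Sequence


def hallucination_rate(answered_on_non_relevant: Sequence[bool]) -> float:
    """Share of non-relevant-subset queries where the model answered anyway.

    Every passage in this subset was judged non-relevant, so any claimed
    answer counts as a hallucination (assumed interpretation).
    """
    if not answered_on_non_relevant:
        raise ValueError("need at least one query")
    return sum(answered_on_non_relevant) / len(answered_on_non_relevant)


def error_rate(answered_on_relevant: Sequence[bool]) -> float:
    """Share of relevant-subset queries where the model abstained.

    Each query here has at least one relevant passage, so abstaining means
    the model failed to recognize it (assumed interpretation).
    """
    if not answered_on_relevant:
        raise ValueError("need at least one query")
    abstained = sum(not answered for answered in answered_on_relevant)
    return abstained / len(answered_on_relevant)


if __name__ == "__main__":
    # Hypothetical toy predictions, not results from the paper.
    non_relevant_preds = [True, True, False, True]
    relevant_preds = [True, False, False, True]
    print(f"hallucination rate: {hallucination_rate(non_relevant_preds):.2f}")
    print(f"error rate: {error_rate(relevant_preds):.2f}")
```

Under these assumptions, a model that always answers scores 0% error rate but 100% hallucination rate, and a model that always abstains scores the reverse, which is why the abstract treats the two numbers as a tradeoff.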

In-Context Learning for Scalable and Online Hallucination Detection in RAGS

Embedding and Gradient Say Wrong: A White-Box Method for Hallucination Detection

Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots

LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation

RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

Cost-Effective Hallucination Detection for LLMs

Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned

Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases

ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

Enhancing Hallucination Detection through Perturbation-Based Synthetic Data Generation in System Responses

Honest AI: Fine-Tuning "Small" Language Models to Say "I Don't Know", and Reducing Hallucination in RAG

MALTO at SemEval-2024 Task 6: Leveraging Synthetic Data for LLM Hallucination Detection

Insights into Classifying and Mitigating LLMs' Hallucinations

Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models

Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost

Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Fine-grained Hallucination Detection and Editing for Language Models

"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation

SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection

Chainpoll: A high efficacy method for LLM hallucination detection