HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Wen Luo, Tianshu Shen, Wei Li, Guangyue Peng, Richeng Xuan, Houfeng Wang, Xi Yang

2024-06-11

Abstract: Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.

Computation and Language

What problem does this paper attempt to address?

The paper addresses hallucination in large language models (LLMs), specifically how to evaluate it automatically at the dialogue level. To that end, it proposes HalluDial, a large-scale benchmark for automatic dialogue-level hallucination evaluation. Existing hallucination benchmarks mainly target sentence- or passage-level detection and neglect dialogue-level evaluation, hallucination localization, and rationale provision. They also concentrate on factuality hallucinations while underestimating faithfulness hallucinations, and they often rely on labor-intensive or non-specialized evaluators. HalluDial covers both spontaneous and induced hallucination scenarios, and both factuality and faithfulness hallucinations. The benchmark contains 4,094 dialogues with a total of 146,856 samples. Using HalluDial, the authors conduct a meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge model, HalluJudge. HalluJudge handles hallucination detection, localization, and explanation, achieves superior or competitive performance in hallucination evaluation, and can be used to automatically assess hallucinations in LLM-generated responses in information-seeking dialogues. The main contributions are the first comprehensive large-scale benchmark for dialogue-level hallucination evaluation and the HalluJudge evaluation model.
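As a rough illustration of what a detection-level meta-evaluation could look like, the sketch below scores a judge's binary verdicts against gold labels. It is not the authors' evaluation code: the `Sample` record, its field names, and the choice of accuracy, precision, recall, and F1 as metrics are all assumptions made for this example, and the actual HalluDial data format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical record: one dialogue response with a gold label and a judge verdict."""
    dialogue_id: str
    response: str
    gold_hallucinated: bool    # assumed gold annotation
    judge_hallucinated: bool   # assumed verdict from the judge model under test


def detection_scores(samples: List[Sample]) -> Dict[str, float]:
    """Compare judge verdicts with gold labels, treating 'hallucinated' as the positive class."""
    tp = sum(s.gold_hallucinated and s.judge_hallucinated for s in samples)
    fp = sum((not s.gold_hallucinated) and s.judge_hallucinated for s in samples)
    fn = sum(s.gold_hallucinated and (not s.judge_hallucinated) for s in samples)
    tn = sum((not s.gold_hallucinated) and (not s.judge_hallucinated) for s in samples)

    total = len(samples)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy, made-up samples purely to exercise the function.
toy_samples = [
    Sample("d1", "response a", gold_hallucinated=True, judge_hallucinated=True),
    Sample("d1", "response b", gold_hallucinated=True, judge_hallucinated=False),
    Sample("d2", "response c", gold_hallucinated=False, judge_hallucinated=True),
    Sample("d2", "response d", gold_hallucinated=False, judge_hallucinated=False),
]
scores = detection_scores(toy_samples)
```

Localization and rationale quality would need separate measures (for example, span overlap or human or model-based rating), which this sketch does not attempt.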

DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models

Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

PhD: A Prompted Visual Hallucination Evaluation Dataset

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine

VisDiaHalBench: A Visual Dialogue Benchmark For Diagnosing Hallucination in Large Vision-Language Models

HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

Evaluation and Analysis of Hallucination in Large Vision-Language Models

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs

Unified Hallucination Detection for Multimodal Large Language Models

Hallucination Detection and Hallucination Mitigation: An Investigation

AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models