HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Wen Luo, Tianshu Shen, Wei Li, Guangyue Peng, Richeng Xuan, Houfeng Wang, Xi Yang
2024-06-11
Abstract: Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.
Computation and Language
What problem does this paper attempt to address?
The paper addresses hallucination in large language models (LLMs), and specifically the lack of benchmarks for evaluating it at the dialogue level. Existing hallucination benchmarks focus mainly on sentence- or passage-level detection and neglect dialogue-level evaluation, hallucination localization, and rationale provision. They also concentrate on factuality hallucinations while underweighting faithfulness hallucinations, and they often rely on labor-intensive or non-specialized evaluators. To fill this gap, the paper proposes HalluDial, a large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial covers both spontaneous and induced hallucination scenarios and both factuality and faithfulness hallucinations, and it contains 4,094 dialogues totaling 146,856 samples. Using HalluDial, the authors conduct a meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge model, HalluJudge. HalluJudge performs hallucination detection, localization, and explanation, and the authors report that it achieves superior or competitive performance, which supports automatic evaluation of hallucinations in LLM-generated dialogue responses. The paper's main contributions are the first comprehensive large-scale benchmark for dialogue-level hallucination evaluation and the HalluJudge evaluation model trained on it.