UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation

Xun Liang, Shichao Song, Simin Niu, Zhiyu Li, Feiyu Xiong, Bo Tang, Yezhaohui Wang, Dawei He, Peng Cheng, Zhonghao Wang, Haiying Deng

2024-05-24

Abstract: Large language models (LLMs) have emerged as pivotal contributors in contemporary natural language processing and are increasingly being applied across a diverse range of industries. However, these large-scale probabilistic statistical models cannot currently ensure the requisite quality in professional content generation. These models often produce hallucinated text, compromising their practical utility in professional contexts. To assess the authentic reliability of LLMs in text generation, numerous initiatives have developed benchmark evaluations for hallucination phenomena. Nevertheless, these benchmarks frequently utilize constrained generation techniques due to cost and temporal constraints. These techniques encompass the use of directed hallucination induction and strategies that deliberately alter authentic text to produce hallucinations. These approaches are not congruent with the unrestricted text generation demanded by real-world applications. Furthermore, a well-established Chinese-language dataset dedicated to the evaluation of hallucinations in text generation is presently lacking. Consequently, we have developed an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, designed to compile outputs produced with minimal restrictions by LLMs. Concurrently, we have established a comprehensive benchmark evaluation framework to aid subsequent researchers in undertaking scalable and reproducible experiments. We have also executed extensive experiments, evaluating prominent Chinese language models and the GPT series models to derive professional performance insights regarding hallucination challenges.

Computation and Language

What problem does this paper attempt to address?

The paper addresses hallucination in text generated by large language models (LLMs), which limits the usefulness of these models in professional settings. Many studies have built benchmarks to measure hallucination, but most construct their evaluation datasets with constrained generation techniques, such as directed hallucination induction or deliberate modification of authentic text. These techniques do not match the unrestricted text generation that practical applications require, and no mature Chinese-language dataset for evaluating hallucination exists. To address both gaps, the authors developed UHGEval, a benchmark built from hallucinated text that LLMs produced under minimal restrictions. They also established a comprehensive evaluation framework to help subsequent researchers run scalable and reproducible experiments, and they evaluated multiple well-known Chinese LLMs and GPT series models to better understand how these models hallucinate. The specific contributions are: an unconstrained hallucination evaluation dataset containing over 5000 items; a unified and diversified evaluation framework covering discriminative, selective, and generative evaluation; and a comprehensive empirical analysis of eight major Chinese LLMs and three classic GPT series models.
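To make two of the evaluation modes named above more concrete, here is a minimal Python sketch of how discriminative and selective accuracy might be scored. It is not the UHGEval implementation. The `Item` fields, the `toy_judge` and `toy_choose` stand-ins for an LLM, and the two sample items are all hypothetical and exist only for illustration. The sketch assumes that discriminative evaluation asks a model whether a continuation contains a hallucination, and that selective evaluation asks it to pick the faithful continuation out of a pair.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """One hypothetical benchmark item; the field names are illustrative."""
    context: str       # opening of a document
    hallucinated: str  # continuation that contains a hallucination
    faithful: str      # continuation that does not


def discriminative_accuracy(items: List[Item],
                            judge: Callable[[str, str], bool]) -> float:
    """judge(context, continuation) returns True when it flags a hallucination."""
    correct, total = 0, 0
    for item in items:
        # The hallucinated continuation should be flagged, the faithful one not.
        correct += int(judge(item.context, item.hallucinated))
        correct += int(not judge(item.context, item.faithful))
        total += 2
    return correct / total if total else 0.0


def selective_accuracy(items: List[Item],
                       choose: Callable[[str, str, str], int]) -> float:
    """choose(context, a, b) returns 0 or 1, the option it believes is faithful."""
    correct = 0
    for index, item in enumerate(items):
        # Alternate the option order so a fixed guess cannot score perfectly.
        if index % 2 == 0:
            options, answer = (item.faithful, item.hallucinated), 0
        else:
            options, answer = (item.hallucinated, item.faithful), 1
        correct += int(choose(item.context, *options) == answer)
    return correct / len(items) if items else 0.0


def toy_judge(context: str, continuation: str) -> bool:
    """Toy stand-in for an LLM: flag any number that the context never mentions."""
    return any(n not in context for n in re.findall(r"\d+", continuation))


def toy_choose(context: str, a: str, b: str) -> int:
    """Toy stand-in for an LLM: pick option a unless the toy judge flags it."""
    return 1 if toy_judge(context, a) else 0


# Invented sample data, not taken from the benchmark.
ITEMS = [
    Item("The factory opened in 2019 with 300 workers.",
         "By 2021 it employed 900 people.",
         "All 300 workers joined in 2019."),
    Item("The bridge is 12 km long.",
         "The bridge crosses the river twice.",
         "At 12 km, it is long."),
]

if __name__ == "__main__":
    print(discriminative_accuracy(ITEMS, toy_judge))
    print(selective_accuracy(ITEMS, toy_choose))
```

The toy judge misses the second item's hallucination because it contains no number, which illustrates why such scores reflect the judging model rather than the data alone. Generative evaluation is left out of this sketch because it would require a text-similarity metric against a reference continuation.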

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Evaluating Hallucinations in Chinese Large Language Models

DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Evaluation and Analysis of Hallucination in Large Vision-Language Models

HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models

Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs

Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Hallucination Detection for Generative Large Language Models by Bayesian Sequential Estimation

Hallucination of Multimodal Large Language Models: A Survey

AutoHall: Automated Hallucination Dataset Generation for Large Language Models

Unified Hallucination Detection for Multimodal Large Language Models