UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation

Xun Liang, Shichao Song, Simin Niu, Zhiyu Li, Feiyu Xiong, Bo Tang, Yezhaohui Wang, Dawei He, Peng Cheng, Zhonghao Wang, Haiying Deng
2024-05-24
Abstract: Large language models (LLMs) have emerged as pivotal contributors in contemporary natural language processing and are increasingly being applied across a diverse range of industries. However, these large-scale probabilistic statistical models cannot currently ensure the requisite quality in professional content generation. These models often produce hallucinated text, compromising their practical utility in professional contexts. To assess the authentic reliability of LLMs in text generation, numerous initiatives have developed benchmark evaluations for hallucination phenomena. Nevertheless, these benchmarks frequently utilize constrained generation techniques due to cost and temporal constraints. These techniques encompass the use of directed hallucination induction and strategies that deliberately alter authentic text to produce hallucinations. These approaches are not congruent with the unrestricted text generation demanded by real-world applications. Furthermore, a well-established Chinese-language dataset dedicated to the evaluation of hallucinations in text generation is presently lacking. Consequently, we have developed an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, designed to compile outputs produced with minimal restrictions by LLMs. Concurrently, we have established a comprehensive benchmark evaluation framework to aid subsequent researchers in undertaking scalable and reproducible experiments. We have also executed extensive experiments, evaluating prominent Chinese language models and the GPT series models to derive professional performance insights regarding hallucination challenges.
Computation and Language
What problem does this paper attempt to address?
The paper addresses hallucination in text generated by large language models (LLMs), which limits how useful these models are in professional settings. Many studies have built benchmarks to measure hallucination, but most of them create their evaluation data through constrained generation, for example by inducing hallucinations in a directed way or by deliberately altering authentic text. This does not match the unrestricted text generation that real-world applications require. In addition, no well-established Chinese-language dataset exists for evaluating hallucination in text generation. To address these gaps, the authors developed UHGEval, a benchmark that includes hallucinated texts generated by LLMs with minimal restrictions. They also built a comprehensive evaluation framework to help later researchers run scalable and reproducible experiments, and they evaluated prominent Chinese LLMs alongside GPT series models to better understand hallucination. The specific contributions are: an unconstrained hallucination evaluation dataset containing over 5000 items; a unified and diverse evaluation framework covering discriminative, selective, and generative evaluation; and a comprehensive empirical analysis of eight major Chinese LLMs and three classic GPT series models.