Abstract: Large language models (LLMs) have become central to contemporary natural language processing and are increasingly applied across a diverse range of industries. However, these large-scale probabilistic models cannot currently guarantee the quality required for professional content generation. They often produce hallucinated text, which compromises their practical utility in professional contexts. To assess how reliable LLMs actually are in text generation, numerous efforts have developed benchmark evaluations for hallucination. Nevertheless, these benchmarks frequently rely on constrained generation techniques because of cost and time constraints. Such techniques include directed hallucination induction and strategies that deliberately alter authentic text to produce hallucinations. These approaches are not consistent with the unrestricted text generation that real-world applications demand. Furthermore, a well-established Chinese-language dataset dedicated to evaluating hallucinations in text generation is currently lacking. We therefore developed the Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, which compiles outputs that LLMs produce with minimal restrictions. We also built a comprehensive evaluation framework to help subsequent researchers run scalable and reproducible experiments, and we conducted extensive experiments on prominent Chinese language models and the GPT series models to derive insights into their performance on hallucination.

Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

Hallucination Detection and Hallucination Mitigation: An Investigation

DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Embedding and Gradient Say Wrong: A White-Box Method for Hallucination Detection

Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models

Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Unified Hallucination Detection for Multimodal Large Language Models

AutoHall: Automated Hallucination Dataset Generation for Large Language Models

MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models

UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Hallucination Detection for Generative Large Language Models by Bayesian Sequential Estimation

Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code

The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models