Abstract: Large language models (LLMs) have transformed the landscape of language processing, yet they still face significant challenges in security, privacy, and the generation of seemingly coherent but factually inaccurate outputs, commonly referred to as hallucinations. One particularly pressing issue is Fact-Conflicting Hallucination (FCH), where an LLM generates content that directly contradicts established facts. Tackling FCH is difficult for two reasons. First, automating the construction and updating of benchmark datasets is challenging, because current methods rely on static benchmarks that do not cover the diverse range of FCH scenarios. Second, validating the reasoning process behind LLM outputs is inherently complex, especially when intricate logical relations are involved. To address these obstacles, we propose an approach that leverages logic programming to enhance metamorphic testing for FCH detection. Our method gathers data from sources such as Wikipedia, expands it with logical reasoning to create diverse test cases, assesses LLMs through structured prompts, and validates the coherence of their answers using semantic-aware assessment mechanisms. Applied to six different LLMs across nine domains, the method generates test cases and detects hallucinations, revealing hallucination rates ranging from 24.7% to 59.8%. Key observations indicate that LLMs struggle in particular with temporal concepts and out-of-distribution knowledge, and that they exhibit deficiencies in logical reasoning. These outcomes demonstrate the efficacy of the logic-based test cases generated by our tool in both triggering and identifying hallucinations, and they underscore the need for ongoing collaborative efforts within the community to detect and address LLM hallucinations.
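
As a rough illustration of the pipeline described in the abstract, the sketch below derives a new fact from seed facts with a single logic rule, turns it into a yes/no test case, and flags a conflicting answer. It is a minimal sketch under stated assumptions, not the authors' implementation: the seed triples, the transitivity rule, the function names (`derive_transitive`, `to_test_case`, `is_hallucination`), and the prompt wording are all hypothetical, and the simple string check stands in for the semantic-aware validation the abstract mentions.

```python
from itertools import product

# Hypothetical seed facts as (subject, relation, object) triples.
# In the described method these would be gathered from sources like Wikipedia.
SEED_FACTS = {
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
}


def derive_transitive(facts, relation):
    """Return the new facts implied by transitivity of `relation`.

    Rule (assumed for illustration): (a, r, b) and (b, r, c) imply (a, r, c).
    """
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, r1, b), (c, r2, d) in product(list(closed), repeat=2):
            if r1 == r2 == relation and b == c and a != d:
                new_fact = (a, relation, d)
                if new_fact not in closed:
                    closed.add(new_fact)
                    changed = True
    return closed - set(facts)


def to_test_case(fact):
    """Turn a derived fact into a yes/no prompt with its expected answer."""
    subject, relation, obj = fact
    phrase = relation.replace("_", " ")
    question = f"Is it true that {subject} is {phrase} {obj}? Answer yes or no."
    return {"question": question, "expected": "yes"}


def is_hallucination(answer, expected):
    """Flag an answer that does not start with the expected label.

    A crude stand-in for semantic-aware checking of the model's response.
    """
    return not answer.strip().lower().startswith(expected)


derived = derive_transitive(SEED_FACTS, "located_in")
cases = [to_test_case(fact) for fact in derived]
```

In use, each `question` in `cases` would be sent to the model under test and the reply passed to `is_hallucination`. A fuller version would add more rule types (for example negation or composition of different relations) so that expected answers are not always "yes".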

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code

Code Hallucination

CodeMirage: Hallucinations in Code Generated by Large Language Models

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models

DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models

Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector

Hallucination Detection and Hallucination Mitigation: An Investigation

Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models

De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models