Abstract: The surge in applications of large language models (LLMs) has prompted concerns about the generation of misleading or fabricated information, known as hallucinations. Therefore, detecting hallucinations has become critical to maintaining trust in LLM-generated content. A primary challenge in learning a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To address the challenge, we introduce HaloScope, a novel learning framework that leverages the unlabeled LLM generations in the wild for hallucination detection. Such unlabeled data arises freely upon deploying LLMs in the open world, and consists of both truthful and hallucinated information. To harness the unlabeled data, we present an automated membership estimation score for distinguishing between truthful and untruthful generations within unlabeled mixture data, thereby enabling the training of a binary truthfulness classifier on top. Importantly, our framework does not require extra data collection and human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that HaloScope can achieve superior hallucination detection performance, outperforming the competitive rivals by a significant margin. Code is available at https://github.com/deeplearningwisc/haloscope.

What problem does this paper attempt to address?

The paper addresses hallucinations in content generated by large language models (LLMs). A hallucination is model output that appears informative but is actually untrue, and it can lead to misleading or incorrect decisions in open-world applications. Detecting hallucinations is therefore crucial for maintaining trust in LLM-generated content. The paper points out that the main challenge in training a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To tackle this challenge, the authors propose the HaloScope framework, which uses unlabeled LLM generations for hallucination detection. This unlabeled data arises naturally when LLMs are deployed in the open world and contains both truthful and hallucinated information. An automated membership estimation score distinguishes truthful from untruthful generations within the unlabeled mixture, which in turn allows a binary truthfulness classifier to be trained. Importantly, the framework requires no additional data collection or human annotation, which makes it flexible and practical for real-world applications.

HaloScope achieves its goals through the following steps:

1. **Utilization of Unlabeled LLM-Generated Content**: The data comes from user interactions with chat applications. It can be obtained for free, but it typically mixes truthful and potentially hallucinated content.
2. **Identification of Latent Subspaces**: LLM representations are decomposed to identify latent subspaces associated with hallucinated statements.
3. **Membership Estimation Scores**: Based on the latent subspaces, a score is computed for each sample to distinguish truthful from hallucinated generations.
4. **Training of the Truthfulness Classifier**: Based on the membership estimation scores, the data is divided into a candidate hallucinated set and a candidate truthful set, and a binary truthfulness classifier is then trained on this split.

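
The four steps above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not specify the decomposition, the scoring rule, the threshold, or the classifier, so the sketch assumes an SVD of mean-centered embeddings, a score equal to the singular-value-weighted squared projection onto the top-`k` directions, the mean score as a placeholder threshold, and a plain logistic regression as the classifier. The toy 2-D `embeddings` array and the names `membership_scores`, `train_logistic`, and `predict` are hypothetical stand-ins for real LLM representations and components.

```python
import numpy as np


def membership_scores(embeddings, k=1):
    """Assumed scoring rule: weighted squared projection onto the top-k singular directions."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are right singular vectors, sorted by decreasing singular value.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:k].T  # shape (n_samples, k)
    return (sing[:k] * proj ** 2).mean(axis=1)


def train_logistic(x, y, lr=0.5, steps=2000):
    """Gradient descent on the logistic loss; a stand-in for the binary classifier."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of "hallucinated"
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(x, w, b):
    """True means the sample is flagged as a candidate hallucination."""
    return (x @ w + b) > 0.0


# Step 1: unlabeled "generations" (toy embeddings; the last two rows form a distinct minority cluster).
embeddings = np.array([
    [0.2, 0.0], [-0.2, 0.0], [0.0, 0.2], [0.0, -0.2], [0.1, 0.1], [-0.1, -0.1],
    [4.0, 0.1], [4.0, -0.1],
])

# Steps 2-3: decompose the representations and score every sample.
scores = membership_scores(embeddings, k=1)

# Step 4: split by a threshold (placeholder choice) and train on the pseudo-labels.
threshold = scores.mean()
pseudo_labels = scores > threshold
w, b = train_logistic(embeddings, pseudo_labels.astype(float))
```

In this toy setup the minority cluster receives the largest scores only because it lies far from the mean along the dominant direction; with real representations, both the direction of the score and the threshold would need to be validated rather than assumed.
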
Experimental results show that HaloScope outperforms existing methods in hallucination detection across multiple datasets, with a particularly pronounced gain on the challenging TruthfulQA benchmark. The paper also examines HaloScope's robustness under different data distributions and its scalability to larger LLMs, further supporting its effectiveness and practicality.

HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection

Embedding and Gradient Say Wrong: A White-Box Method for Hallucination Detection

The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models

Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making

Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models

MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models

Hallucination of Multimodal Large Language Models: A Survey

Hallucination Detection for Generative Large Language Models by Bayesian Sequential Estimation

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Cost-Effective Hallucination Detection for LLMs

Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus

KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual Checking

Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

Learning to Trust Your Feelings: Leveraging Self-awareness in LLMs for Hallucination Mitigation

Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models

SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection

Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework

Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models