HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection

Xuefeng Du, Chaowei Xiao, Yixuan Li
2024-09-26
Abstract: The surge in applications of large language models (LLMs) has prompted concerns about the generation of misleading or fabricated information, known as hallucinations. Therefore, detecting hallucinations has become critical to maintaining trust in LLM-generated content. A primary challenge in learning a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To address the challenge, we introduce HaloScope, a novel learning framework that leverages the unlabeled LLM generations in the wild for hallucination detection. Such unlabeled data arises freely upon deploying LLMs in the open world, and consists of both truthful and hallucinated information. To harness the unlabeled data, we present an automated membership estimation score for distinguishing between truthful and untruthful generations within unlabeled mixture data, thereby enabling the training of a binary truthfulness classifier on top. Importantly, our framework does not require extra data collection and human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that HaloScope can achieve superior hallucination detection performance, outperforming the competitive rivals by a significant margin. Code is available at https://github.com/deeplearningwisc/haloscope.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of hallucinations in content generated by large language models (LLMs). Hallucinations are generations that appear informative but are actually untrue, and they can lead to misleading or incorrect decisions in open-world applications. Detecting them is therefore crucial for maintaining trust in LLM-generated content. The paper points out that the main challenge in training a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To tackle this challenge, the authors propose the HaloScope framework, which uses unlabeled LLM-generated content for hallucination detection. This unlabeled data arises naturally when LLMs are deployed in the open world, and it contains both truthful and hallucinated information. An automated membership estimation score distinguishes truthful from untruthful generations within the unlabeled mixture, which makes it possible to train a binary truthfulness classifier. Importantly, the framework requires no additional data collection or human annotation, which makes it flexible and practical for real-world applications.
HaloScope achieves its goals through the following steps:
1. **Utilization of Unlabeled LLM-Generated Content**: This data comes from user interactions with chat applications. It can be obtained for free but typically contains both truthful and potentially hallucinated content.
2. **Identification of Latent Subspaces**: By decomposing LLM representations, latent subspaces related to hallucinated statements are identified.
3. **Membership Estimation Scores**: Based on the latent subspaces, a score is calculated for each sample to distinguish truthful from hallucinated generations.
4. **Training of the Truthfulness Classifier**: Based on the membership estimation scores, the data is divided into a candidate hallucinated set and a candidate truthful set, and a binary truthfulness classifier is then trained on them.
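The four steps above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are random toy vectors rather than LLM representations, the scoring rule (singular-value-weighted squared projection onto the top-k singular directions of the centered embeddings) is one plausible reading of "decomposing LLM representations", and the function names, the `quantile` threshold, and the plain gradient-descent logistic regression are all hypothetical choices made for this example.

```python
import numpy as np


def membership_scores(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Score each sample by its weighted squared projection onto the top-k
    singular directions of the centered embeddings (assumed scoring rule)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, i.e. directions in feature space.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:k].T  # shape: (n_samples, k)
    return (singular_values[:k] * projections**2).mean(axis=1)


def pseudo_label(scores: np.ndarray, quantile: float) -> np.ndarray:
    """Mark samples scoring above the given quantile as candidate hallucinations (1.0)."""
    return (scores > np.quantile(scores, quantile)).astype(float)


def train_logistic(x: np.ndarray, y: np.ndarray, lr: float = 0.1, steps: int = 500):
    """Fit a binary logistic-regression classifier with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b


def predict(x: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Return 1.0 for predicted hallucinations and 0.0 for predicted truthful samples."""
    return (x @ w + b > 0).astype(float)


def make_toy_data(rng, n_truthful: int, n_hallucinated: int, dim: int = 8, shift: float = 6.0):
    """Toy stand-in for LLM embeddings: hallucinated samples are shifted along axis 0."""
    n = n_truthful + n_hallucinated
    x = rng.normal(0.0, 0.5, size=(n, dim))
    x[n_truthful:, 0] += shift
    y = np.concatenate([np.zeros(n_truthful), np.ones(n_hallucinated)])
    return x, y


rng = np.random.default_rng(0)

# Step 1: an unlabeled mixture; y_hidden is used only to evaluate the sketch.
x_unlabeled, y_hidden = make_toy_data(rng, n_truthful=140, n_hallucinated=60)

# Steps 2-3: subspace identification and membership scores.
scores = membership_scores(x_unlabeled, k=1)

# Step 4: split into candidate sets, then train the binary classifier.
# The 0.7 quantile is a hand-picked threshold for this toy mixture.
pseudo = pseudo_label(scores, quantile=0.7)
w, b = train_logistic(x_unlabeled, pseudo)

x_test, y_test = make_toy_data(rng, n_truthful=70, n_hallucinated=30)
pseudo_agreement = float(np.mean(pseudo == y_hidden))
test_accuracy = float(np.mean(predict(x_test, w, b) == y_test))
print(f"pseudo-label agreement: {pseudo_agreement:.2f}, test accuracy: {test_accuracy:.2f}")
```

In this toy setup the minority (hallucinated) cluster lies farther from the mixture mean along the dominant direction, so it receives larger scores. Real LLM embeddings need not behave this way, and in practice the threshold and the number of directions `k` would have to be chosen with care.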
Experimental results show that HaloScope outperforms existing methods in hallucination detection across multiple datasets, with a significant improvement on the challenging TruthfulQA benchmark in particular. The paper also examines the robustness of HaloScope under different data distributions and its scalability to larger LLMs, further supporting its effectiveness and practicality.