Distinguishing Ignorance from Error in LLM Hallucinations

Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov
2024-10-29
Abstract: Large language models (LLMs) are susceptible to hallucinations: outputs that are ungrounded, factually incorrect, or inconsistent with prior generations. We focus on closed-book Question Answering (CBQA), where previous work has not fully addressed the distinction between two possible kinds of hallucinations, namely, whether the model (1) does not hold the correct answer in its parameters or (2) answers incorrectly despite having the required knowledge. We argue that distinguishing these cases is crucial for detecting and mitigating hallucinations. Specifically, case (2) may be mitigated by intervening in the model's internal computation, as the knowledge resides within the model's parameters. In contrast, in case (1) there is no parametric knowledge to leverage for mitigation, so it should be addressed by resorting to an external knowledge source or abstaining. To help distinguish between the two cases, we introduce Wrong Answer despite having Correct Knowledge (WACK), an approach for constructing model-specific datasets for the second hallucination type. Our probing experiments indicate that the two kinds of hallucinations are represented differently in the model's inner states. Next, we show that datasets constructed using WACK exhibit variations across models, demonstrating that even when models share knowledge of certain facts, they still vary in the specific examples that lead to hallucinations. Finally, we show that training a probe on our WACK datasets leads to better hallucination detection of case (2) hallucinations than using the common generic one-size-fits-all datasets. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper focuses on hallucinations in large language models (LLMs) during closed-book question answering (CBQA). Specifically, it distinguishes between two types of hallucinations:

1. **Hallucinations due to lack of knowledge (HK−)**: The model does not have the correct answer in its parameters, resulting in incorrect output.
2. **Hallucinations despite having the correct knowledge (HK+)**: The model has the correct knowledge in its parameters but still produces incorrect output in certain situations.

The paper argues that distinguishing between these two types is crucial for detecting and mitigating hallucinations. HK− hallucinations can be addressed by consulting an external knowledge source or by abstaining from answering; HK+ hallucinations may be corrected by intervening in the model's internal computation, since the required knowledge already resides in its parameters.

### Main Contributions

1. **Proposing the WACK method**: This is a method for constructing model-specific datasets that include hallucinations due to lack of knowledge (HK−) and hallucinations despite having the correct knowledge (HK+). The authors will release the datasets used in their experiments.
2. **Discriminative power of the model's internal states**: Demonstrating that the model's internal states can be used to distinguish between these two types of hallucinations.
3. **Importance of model-specific datasets**: Showing that model-specific datasets are more effective than general datasets for detecting HK+ hallucinations.

### Experimental Design

- **Dataset Construction**: The authors used two common closed-book question answering datasets (TriviaQA and NaturalQuestions) and constructed model-specific datasets for three LLMs (Mistral-7B-v0.3, Llama-3.1-8B, and Gemma-2-9B).
- **Hallucination Detection**: The authors trained classifiers to detect the different types of hallucinations and compared the effectiveness of model-specific datasets with that of general datasets.

### Results

- **Different types of hallucinations are represented differently in the model's internal states**: Classifiers trained on the model's internal states can distinguish between the two types of hallucinations.
- **Model-specific datasets are more effective**: Compared to general datasets, model-specific datasets perform better at detecting HK+ hallucinations. This indicates that detection methods tailored to a specific model capture differences unique to that model and therefore identify its hallucinations more reliably.

### Conclusion

By proposing the WACK method, the paper addresses the problem of distinguishing between the two types of hallucinations and demonstrates the importance of model-specific datasets in hallucination detection. These results suggest directions for future research on improving the reliability and accuracy of LLMs.
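
The HK−/HK+ distinction and the corresponding mitigation choice can be illustrated with a small labeling sketch. This is not the authors' WACK implementation. It assumes, purely for illustration, that a model "has the knowledge" when it answers correctly in every sample taken under a plain prompt, and that answer correctness can be approximated by a case-insensitive substring match. The names `Example`, `baseline_answers`, `tested_answer`, `has_knowledge`, and `suggest_mitigation` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Label(Enum):
    CORRECT = "factually-correct"
    HK_MINUS = "HK-"  # wrong answer, knowledge absent from the parameters
    HK_PLUS = "HK+"   # wrong answer despite having the knowledge


@dataclass
class Example:
    question: str
    gold: str
    baseline_answers: List[str]  # hypothetical: answers sampled with a plain prompt
    tested_answer: str           # hypothetical: answer in the setting being audited


def is_match(answer: str, gold: str) -> bool:
    # Assumption: substring match stands in for a real answer-scoring rule.
    return gold.strip().lower() in answer.strip().lower()


def has_knowledge(ex: Example, min_rate: float = 1.0) -> bool:
    # Assumption: the model "knows" the fact if at least `min_rate` of its
    # baseline answers are correct. The threshold is illustrative only.
    if not ex.baseline_answers:
        return False
    hits = sum(is_match(a, ex.gold) for a in ex.baseline_answers)
    return hits / len(ex.baseline_answers) >= min_rate


def label_example(ex: Example) -> Label:
    # A correct answer is never a hallucination; otherwise split on knowledge.
    if is_match(ex.tested_answer, ex.gold):
        return Label.CORRECT
    return Label.HK_PLUS if has_knowledge(ex) else Label.HK_MINUS


def suggest_mitigation(label: Label) -> str:
    # Mirrors the routing argued for in the text: external knowledge or
    # abstention for HK-, internal intervention for HK+.
    if label is Label.HK_MINUS:
        return "use an external knowledge source or abstain"
    if label is Label.HK_PLUS:
        return "intervene in the model's internal computation"
    return "no mitigation needed"


if __name__ == "__main__":
    ex = Example(
        question="What is the capital of France?",
        gold="Paris",
        baseline_answers=["Paris", "It is Paris."],
        tested_answer="Lyon",
    )
    label = label_example(ex)
    print(label.value, "->", suggest_mitigation(label))
```

Because the labels depend on each model's own baseline answers, running the same questions through different models would generally yield different HK+ sets, which is the motivation for model-specific datasets. The threshold and matching rule here are placeholders and would need to be replaced by whatever criteria a real pipeline uses.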