Abstract: State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. To explore the mechanistic causes of these hallucinations, we create diagnostic datasets with subject-relation queries and adapt interpretability methods to trace hallucinations through internal model representations. We discover two general and distinct mechanistic causes of hallucinations shared across LMs (Llama-2, Pythia, GPT-J): 1) knowledge enrichment hallucinations: insufficient subject attribute knowledge in lower-layer MLPs, and 2) answer extraction hallucinations: failure to select the correct object attribute in upper-layer attention heads. We also find that these two internal mechanistic causes of hallucinations are reflected in external manifestations. Based on insights from our mechanistic analysis, we propose a novel hallucination mitigation method through targeted restoration of the LM's internal fact recall pipeline, demonstrating superior performance compared to baselines.

What problem does this paper attempt to address?

The paper addresses non-factual hallucinations in language models (LMs): generated statements that do not align with real-world knowledge. It investigates the mechanisms behind these hallucinations by building a diagnostic dataset and applying interpretability methods, and it proposes a method to reduce them.

### Main Contributions of the Paper:

1. **Mechanism Analysis**: The paper identifies two general and distinct mechanisms that produce hallucinations:
   - **Knowledge Enrichment Hallucinations**: The lower-layer MLPs do not supply sufficient attribute knowledge about the subject.
   - **Answer Extraction Hallucinations**: The upper-layer attention heads fail to select the correct object attribute.
2. **External Manifestations**: The paper finds that the two internal mechanisms also differ in their external manifestations, namely subject-object association strength, robustness to input perturbations, and model prediction uncertainty.
3. **Hallucination Mitigation Methods**: Based on insights from the mechanism analysis, the paper proposes a new mitigation method, Mechanistic Hallucination Mitigation (MHM), which outperforms baseline methods through targeted restoration of the LM's internal fact recall pipeline.

### Research Background:

- **Knowledge Storage in Language Models**: Recent work on knowledge tracing in language models explores how specific layers and neurons store factual information.
- **Hallucination Detection and Mitigation**: Existing research primarily detects and mitigates hallucinations through external features (such as prediction uncertainty and logical consistency), which offer limited insight into the internal mechanisms of hallucinations.
- **Mechanistic Interpretability**: Mechanistic interpretability research examines the internals of transformer models with white-box methods, identifying components that are crucial for accurate factual predictions.

### Method Overview:

- **Dataset Construction**: The paper collects approximately 80K factual knowledge queries from the ParaRel dataset to evaluate three pre-trained language models (Llama-2, Pythia, GPT-J).
- **Mechanism Analysis**: Using methods such as Logit Lens and Causal Mediation Analysis, the paper analyzes the intermediate hidden representations of the models on each query and identifies the key components that lead to hallucinations.
- **Hallucination Mitigation**: When the model generates an incorrect answer, the proposed MHM method improves factual accuracy by encouraging the model to retrieve more correct information from its MLPs and by suppressing the propagation of incorrect information.

### Results:

- **Mechanism Validation**: Causal analysis indicates that lower-layer MLPs and upper-layer attention heads are the key components behind non-factual hallucinations.
- **External Features**: The analysis of external features further supports the distinction between the two hallucination mechanisms.
- **Mitigation Effectiveness**: The MHM method significantly reduces hallucinations across multiple open-domain question-answering datasets without significantly affecting the model's original knowledge.

In summary, the paper's mechanistic analysis identifies causes of non-factual hallucinations in language models and yields an effective mitigation method, offering new perspectives and tools for future research.
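
To make the Logit Lens step from the method overview more concrete, here is a minimal sketch of the general idea: project each layer's hidden state through the unembedding matrix and track how the probability and rank of a target token change with depth. This is not the paper's implementation. The hidden states and unembedding matrix are random toy arrays, the parameter-free layer norm is a simplifying assumption, and all names (`logit_lens`, `hidden_states`, `unembed`, `target_id`) are hypothetical.

```python
import numpy as np


def layer_norm(h, eps=1e-5):
    # Parameter-free layer norm over the last axis.
    # Assumption: real models apply a learned scale (and often a bias) here.
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)


def logit_lens(hidden_states, unembed, target_id):
    """Decode every layer's hidden state directly into vocabulary space.

    hidden_states: (num_layers, d_model) vectors at the final token position.
    unembed:       (d_model, vocab_size) output embedding matrix.
    Returns the per-layer probability and rank (0 = top) of target_id.
    """
    logits = layer_norm(hidden_states) @ unembed       # (num_layers, vocab_size)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    target_prob = probs[:, target_id]
    # Rank = number of vocabulary items with a strictly larger logit.
    target_rank = (logits > logits[:, [target_id]]).sum(axis=-1)
    return target_prob, target_rank


# Toy data (assumption): hidden states drift from random noise toward the
# target token's unembedding direction as depth increases.
rng = np.random.default_rng(0)
num_layers, d_model, vocab_size, target_id = 6, 64, 50, 7
unembed = rng.normal(size=(d_model, vocab_size))
noise = rng.normal(size=d_model)
alphas = np.linspace(0.0, 1.0, num_layers)
hidden_states = np.stack(
    [(1 - a) * noise + a * unembed[:, target_id] for a in alphas]
)

target_prob, target_rank = logit_lens(hidden_states, unembed, target_id)
for layer, (p, r) in enumerate(zip(target_prob, target_rank)):
    print(f"layer {layer}: p(target)={p:.3f}, rank={r}")
```

With a real model, `hidden_states` would come from the model's per-layer activations and `target_id` would be the token id of the correct object. In such a setup, a correct answer that never rises in rank through the layers and one that rises but is not selected at the end would be the kind of contrast the summary describes between the two hallucination types.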

Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations

The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models

Look Within, Why LLMs Hallucinate: A Causal Perspective

Hallucination Detection and Hallucination Mitigation: An Investigation

Towards Mitigating Hallucination in Large Language Models via Self-Reflection

Mitigating Entity-Level Hallucination in Large Language Models

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Quantifying and Attributing the Hallucination of Large Language Models via Association Analysis

Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models

On Large Language Models' Hallucination with Regard to Known Facts

A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models

Alleviating Hallucinations of Large Language Models through Induced Hallucinations

The Troubling Emergence of Hallucination in Large Language Models -- An Extensive Definition, Quantification, and Prescriptive Remediations

Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models

FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

Zero-Resource Hallucination Prevention for Large Language Models

Banishing LLM Hallucinations Requires Rethinking Generalization

Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics