Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations

Lei Yu, Meng Cao, Jackie Chi Kit Cheung, Yue Dong
2024-06-18
Abstract: State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. To explore the mechanistic causes of these hallucinations, we create diagnostic datasets with subject-relation queries and adapt interpretability methods to trace hallucinations through internal model representations. We discover two general and distinct mechanistic causes of hallucinations shared across LMs (Llama-2, Pythia, GPT-J): 1) knowledge enrichment hallucinations: insufficient subject attribute knowledge in lower layer MLPs, and 2) answer extraction hallucinations: failure to select the correct object attribute in upper layer attention heads. We also found these two internal mechanistic causes of hallucinations are reflected in external manifestations. Based on insights from our mechanistic analysis, we propose a novel hallucination mitigation method through targeted restoration of the LM's internal fact recall pipeline, demonstrating superior performance compared to baselines.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses non-factual hallucinations generated by language models (LMs): factual errors in generated text that do not align with real-world knowledge. Specifically, it explores the mechanisms behind these hallucinations by creating a diagnostic dataset and applying interpretability methods, and it proposes a method to reduce them.

### Main Contributions of the Paper:

1. **Mechanism Analysis**: The paper identifies two prevalent and distinct mechanisms of hallucination generation:
   - **Knowledge Enrichment Hallucinations**: The lower-layer multi-layer perceptrons (MLPs) hold insufficient subject attribute knowledge.
   - **Answer Extraction Hallucinations**: The upper-layer self-attention heads fail to select the correct object attribute.
2. **External Manifestations**: The paper also finds that these two internal mechanisms differ in their external manifestations, such as subject-object association strength, robustness to input perturbations, and model prediction uncertainty.
3. **Hallucination Mitigation Methods**: Based on insights from the mechanism analysis, the paper proposes a new hallucination mitigation method (Mechanistic Hallucination Mitigation, MHM), which outperforms baseline methods by specifically restoring the internal fact recall pipeline of the language model.

### Research Background:

- **Knowledge Storage in Language Models**: In recent years, researchers have focused on knowledge tracing in language models, exploring how specific layers and neurons store factual information.
- **Hallucination Detection and Mitigation**: Existing research primarily detects and mitigates hallucinations through external features (such as prediction uncertainty and logical consistency), but these methods offer limited insight into the internal mechanisms of hallucinations.
- **Mechanistic Interpretability**: Mechanistic interpretability research examines the internal mechanisms of transformer models through white-box methods, identifying components crucial for accurate factual predictions.

### Method Overview:

- **Dataset Construction**: The paper collects approximately 80K factual knowledge queries from the ParaRel dataset to evaluate three pre-trained language models (Llama-2, Pythia, GPT-J).
- **Mechanism Analysis**: Using methods such as Logit Lens and Causal Mediation Analysis, the paper analyzes the intermediate hidden representations of the models when processing each query, identifying key components that lead to hallucinations.
- **Hallucination Mitigation**: The proposed MHM method improves factual accuracy by encouraging the model to retrieve more correct information from MLPs when it generates incorrect answers, and by suppressing the propagation of incorrect information.

### Results:

- **Mechanism Validation**: Causal analysis results indicate that lower-layer MLPs and upper-layer self-attention heads are the key components leading to non-factual hallucinations.
- **External Features**: External feature analysis further validates the distinction between the two hallucination mechanisms.
- **Mitigation Effectiveness**: The MHM method significantly reduces hallucinations across multiple open-domain question-answering datasets without significantly affecting the model's original knowledge.

In summary, through in-depth mechanism analysis, this paper not only reveals the reasons behind non-factual hallucinations generated by language models but also provides an effective mitigation method, offering new perspectives and tools for future research.
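
The Logit Lens mentioned in the method overview can be illustrated with a minimal sketch. It is not the paper's code: it assumes a toy residual stream of hand-built hidden states, an identity unembedding matrix, and hypothetical token ids (`answer_id`, `distractor_id`) chosen only for illustration. The idea it shows is the general one: project each layer's hidden state through the final normalization and the unembedding matrix, then track how the correct answer's rank changes with depth.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Normalize each hidden state to zero mean and unit variance (no learned scale/bias).
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def logit_lens(hidden_states, unembed):
    # hidden_states: (n_layers, d_model), unembed: (d_model, vocab).
    # Returns per-layer probability distributions over the vocabulary.
    logits = layer_norm(hidden_states) @ unembed
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def answer_rank_by_layer(probs, answer_id):
    # Rank 0 means the answer token is the top prediction at that layer.
    return (probs > probs[:, [answer_id]]).sum(axis=-1)

# Toy setup (all values are assumptions for illustration).
n_layers, d_model = 6, 8
unembed = np.eye(d_model)          # identity unembedding: one token per dimension
answer_id, distractor_id = 3, 5    # hypothetical token ids

# A constant distractor signal plus an answer signal that grows with depth,
# mimicking a model that gradually enriches the subject representation.
hidden_states = np.zeros((n_layers, d_model))
hidden_states[:, distractor_id] = 2.5
hidden_states[:, answer_id] = np.arange(n_layers, dtype=float)

probs = logit_lens(hidden_states, unembed)
ranks = answer_rank_by_layer(probs, answer_id)
first_top1_layer = int(np.argmax(ranks == 0))

print("answer rank per layer:", ranks.tolist())
print("first layer where the answer is top-1:", first_top1_layer)
```

In this toy, the answer only overtakes the distractor once its signal exceeds 2.5, so it first becomes the top prediction at layer 3. On a real model one would replace `hidden_states` and `unembed` with the model's own per-layer hidden states and output embedding; a correct answer that never reaches a low rank at any layer would be the kind of signal such an analysis looks for.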