Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou
Abstract: Large language models are successful in answering factoid questions but are also prone to hallucination. We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics, an area not previously covered in studies on hallucinations. We are able to conduct this analysis via two key ideas. First, we identify the factual questions that query the same triplet knowledge but result in different answers. The difference between the model behaviors on the correct and incorrect outputs hence suggests the patterns when hallucinations happen. Second, to measure the pattern, we utilize mappings from the residual streams to vocabulary space. We reveal the different dynamics of the output token probabilities along the depths of layers between the correct and hallucinated cases. In hallucinated cases, the output token's information rarely demonstrates abrupt increases and consistent superiority in the later stages of the model. Leveraging the dynamic curve as a feature, we build a classifier capable of accurately detecting hallucinatory predictions with an 88% success rate. Our study sheds light on understanding the reasons for LLMs' hallucinations on their known facts, and more importantly, on accurately predicting when they are hallucinating.
What problem does this paper attempt to address?
This paper addresses the hallucination of large language models (LLMs) on known facts. Although these models perform well on factual questions, they are also prone to hallucination: they can generate incorrect information even when they hold the knowledge needed for the correct answer. The paper studies this phenomenon from the perspective of inference dynamics, an area not covered in previous research on hallucination. By analyzing how the model's behavior differs between correct and incorrect outputs, the authors reveal the patterns that accompany hallucination and propose a method for detecting hallucination based on these patterns.
### Main contributions of the paper:
1. **Relationship between hallucination and knowledge recall failure**:
   - When the model generates an incorrect output, the correct answer becomes the highest-probability output during inference with an average frequency of only 30%, compared with 78% for correct outputs. This suggests that hallucination stems from a failure to recall the knowledge.
2. **Influence of the multi-layer perceptron (MLP) module**:
   - Compared with the attention module, the MLP module has a greater impact on incorrect outputs. It not only reduces the probability of the correct answer but also produces the incorrect output in the final decoding layers.
3. **Observation of output token inference dynamics**:
   - For correct outputs, the information of the output token increases sharply in the middle-to-late layers; for incorrect outputs, the increase often starts in shallower layers and is much less pronounced.
4. **Hallucination detection based on dynamic patterns**:
   - Using the per-layer dynamic curves of the output token as features, the authors trained a classifier that detects whether the model is hallucinating with a success rate of 88%.
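
The frequency statistic and the "abrupt increase" in the contributions above can be illustrated with a small sketch. It assumes one plausible reading of the frequency (the fraction of layers at which the answer is the top-probability token); the function names `top1_frequency` and `largest_jump` are hypothetical, and the per-layer distributions are made-up toy numbers, not data from the paper.

```python
from typing import Dict, List, Tuple


def top1_frequency(layer_probs: List[Dict[str, float]], answer: str) -> float:
    """Fraction of layers at which `answer` is the highest-probability token."""
    if not layer_probs:
        raise ValueError("need at least one layer")
    hits = sum(1 for probs in layer_probs if max(probs, key=probs.get) == answer)
    return hits / len(layer_probs)


def largest_jump(curve: List[float]) -> Tuple[int, float]:
    """Return (layer index, size) of the largest layer-to-layer increase."""
    if len(curve) < 2:
        raise ValueError("need at least two layers")
    jumps = [curve[i] - curve[i - 1] for i in range(1, len(curve))]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return best + 1, jumps[best]


# Toy per-layer distributions over a three-token vocabulary (illustrative only).
layers = [
    {"Paris": 0.20, "London": 0.50, "Rome": 0.30},
    {"Paris": 0.40, "London": 0.35, "Rome": 0.25},
    {"Paris": 0.70, "London": 0.20, "Rome": 0.10},
    {"Paris": 0.90, "London": 0.05, "Rome": 0.05},
]

paris_frequency = top1_frequency(layers, "Paris")
london_frequency = top1_frequency(layers, "London")

# The dynamic curve of one token is its probability at each layer.
paris_curve = [probs["Paris"] for probs in layers]
jump_layer, jump_size = largest_jump(paris_curve)
```

In this toy case "Paris" is the top token at three of the four layers, and its curve rises most steeply between the second and third layers.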
### Method overview:
- **Dataset**: The authors modified the queries in the CounterFact dataset and generated more than 30,000 declarative sentences or question-answer pairs to test the model's behavior under different queries.
- **Model**: The experiments use the Llama2-7B-chat model, which has a typical Transformer architecture.
- **Observation methods**:
  - **Logit Lens**: Maps from the model's hidden-state space to the vocabulary space, so that changes in the model's internal state can be observed.
  - **Tuned Lens**: Refines the Logit Lens to observe the model's state at different layers more accurately.
- **Ablation method**: Sets the hidden state at a specific position to zero and observes how the output token changes, in order to analyze the contribution of each module.
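
The observation and ablation methods above can be sketched on a toy residual stream. This is a simplified illustration under stated assumptions, not the paper's implementation: the unembedding matrix `w_u`, the per-layer `updates`, and the function names are all hypothetical; the final layer normalization that a real logit lens applies is omitted; the learned per-layer translators of the Tuned Lens are not modelled; and the ablation here zeroes one module's output at one layer.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unembed(hidden: Vector, w_u: List[Vector]) -> Vector:
    """Logit-lens style readout: project a hidden state to token probabilities."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in w_u]
    return softmax(logits)


def run(
    h0: Vector,
    updates: List[Tuple[Vector, Vector]],
    w_u: List[Vector],
    ablate: Optional[Tuple[int, str]] = None,
) -> List[Vector]:
    """Accumulate the residual stream and read out probabilities after each layer.

    updates[l] is (attention output, MLP output) for layer l.
    ablate=(l, "attn" or "mlp") replaces that module's output with zeros.
    """
    h = list(h0)
    per_layer = []
    for layer, (attn_out, mlp_out) in enumerate(updates):
        for name, out in (("attn", attn_out), ("mlp", mlp_out)):
            if ablate == (layer, name):
                continue  # zero ablation: this module adds nothing to the stream
            h = [a + b for a, b in zip(h, out)]
        per_layer.append(unembed(h, w_u))
    return per_layer


# Toy setup: 2-dimensional hidden state, 2-token vocabulary (illustrative only).
# Token 0 plays the role of the correct answer, token 1 the wrong one.
w_u = [[1.0, 0.0], [0.0, 1.0]]
h0 = [0.0, 0.0]
updates = [
    ([0.5, 0.0], [0.0, 0.2]),  # layer 0: attention favours token 0
    ([1.0, 0.0], [0.0, 2.5]),  # layer 1: the MLP pushes hard towards token 1
]

baseline = run(h0, updates, w_u)
ablated = run(h0, updates, w_u, ablate=(1, "mlp"))
```

Comparing `baseline[-1]` with `ablated[-1]` shows how removing a single module's contribution can flip which token wins at the final layer, which is the kind of comparison the ablation analysis relies on.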
### Experimental results:
- **Accuracy statistics**: The study found that the popularity of knowledge has no significant impact on the occurrence of hallucination. There is no obvious correlation between error types (uncertain responses, irrelevant information, wrong entities) and the popularity of knowledge.
- **Lens observation**: Through Logit Lens and Tuned Lens, it was observed that successfully recalled knowledge is extracted near the middle layer (about the 20th layer), while incorrect outputs start to appear in the early layers.
- **Module contribution**: The MLP module plays a greater role in knowledge recall failure, especially in the later stages of the model, and has a significant impact on the generation of incorrect outputs.
- **Dynamic pattern**: Based on the dynamic curves of output tokens, a linear SVM model was trained, which can effectively detect whether the model is generating hallucinations.
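
A linear SVM over dynamic curves, as described in the last result, might look like the sketch below. It is a minimal pure-Python stand-in trained by subgradient descent on the hinge loss, not the authors' classifier; the hyperparameters (`lr`, `lam`, `epochs`), the function names, and the six four-layer curves are assumptions made up for illustration and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    curves: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.05,
    lam: float = 0.01,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit w, b by per-sample subgradient descent on L2-regularized hinge loss.

    labels are +1 (hallucinated) or -1 (correct).
    """
    w = [0.0] * len(curves[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(curves, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink the weights (gradient of the L2 penalty).
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: move towards classifying x correctly.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (hallucinated) or -1 (correct) for one curve."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy curves: output-token probability at four layers (illustrative only).
correct_curves = [
    [0.05, 0.10, 0.60, 0.90],
    [0.10, 0.10, 0.70, 0.95],
    [0.02, 0.20, 0.50, 0.85],
]
hallucinated_curves = [
    [0.10, 0.15, 0.20, 0.25],
    [0.05, 0.20, 0.25, 0.30],
    [0.10, 0.20, 0.20, 0.35],
]
curves = correct_curves + hallucinated_curves
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(curves, labels)
train_accuracy = sum(
    predict(w, b, x) == y for x, y in zip(curves, labels)
) / len(curves)
```

A linear model is a natural first choice here because the curve is already a compact, low-dimensional feature; in practice a library implementation and a held-out test set would replace this hand-rolled loop.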
### Conclusion:
This study delved into the hallucination phenomenon of large language models on known facts, revealed the mechanism of hallucination generation, and proposed a detection method based on dynamic patterns. These findings are helpful for improving the reliability and accuracy of the model in practical applications.