House of Cards: Massive Weights in LLMs

Jaehoon Oh, Seungjun Shin, Dokwan Oh
2024-10-02
Abstract: Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-$k$ massive weights as the weights that contribute to the dimensions with the top-$k$ magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through experiments, we demonstrate that MacDrop generally improves performance across zero-shot downstream tasks and generation tasks.
Machine Learning, Artificial Intelligence, Computation and Language
### What problem does this paper attempt to solve?

This paper addresses a phenomenon in large language models (LLMs) called "massive activations". These activations occur in specific feature dimensions of the hidden state and introduce a significant bias, causing the model to overemphasize the corresponding tokens. The authors find that massive activations originate not from the hidden state itself but from the intermediate state of a feed-forward network (FFN) module.

#### Main problems:

1. **Origin of massive activations**: The authors show that massive activations originate from the intermediate state of the FFN module in an early layer, not from the hidden state.
2. **Impact of massive weights**: The authors define "top-k massive weights" as the weights that contribute to the dimensions with the top-k magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of the LLM is entirely disrupted. When all weights except the massive weights are set to zero, the performance drop is relatively minor, even though far more weights are zeroed. This suggests that pre-training concentrates learning on the massive weights.
3. **Over-reliance on massive weights**: Because LLMs depend so heavily on these massive weights, the model is unstable when facing attacks or fine-tuning.

#### Proposed solutions:

To reduce this dependence, the authors propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout). During parameter-efficient fine-tuning, MacDrop applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. The model thereby relies less on massive weights, which improves performance on zero-shot downstream tasks and generation tasks.
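
The two ideas above, selecting top-k massive weights and applying curriculum dropout to them, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the linear decay schedule, the starting probability `p_start=0.8`, the element-wise dropout without rescaling, and the layout in which row `i` of the weight matrix produces intermediate dimension `i` are all assumptions made for illustration.

```python
import random


def top_k_dims(intermediate, k):
    """Return the (sorted) indices of the k largest-magnitude entries."""
    order = sorted(range(len(intermediate)),
                   key=lambda i: abs(intermediate[i]),
                   reverse=True)
    return sorted(order[:k])


def dropout_prob(step, total_steps, p_start=0.8):
    """Assumed curriculum: decay linearly from p_start at step 0 to 0 at the end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start * (1.0 - frac)


def macdrop_rows(weight, massive_dims, p, rng):
    """Copy `weight`, zeroing each entry of the massive rows with probability p.

    Rows outside `massive_dims` are left unchanged. No rescaling is applied,
    which is an assumption of this sketch.
    """
    dropped = [row[:] for row in weight]
    for i in massive_dims:
        dropped[i] = [0.0 if rng.random() < p else w for w in weight[i]]
    return dropped


# Toy example: a 5-dimensional intermediate state and a 5x3 weight matrix.
rng = random.Random(0)
intermediate = [0.1, -42.0, 0.3, 7.5, -0.2]
dims = top_k_dims(intermediate, k=2)  # dimensions 1 and 3 dominate
weight = [[1.0, 2.0, 3.0] for _ in intermediate]

for step in (0, 50, 100):
    p = dropout_prob(step, total_steps=100)
    masked = macdrop_rows(weight, dims, p, rng)
    kept = sum(1 for i in dims for w in masked[i] if w != 0.0)
    print(f"step={step:3d}  p={p:.2f}  massive entries kept={kept}")
```

Early in fine-tuning most entries of the massive rows are dropped, so the trainable parameters have to compensate. By the final step the dropout probability reaches zero and the pre-trained massive weights are used unchanged.
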
#### Experimental results:

- **Zero-shot downstream tasks**: The experiments show that MacDrop can significantly improve model performance on certain tasks, especially ARC-Easy and ARC-Challenge.
- **Generation tasks**: The gains on generation tasks are limited, but MacDrop still yields some improvement.

Overall, this paper exposes a potential weakness in LLMs by identifying and analyzing massive activations and the weights that cause them, and it proposes a method intended to improve the model's robustness and generalization.