Abstract: Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-$k$ massive weights as the weights that contribute to the dimensions with the top-$k$ magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through experiments, we demonstrate that MacDrop generally improves performance across zero-shot downstream tasks and generation tasks.
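
The definition of top-$k$ massive weights can be illustrated with a toy example. The sketch below makes assumptions that the abstract does not confirm: the intermediate state is modeled as a single linear projection of the hidden state, and the weights "contributing to" an intermediate dimension are taken to be the matching row of that projection matrix. The matrix `W_up`, the input `x`, and all helper names are hypothetical.

```python
# Toy sketch: find the top-k "massive" intermediate dimensions of one FFN
# projection and build the two ablated weight matrices (massive weights
# zeroed, and everything except massive weights zeroed).
# Assumption: intermediate = W_up @ x; names and numbers are hypothetical.

def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def top_k_dims(intermediate, k):
    """Indices of the k entries with the largest absolute value."""
    order = sorted(range(len(intermediate)),
                   key=lambda i: abs(intermediate[i]), reverse=True)
    return order[:k]

def zero_rows(W, rows):
    """Copy of W with the given rows (the assumed massive weights) zeroed."""
    rows = set(rows)
    return [[0.0] * len(r) if i in rows else list(r) for i, r in enumerate(W)]

def keep_only_rows(W, rows):
    """Copy of W with every row except the given ones zeroed."""
    rows = set(rows)
    return [list(r) if i in rows else [0.0] * len(r) for i, r in enumerate(W)]

# Hypothetical 4x3 projection; row 1 dominates the intermediate state.
W_up = [
    [0.1, 0.0, -0.1],
    [5.0, 4.0, 3.0],
    [0.2, 0.1, 0.0],
    [-0.3, 0.2, 0.1],
]
x = [1.0, 1.0, 1.0]

intermediate = matvec(W_up, x)
massive = top_k_dims(intermediate, k=1)
without_massive = zero_rows(W_up, massive)    # few weights zeroed
only_massive = keep_only_rows(W_up, massive)  # many more weights zeroed

print("massive dims:", massive)
```

In this toy setting, `without_massive` zeroes a single row while `only_massive` zeroes all the others, which mirrors the two ablations compared in the abstract. It says nothing about how a real model behaves under either ablation.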

### What problem does this paper attempt to solve?

This paper studies a phenomenon in large language models (LLMs) called "massive activations". These activations occur in specific feature dimensions of the hidden state and introduce a significant bias, leading to overemphasis on the corresponding tokens. The authors find that massive activations originate not from the hidden state itself but from the intermediate state of a feed-forward network (FFN) module.

#### Main problems:

1. **Origin of massive activations**: Massive activations originate from the intermediate state of the FFN module in an early layer, not from the hidden state.
2. **Impact of massive weights**: The authors define "top-$k$ massive weights" as the weights that contribute to the dimensions with the top-$k$ magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of the LLM is completely disrupted. When all weights except the massive weights are set to zero, the performance drop is relatively small, even though far more weights are zeroed. This suggests that pre-training concentrates learning on the massive weights.
3. **Over-reliance on massive weights**: Because LLMs rely heavily on these massive weights, the model can become unstable under attacks or during fine-tuning.

#### Proposed solutions:

To reduce this reliance, the authors propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout). During parameter-efficient fine-tuning, the method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. This lowers the model's dependence on massive weights and improves performance on zero-shot downstream tasks and generation tasks.
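
The curriculum idea behind MacDrop (start with a high dropout probability on the massive weights, then decrease it) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear decay, the starting probability `p_start=0.8`, the element-wise masking without rescaling, and all function names are hypothetical choices made only for this example.

```python
import random

def macdrop_probability(step, total_steps, p_start=0.8):
    """Dropout probability at a given fine-tuning step.

    Assumed schedule: linear decay from p_start at step 0 to 0.0 at
    total_steps. The summary only says "high, then gradually decreasing".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start * (1.0 - frac)

def drop_massive_weights(W, massive_rows, p, rng):
    """Return a copy of W with dropout applied only to the massive rows.

    Each entry in a massive row is zeroed independently with probability p.
    No 1/(1-p) rescaling is applied; that choice is an assumption.
    """
    rows = set(massive_rows)
    dropped = []
    for i, row in enumerate(W):
        if i in rows:
            dropped.append([0.0 if rng.random() < p else w for w in row])
        else:
            dropped.append(list(row))
    return dropped

# Hypothetical usage: row 1 plays the role of the massive weights.
rng = random.Random(0)
W = [[0.1, 0.2], [5.0, 4.0], [0.3, 0.4]]
total_steps = 4
for step in range(total_steps + 1):
    p = macdrop_probability(step, total_steps)
    W_step = drop_massive_weights(W, massive_rows=[1], p=p, rng=rng)
    print(step, round(p, 2), W_step[1])
```

The pre-trained matrix `W` is never modified. A fresh masked copy is drawn at each step, so the massive weights are fully restored once the probability reaches zero.
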
#### Experimental results:

- **Zero-shot downstream tasks**: The experimental results show that MacDrop can significantly improve model performance on certain tasks, especially ARC-Easy and ARC-Challenge.
- **Generation tasks**: The improvement on generation tasks is limited, but MacDrop still yields some performance gain.

Overall, this paper reveals potential problems in LLMs by identifying and analyzing massive activations and their related weights, and proposes a method to improve the robustness and generalization ability of the model.

House of Cards: Massive Weights in LLMs
