The Super Weight in Large Language Models

Mengxia Yu, De Wang, Qi Shan, Colorado Reed, Alvin Wan
2024-11-12
Abstract: Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM's ability to generate text -- increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.
Computation and Language, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper explores a special class of parameters in large language models (LLMs), called "super weights", and shows that they are extremely important to model performance. Specifically, the paper focuses on the following aspects:

1. **The influence of super weights**: Pruning even a single super weight can destroy the LLM's ability to generate text. For example, in Llama-7B, pruning the super weight reduces zero-shot accuracy to guessing level and increases perplexity by three orders of magnitude.
2. **Methods for identifying super weights**: The paper proposes a data-free method that identifies super weights with a single forward pass, and it provides super weight indices for common open-source LLMs.
3. **The role of super activations**: The paper analyzes how super weights affect inference, especially their relationship to the activation outliers observed in previous studies. Super weights amplify the input activation, producing so-called "super activations", which keep a constant position and magnitude in the model.
4. **Improvement of compression methods**: By preserving super weights and super activations, the paper shows that round-to-nearest quantization improves significantly and becomes comparable to state-of-the-art quantization methods. For activation quantization, the method is competitive with SmoothQuant; for weight quantization, preserving the super weight and clipping other outliers lets round-to-nearest quantization scale to larger block sizes.

### Main contributions of the paper

1. **Discovery of super weights**: The paper reveals that LLMs contain a small number of super weights (up to six scalars) that are disproportionately important to model quality.
2. **Method for identifying super weights**: It proposes a data-free method that requires only a single forward pass to identify super weights, and it provides super weight indices for existing open-source LLMs.
3. **Analysis of super activations**: It studies how super weights affect inference and relates them to activation outliers.
4. **Improvement of compression methods**: It shows that preserving super outliers significantly improves round-to-nearest quantization and therefore compression quality.

### Related work

The paper reviews existing research on outliers in LLMs, including weight outliers and activation outliers, and discusses existing quantization methods. The authors emphasize that their work is distinctive in exploiting these super outliers rather than mitigating them.

### Experimental results

The experiments verify the importance of super weights and super activations and show the effectiveness of the proposed quantization method on multiple LLMs. The results indicate that simply preserving super activations significantly improves the quality of the quantized model, especially for activation quantization, where the effect is close to or exceeds that of the more complex SmoothQuant method.

Through these studies, the paper provides a new perspective for understanding and optimizing large language models, especially in terms of model compression and quantization.
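
The weight-quantization idea summarized above (clip weight outliers, apply round-to-nearest, and keep the super weight in high precision) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: the function names, the fixed `clip` threshold, the symmetric absmax scaling, and the eight made-up weight values are all hypothetical choices for illustration.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest: map to signed integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in weights)
    if absmax == 0.0:
        return list(weights)
    scale = absmax / qmax  # one scale for the whole block
    return [round(w / scale) * scale for w in weights]


def quantize_preserving_super_weight(weights, super_index, clip, bits=4):
    """Clip outliers, quantize with RTN, then restore the held-out super weight.

    `clip` is an assumed, hand-picked threshold for this illustration.
    """
    held_out = weights[super_index]
    # Clip every weight (the super weight included) so outliers
    # do not inflate the quantization scale.
    clipped = [max(-clip, min(clip, w)) for w in weights]
    dequantized = rtn_quantize(clipped, bits)
    # Put the super weight back at full precision.
    dequantized[super_index] = held_out
    return dequantized


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy block: small weights plus one large "super weight" at index 3
# (made-up values, not taken from any real model).
weights = [0.02, -0.05, 0.03, 2.5, -0.01, 0.04, -0.03, 0.01]

naive = rtn_quantize(weights, bits=4)
preserved = quantize_preserving_super_weight(
    weights, super_index=3, clip=0.05, bits=4
)

naive_error = mean_squared_error(weights, naive)
preserved_error = mean_squared_error(weights, preserved)
print(f"naive RTN MSE:      {naive_error:.2e}")
print(f"super-weight-aware: {preserved_error:.2e}")
```

In this toy block, naive round-to-nearest lets the single large weight set the scale, so every small weight rounds to zero. Holding the large weight out and clipping the rest gives the small weights the full integer range, which is the intuition behind scaling round-to-nearest to larger block sizes.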