Abstract: Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM's ability to generate text -- increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.
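
To make the weight-quantization claim concrete, here is a minimal sketch in plain Python of why holding one large weight out of a round-to-nearest block can help. It is not the authors' implementation. The function names, the 4-bit setting, and the toy values are all assumptions, the largest-magnitude entry stands in for a super weight, and the clipping of other outliers mentioned in the abstract is left out.

```python
def rtn_quantize(values, bits=4):
    """Symmetric round-to-nearest: map to signed integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax  # one scale shared by the whole block
    return [round(v / scale) * scale for v in values]


def rtn_quantize_preserving_outlier(values, bits=4):
    """Hold out the largest-magnitude entry, quantize the rest, restore it."""
    # Assumption: the largest-magnitude entry plays the role of the super weight.
    idx = max(range(len(values)), key=lambda i: abs(values[i]))
    rest = values[:idx] + values[idx + 1:]
    quantized = rtn_quantize(rest, bits)
    quantized.insert(idx, values[idx])  # keep the outlier at full precision
    return quantized


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Toy block: small weights plus one large outlier (made-up values).
weights = [0.02, -0.05, 0.11, -0.08, 0.03, 0.07, -0.01, 2.5]

plain = rtn_quantize(weights)
preserved = rtn_quantize_preserving_outlier(weights)

err_plain = mean_abs_error(weights, plain)
err_preserved = mean_abs_error(weights, preserved)
print(f"plain RTN mean abs error:         {err_plain:.4f}")
print(f"outlier-preserved mean abs error: {err_preserved:.4f}")
```

In the plain case the single outlier sets the block's scale, so every small weight rounds to zero. Holding the outlier out lets the scale fit the remaining weights, which is the intuition behind scaling to larger block sizes.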

What problem does this paper attempt to address?

This paper studies a special class of parameters in large language models (LLMs), called "super weights", and shows that they are extremely important to model quality. It focuses on four aspects:

1. **The influence of super weights**: Pruning even a single super weight can destroy an LLM's ability to generate text. For example, in Llama-7B, pruning the super weights reduces zero-shot accuracy to chance (guessing) level and increases perplexity by three orders of magnitude.
2. **Methods for identifying super weights**: The paper proposes a data-free method that identifies super weights with a single forward pass, and it provides super weight indices for common open-source LLMs.
3. **The role of super activations**: The paper analyzes how super weights affect inference, in particular their relationship to the activation outliers observed in previous studies. Super weights amplify the input activation and produce "super activations", which keep a constant position and magnitude in the model.
4. **Improvement of compression methods**: By preserving super weights and super activations, round-to-nearest quantization improves significantly and becomes comparable to state-of-the-art quantization methods. For activation quantization, the approach is competitive with SmoothQuant. For weight quantization, preserving super weights and clipping other outliers lets round-to-nearest quantization scale to larger block sizes.

### Main contributions of the paper

1. **Discovery of super weights**: LLMs contain a small number of super weights (up to six scalars) that are disproportionately important to model quality.
2. **Method for identifying super weights**: A data-free method that requires only a single forward pass, together with super weight indices for existing open-source LLMs.
3. **Analysis of super activations**: A study of how super weights affect inference and how they relate to activation outliers.
4. **Improvement of compression methods**: Preserving super outliers significantly improves round-to-nearest quantization and therefore compression quality.

### Related work

The paper reviews existing research on outliers in LLMs, covering both weight outliers and activation outliers, and discusses existing quantization methods. The authors emphasize that their work differs in exploiting these super outliers rather than mitigating them.

### Experimental results

The experiments verify the importance of super weights and super activations and show the effectiveness of the proposed quantization method on multiple LLMs. Simply preserving super activations significantly improves the quality of the quantized model, especially for activation quantization, where results are close to or better than those of the more complex SmoothQuant method.

Through these studies, the paper offers a new perspective on understanding and optimizing large language models, particularly for model compression and quantization.
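
The summary above does not spell out how the single forward pass locates a super weight. The toy sketch below shows one plausible reading and is an assumption rather than the paper's procedure: in a linear layer, take the output channel with the largest activation as the row and the input channel with the largest activation as the column, then check that zeroing that one entry removes the output spike. The matrix `W`, the input `x`, and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def argmax_abs(values):
    """Index of the entry with the largest magnitude."""
    return max(range(len(values)), key=lambda i: abs(values[i]))


# Hypothetical 3x4 projection; entry [1][2] is planted as unusually large.
W = [
    [0.10, -0.20, 0.05, 0.10],
    [0.05, 0.10, 8.00, -0.10],
    [-0.10, 0.05, 0.10, 0.20],
]
# Hypothetical input activation with a spike in channel 2.
x = [0.3, -0.2, 4.0, 0.1]

# "Forward pass" through the single toy layer.
y = matvec(W, x)

# Assumed heuristic: output spike gives the row, input spike gives the column.
row = argmax_abs(y)
col = argmax_abs(x)
candidate = (row, col)

# Prune only the candidate weight and rerun the layer.
pruned = [r[:] for r in W]
pruned[row][col] = 0.0
y_pruned = matvec(pruned, x)

print("candidate coordinate:", candidate)
print("spike before pruning:", round(y[row], 3))
print("same channel after pruning:", round(y_pruned[row], 3))
```

In a real model the activations would come from an actual forward pass through a specific layer rather than from hand-written lists, and the published index of super weight coordinates is the authoritative source for where they sit.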

The Super Weight in Large Language Models

Data-free Weight Compress and Denoise for Large Language Models

A Simple and Effective Pruning Approach for Large Language Models

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other

House of Cards: Massive Weights in LLMs

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

BiSup: Bidirectional Quantization Error Suppression for Large Language Models

Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs

Understanding the difficulty of low-precision post-training quantization of large language models

What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

GWQ: Gradient-Aware Weight Quantization for Large Language Models