Abstract: Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit width to 3 or 4 bits per weight with negligible degradation of perplexity over the uncompressed baseline. As recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which align closely with their dense counterparts, so that perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet pruned LLMs, even at $\geq 50$% sparsity, are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. The code is available at <a class="link-external link-https" href="https://github.com/VITA-Group/llm-kick" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

The paper addresses how compressed large language models (LLMs) should be evaluated. Despite their significant achievements, modern LLMs carry high computational and memory costs. Recent studies have shown that training-free and data-free compression methods (pruning and quantization) can reach 50-60% sparsity and reduce the weight bit width to 3 or 4 bits without significantly affecting perplexity. However, these studies rely mainly on perplexity, a relatively simple and widely questioned evaluation metric.

The authors reassess the effectiveness of existing state-of-the-art compression methods and propose a new benchmark, the "Knowledge-Intensive Compressed LLM BenchmarK" (LLM-KICK), to redefine the evaluation protocol for compressed LLMs. LLM-KICK comprises a set of carefully curated tasks that evaluate the capabilities of compressed LLMs in language understanding, reasoning, generation, in-context retrieval, and in-context summarization.

Specifically, the paper reports the following findings:

1. Most state-of-the-art pruning methods show significant performance degradation, sometimes even at low sparsity ratios (such as 25-30%).
2. All pruning methods perform poorly with structured N:M sparsity patterns on knowledge-intensive tasks.
3. Current state-of-the-art quantization methods preserve performance better than pruning methods.
4. Although compressed LLMs generate fluent and coherent text, they fall short in generating knowledge-rich and factually correct answers.
5. Compressed LLMs with larger architectures but the same number of parameters perform worse than smaller dense models.

Through these observations, the paper hopes to promote the development of better LLM compression methods.
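The two quantities the summary leans on, perplexity and N:M sparsity, can be made concrete with a minimal pure-Python sketch. It is illustrative only and is not taken from the LLM-KICK code base: the function names `perplexity`, `nm_magnitude_mask`, and `sparsity` are hypothetical, the toy weights are made up, and the sketch assumes the common convention that N:M keeps the N largest-magnitude weights in every group of M consecutive weights (simple magnitude pruning, not any specific method evaluated in the paper).

```python
import math


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def nm_magnitude_mask(weights, n=2, m=4):
    """Binary mask keeping the n largest-magnitude weights in each group of m.

    Assumption: n counts the weights that are KEPT per group of m.
    """
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest |w| within this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def sparsity(mask):
    """Fraction of weights that the mask sets to zero."""
    return 1 - sum(mask) / len(mask)


# Toy weight row (made-up values) pruned to a 2:4 pattern.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
mask = nm_magnitude_mask(row, n=2, m=4)
pruned = [w * k for w, k in zip(row, mask)]

# A model that assigns probability 0.25 to every observed token.
uniform_ppl = perplexity([math.log(0.25)] * 3)
```

With these toy inputs a 2:4 mask zeroes half of the weights, which corresponds to the 50% sparsity level discussed above. Perplexity only summarizes average next-token likelihood, which is why the paper argues for task-based evaluation alongside it.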

Compressing LLMs: The Truth is Rarely Pure and Never Simple

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications

Semantic Compression With Large Language Models

A Survey on Model Compression for Large Language Models

Ranking LLMs by compression

Aggressive Post-Training Compression on Extremely Large Language Models

Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward

Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

Compression Represents Intelligence Linearly

Compressing Large Language Models by Joint Sparsification and Quantization

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching

Activation Sparsity Opportunities for Compressing General Large Language Models

SqueezeLLM: Dense-and-Sparse Quantization

Just CHOP: Embarrassingly Simple LLM Compression