Compressing LLMs: The Truth is Rarely Pure and Never Simple

Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang
2024-03-17
Abstract: Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncompressed baseline. As recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which have significant alignment with their dense counterparts and for which perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity in knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at $\geq 50$% sparsity are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. The reproduced code is available at https://github.com/VITA-Group/llm-kick.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses how compressed large language models (LLMs) should be evaluated. Despite their achievements, modern LLMs carry high computational and memory costs. Recent studies have shown that training-free and data-free compression methods (pruning and quantization) can reach 50-60% sparsity and reduce weights to 3 or 4 bits with negligible degradation in perplexity. However, these studies rely mainly on perplexity, a fairly simple and widely questioned metric. The authors re-evaluate existing state-of-the-art compression methods and propose a new benchmark, the "Knowledge-Intensive Compressed LLM BenchmarK" (LLM-KICK), to redefine the evaluation protocol for compressed LLMs. LLM-KICK is a collection of carefully curated tasks that assess the capabilities of compressed LLMs in language understanding, reasoning, generation, in-context retrieval, and in-context summarization. Specifically, the paper reports the following findings:
1. All state-of-the-art pruning methods suffer significant performance degradation, sometimes at sparsity ratios as low as 25-30%.
2. All pruning methods fail with N:M sparsity patterns on knowledge-intensive tasks.
3. Current state-of-the-art quantization methods preserve performance better than pruning methods.
4. Although compressed LLMs generate fluent and coherent text, they fall short in generating knowledge-rich and factually correct answers.
5. A larger LLM compressed down to the same parameter count as a smaller dense model performs worse than that smaller dense model.
Through these observations, the paper hopes to promote the development of better LLM compression methods.
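To make the terms above concrete, the following is a minimal sketch of the perplexity metric and of the difference between unstructured sparsity and N:M sparsity. It rests on assumptions that are not taken from the paper: the function names `perplexity`, `prune_unstructured`, and `prune_n_m` are hypothetical, weights are a flat Python list, and simple magnitude-based selection stands in for whichever pruning criteria the evaluated methods actually use. The N:M convention assumed here keeps N non-zero weights in every contiguous group of M.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative natural-log likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_unstructured(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude.

    Illustrative magnitude criterion only (an assumption of this sketch).
    """
    k = int(len(weights) * sparsity)
    # Indices of the k smallest-magnitude weights, chosen across the whole list.
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude weights in each contiguous group of m."""
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices (within the group) of the n largest-magnitude weights.
        keep = set(
            sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        )
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, 0.8, 0.7, 0.6, 0.1, 0.05, 0.02, 0.01]
print(prune_unstructured(weights, 0.5))  # [0.9, 0.8, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0]
print(prune_n_m(weights, 2, 4))          # [0.9, 0.8, 0.0, 0.0, 0.1, 0.05, 0.0, 0.0]
```

Both calls produce 50% sparsity, but the 2:4 pattern must zero two weights in every group of four, so in this toy example it discards 0.7 and 0.6 while retaining much smaller values. This local constraint is one plausible reason N:M patterns can be harder on a model than unstructured pruning at the same sparsity ratio, although the sketch does not reproduce any of the paper's experiments.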