Abstract: Large Language Models (LLMs) have long dominated artificial intelligence research. Many efficiency techniques, including weight pruning, quantization, and distillation, have been adopted to compress LLMs for memory reduction and inference acceleration, and their success underscores the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization and overlook the search for optimal architectures. Moreover, traditional architecture search methods, limited by the high complexity that comes with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. After generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that uses the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with state-of-the-art (SOTA) training-free structured pruning methods that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. In addition, our generated subnets directly reduce GPU memory usage and achieve inference acceleration. Code: https://github.com/shawnricecake/search-llm

Data-free Weight Compress and Denoise for Large Language Models

Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

Aggressive Post-Training Compression on Extremely Large Language Models

Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models

A Simple and Effective Pruning Approach for Large Language Models

Search for Efficient Large Language Models

A Survey on Model Compression for Large Language Models

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

The Super Weight in Large Language Models

Foundations of Large Language Model Compression -- Part 1: Weight Quantization

Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

LCQ: Low-Rank Codebook based Quantization for Large Language Models

A Comprehensive Study on Quantization Techniques for Large Language Models

LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Structured Pruning of Large Language Models

LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit