Abstract: Large Language Models (LLMs) have transformed the landscape of artificial intelligence, but their enormous size presents significant challenges in terms of computational cost. We introduce LoRAShear, a novel, efficient approach to structurally prune LLMs and recover knowledge. Given general LLMs, LoRAShear first creates dependency graphs over LoRA modules to discover minimally removable structures and analyze the knowledge distribution. It then performs progressive structured pruning on LoRA adaptors and enables inherent knowledge transfer to better preserve the information in the redundant structures. To recover the knowledge lost during pruning, LoRAShear studies and proposes dynamic fine-tuning schemes with dynamic data adaptors to effectively narrow the performance gap to the full models. Numerical results demonstrate that, using only one GPU within a couple of GPU days, LoRAShear reduces the footprint of LLMs by 20% with only 1.0% performance degradation and significantly outperforms state-of-the-art methods. The source code will be available at https://github.com/microsoft/lorashear.

What problem does this paper attempt to address?

The paper addresses the high computational cost and resource consumption of large language models (LLMs) by proposing a method called LoRAShear. LoRAShear compresses LLMs through structured pruning followed by knowledge recovery, reducing model size while maintaining performance under limited compute resources. Specifically, LoRAShear addresses the following issues:

1. **Automatic Discovery of Minimally Removable Structures**: By analyzing dependency graphs over LoRA modules, it automatically identifies the smallest units that can be removed while the remaining model still functions correctly.
2. **Knowledge Distribution Analysis**: It analyzes how knowledge is distributed across model components to determine which parts are crucial for performance, so that critical structures are not removed during pruning.
3. **Progressive Structured Pruning**: A new algorithm called LoRA Half-Space Projected Gradient (LHSPG) progressively identifies and removes redundant structures based on information from the LoRA modules, and transfers the knowledge contained in these structures to more important ones to retain as much of the original model's knowledge as possible.
4. **Dynamic Knowledge Recovery**: By dynamically selecting subsets of the pre-training and instruction fine-tuning datasets and fine-tuning on them, it recovers the knowledge lost during pruning.

Experimental results show that at a 20% pruning rate, LoRAShear loses only about 1% of performance compared to the full model, and at a 50% pruning rate it still retains 82% of the original model's performance, significantly outperforming existing methods.
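To make the idea of scoring and removing whole structures concrete, the following is a minimal sketch, not the paper's LHSPG algorithm. It assumes that one output row of a linear layer is one removable group, that the LoRA update is merged as `W + scale * B @ A`, and that the L2 norm of a merged row serves as its importance score. The function names (`merged_weight`, `row_importance`, `prune_rows`) and the toy matrices are hypothetical and chosen only for illustration.

```python
import math

def merged_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A) for matrices given as nested lists."""
    rows, cols, rank = len(W), len(W[0]), len(A)
    return [
        [W[i][j] + scale * sum(B[i][r] * A[r][j] for r in range(rank))
         for j in range(cols)]
        for i in range(rows)
    ]

def row_importance(M):
    """Assumed importance score: L2 norm of each row (one row = one group)."""
    return [math.sqrt(sum(v * v for v in row)) for row in M]

def prune_rows(W, B, A, prune_ratio, scale=1.0):
    """Remove the lowest-scoring rows of the merged weight.

    Returns the indices of the kept rows and the pruned matrix.
    """
    M = merged_weight(W, B, A, scale)
    scores = row_importance(M)
    n_prune = int(len(M) * prune_ratio)
    # Rank rows from least to most important and drop the first n_prune.
    order = sorted(range(len(M)), key=lambda i: scores[i])
    removed = set(order[:n_prune])
    kept = [i for i in range(len(M)) if i not in removed]
    return kept, [M[i] for i in kept]

# Toy layer: 5 output rows, 2 input columns, LoRA rank 1 (illustrative values).
W = [[1.0, 0.0], [0.0, 0.1], [2.0, 2.0], [0.05, 0.0], [0.0, 3.0]]
B = [[0.0], [0.1], [0.0], [0.0], [1.0]]  # shape 5 x 1
A = [[0.0, 1.0]]                         # shape 1 x 2

kept, pruned = prune_rows(W, B, A, prune_ratio=0.4)
print(kept)    # indices of the rows that survive pruning
print(pruned)  # merged weight restricted to the kept rows
```

In this toy case the two rows with the smallest merged norm are dropped. A real structured pruner would also have to remove the matching input columns of every layer that consumes these outputs, which is the coupling the dependency-graph analysis described above is meant to capture.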

LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery

LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

LLM-Pruner: On the Structural Pruning of Large Language Models

LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Search for Efficient Large Language Models

LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

Fluctuation-based Adaptive Structured Pruning for Large Language Models

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

LaCo: Large Language Model Pruning via Layer Collapse