Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Xi Chen, Cunhang Fan, Zhao Lv, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui

2024-06-24

Abstract: While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses the difficulty of deploying large language models (LLMs) in resource-constrained environments. Although LLMs perform well in many fields, their large parameter counts and computational requirements make them hard to run on resource-limited devices. Current compression techniques, such as parameter pruning, discard the pruned parameters and so often fail to make use of the knowledge those parameters contain. To solve these problems, the authors propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA). MKA uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) metric to identify similar layers and merge them, reducing model size while maintaining performance. Experimental results show that MKA maintains model performance while achieving substantial compression ratios, outperforming traditional pruning methods. When combined with quantization, MKA achieves even greater compression. Specifically, with the Llama3-8B model on the MMLU dataset, MKA achieves a 43.75% compression ratio with only a 2.82% performance drop. MKA thus offers a resource-efficient, performance-preserving compression technique for LLMs.
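The general idea of merging similar layers can be illustrated with a minimal sketch. This is not the authors' implementation: plain cosine similarity on raw activations stands in for the NPIB measure on manifold embeddings, the merge is a simple linear interpolation with a hypothetical weight `alpha`, and all function names are made up for illustration. The last helper assumes the compression ratio is counted over layers; under that assumption, 43.75% would correspond to reducing a 32-layer model such as Llama3-8B to 18 layers.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation arrays."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar_adjacent_pair(activations):
    """Return (index, score) of the most similar pair of neighbouring layers.

    `activations` is a list with one array per layer, all of the same shape.
    Cosine similarity is a stand-in here, not the NPIB measure from the paper.
    """
    scores = [
        cosine_similarity(activations[i], activations[i + 1])
        for i in range(len(activations) - 1)
    ]
    best = int(np.argmax(scores))
    return best, scores[best]


def merge_layers(weights, i, alpha=0.5):
    """Replace layers i and i+1 with one interpolated layer.

    `weights` is a list of dicts mapping parameter names to arrays.
    `alpha` is a hypothetical mixing weight (assumption, not from the paper).
    """
    merged = {
        name: alpha * weights[i][name] + (1.0 - alpha) * weights[i + 1][name]
        for name in weights[i]
    }
    return weights[:i] + [merged] + weights[i + 2:]


def compression_ratio(n_original, n_remaining):
    """Fraction of layers removed, assuming the ratio is counted over layers."""
    return 1.0 - n_remaining / n_original
```

A typical use would be to collect per-layer activations on a small calibration set, call `most_similar_adjacent_pair` to pick a pair, apply `merge_layers`, and repeat until the target `compression_ratio` is reached.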

Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

MCMC: Multi-Constrained Model Compression Via One-Stage Envelope Reinforcement Learning

Pruning Foundation Models for High Accuracy without Retraining

Pruning as a Domain-specific LLM Extractor

LLM-Pruner: On the Structural Pruning of Large Language Models

Reassessing Layer Pruning in LLMs: New Insights and Methods

Compact Language Models via Pruning and Knowledge Distillation

Streamlining Redundant Layers to Compress Large Language Models

OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models

SparseLLM: Towards Global Pruning for Pre-trained Language Models

DPPA: Pruning Method for Large Language Model to Model Merging

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

LLM-BIP: Structured Pruning for Large Language Models with Block-Wise Forward Importance Propagation

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient