Abstract: Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize in new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or to serve multiple experts on a single GPU. To address these issues, we present ComPEFT, a novel method for compressing fine-tuning residuals (task vectors) of PEFT-based models. ComPEFT employs sparsification and ternary quantization to reduce the size of the PEFT module without performing any additional retraining while preserving or enhancing model performance. In extensive evaluation across T5, T0, and LLaMA-based models with 200M to 65B parameters, ComPEFT achieves compression ratios of 8x to 50x. In particular, we show that ComPEFT improves with scale: stronger models exhibit higher compressibility and better performance. For example, we show that ComPEFT applied to LLaMA outperforms QLoRA by 4.16% on MMLU with a storage size reduction of up to 26x. In addition, we show that the compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged. Lastly, we provide an analysis of different method components, compare ComPEFT with other PEFT methods, and test its efficacy for compressing the residual of full fine-tuning. Our code is available at https://github.com/prateeky2806/compeft.
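
The two steps named in the abstract, sparsification followed by ternary quantization of the task vector, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `density` parameter, and in particular the choice of the shared scale (here the mean magnitude of the kept entries) are assumptions made for illustration only. The abstract does not specify how the scale is chosen.

```python
# Illustrative sketch only; names and the scale rule are hypothetical.

def compress_task_vector(base, finetuned, density=0.2):
    """Sparsify and ternarize the residual (task vector) finetuned - base.

    Returns a list of signs in {-1, 0, +1} and a single scalar scale.
    """
    # Task vector: the fine-tuning residual relative to the base weights.
    tau = [f - b for f, b in zip(finetuned, base)]

    # Sparsification: keep only the k largest-magnitude entries.
    k = max(1, round(density * len(tau)))
    order = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)
    keep = set(order[:k])

    # Ternary quantization: kept entries become +1 or -1, the rest 0.
    signs = []
    for i, value in enumerate(tau):
        if i in keep and value != 0:
            signs.append(1 if value > 0 else -1)
        else:
            signs.append(0)

    # Assumed scale rule: mean magnitude of the kept entries.
    scale = sum(abs(tau[i]) for i in keep) / len(keep)
    return signs, scale


def decompress(base, signs, scale):
    """Rebuild approximate fine-tuned weights from the ternary residual."""
    return [b + scale * s for b, s in zip(base, signs)]


base = [0.0, 1.0, -1.0, 0.5, 2.0]
finetuned = [0.5, 1.0, -1.4, 0.51, 1.0]
signs, scale = compress_task_vector(base, finetuned, density=0.4)
restored = decompress(base, signs, scale)
print(signs, scale, restored)
```

In this toy example only two of the five residual entries survive, and each is stored as a sign plus one shared scalar, which is where the storage saving would come from. No retraining is involved, in line with the abstract's description.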

Weight Squeezing: Reparameterization for Knowledge Transfer and Model Compression

Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

Weight-Inherited Distillation for Task-Agnostic BERT Compression

Weight Distillation: Transferring the Knowledge in Neural Network Parameters

Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models

Deep-to-Bottom Weights Decay: A Systemic Knowledge Review Learning Technique for Transformer Layers in Knowledge Distillation

A Model Compression Method Using Significant Data and Knowledge Distillation

LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression

Hierarchical Knowledge Squeezed Adversarial Network Compression

Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression

Uncertainty-Driven Knowledge Distillation for Language Model Compression

Knowledge Translation: A New Pathway for Model Compression

AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression

Patient Knowledge Distillation for BERT Model Compression

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Parameter-efficient Weight Ensembling Facilitates Task-level Knowledge Transfer

ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

Exploring Extreme Parameter Compression for Pre-trained Language Models

Knowledge Distillation Application Technology for Chinese NLP