EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin
2024-06-22
Abstract: Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and adaptive layer tuning, thereby achieving efficient computation and data movements. Extensive experiments demonstrate that Edge-LLM achieves a 2.92x speedup and a 4x memory overhead reduction compared to vanilla tuning methods with comparable task accuracy. Our code is available at https://github.com/GATECH-EIC/Edge-LLM
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the problem of efficiently adapting large language models (LLMs) on edge devices. Existing fine-tuning techniques are hard to apply directly to edge devices (such as edge GPUs and smartphones) because of their high computational and memory overhead. The paper highlights two challenges:

1. **Computational Overhead**: The forward and backward passes of LLMs are computationally expensive.
2. **Memory Overhead**: Storing the large number of model weights and activations requires substantial memory, especially during fine-tuning.

To tackle these challenges, the authors propose a framework called Edge-LLM, which achieves computation- and memory-efficient LLM fine-tuning through three core components:

- **Layer-wise Unified Compression (LUC)**: Generates a pruning sparsity and quantization bit-width policy for each layer, based on the layers' varying sensitivity to quantization and pruning, thereby reducing computational overhead.
- **Adaptive Layer Tuning and Voting**: Reduces memory overhead by decreasing the backpropagation depth, and adaptively selects different layer segments to update so that all layers can still be updated effectively.
- **Complementary Hardware Scheduling Strategy**: Handles the irregular computation patterns introduced by LUC and adaptive layer tuning, optimizing computational efficiency and data movement.

Experimental results show that, compared to vanilla fine-tuning methods, Edge-LLM achieves a 2.92x speedup and a 4x reduction in memory overhead while maintaining comparable task accuracy.
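
The first two components can be illustrated with a toy sketch. It is not the authors' implementation: the sensitivity scores, the tiering rule in `assign_policies`, the candidate bit-widths and sparsities, and the round-robin segment choice in `pick_segment` are all assumptions made for illustration (the summary only says that segments are selected adaptively). The voting step is not sketched, because the summary does not describe how it works. The sketch shows only the general idea: less sensitive layers receive more aggressive compression, and each tuning step backpropagates through a short segment of layers rather than the whole model.

```python
def assign_policies(sensitivity, bit_choices=(2, 4, 8),
                    sparsity_choices=(0.5, 0.25, 0.0)):
    """Map each layer's sensitivity score to a (bit_width, sparsity) pair.

    Hypothetical rule: rank layers by sensitivity, split the ranking into
    equal tiers, and give the least sensitive tier the lowest bit-width
    and the highest pruning sparsity.
    """
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    tiers = len(bit_choices)
    policies = [None] * len(sensitivity)
    for rank, layer in enumerate(order):
        tier = min(rank * tiers // len(order), tiers - 1)
        policies[layer] = (bit_choices[tier], sparsity_choices[tier])
    return policies


def pick_segment(step, num_layers, window):
    """Return the layers to update at this step.

    Assumption: a sliding window that cycles over all start positions,
    standing in for the adaptive selection mentioned in the summary.
    """
    start = step % (num_layers - window + 1)
    return range(start, start + window)


def count_updates(num_layers=8, window=2, steps=14):
    """Count how often each layer is updated when the backpropagation
    depth is capped at `window` layers per step."""
    updates = [0] * num_layers
    for step in range(steps):
        for layer in pick_segment(step, num_layers, window):
            updates[layer] += 1
    return updates


if __name__ == "__main__":
    # Made-up sensitivity scores for a six-layer model.
    print(assign_policies([0.9, 0.1, 0.5, 0.3, 0.7, 0.2]))
    print(count_updates())
```

In this toy setting, each step touches only `window` layers, yet every layer is updated at least once over a full cycle of start positions. That is the property the summary attributes to adaptive layer tuning: a shallow backpropagation depth without leaving any layer untrained.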