EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin

2024-06-22

Abstract: Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and adaptive layer tuning, thereby achieving efficient computation and data movement. Extensive experiments demonstrate that Edge-LLM achieves a 2.92x speedup and a 4x memory overhead reduction compared to vanilla tuning methods, with comparable task accuracy. Our code is available at https://github.com/GATECH-EIC/Edge-LLM

Machine Learning; Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

The paper addresses the problem of efficiently adapting large language models (LLMs) on edge devices. Existing fine-tuning techniques are difficult to apply directly to edge devices (such as edge GPUs and smartphones) because of their high computational and memory overhead. The paper highlights two main challenges:

1. **Computational Overhead**: The forward and backward passes of an LLM are computationally expensive.
2. **Memory Overhead**: Storing the large number of model weights and activations requires substantial memory, especially during fine-tuning.

To tackle these challenges, the authors propose a framework called Edge-LLM, which achieves computation- and memory-efficient LLM fine-tuning through three core components:

- **Layer-wise Unified Compression (LUC)**: Generates a separate pruning sparsity and quantization bit-width policy for each layer, based on how sensitive that layer is to quantization and pruning, thereby reducing computational overhead.
- **Adaptive Layer Tuning and Voting**: Reduces memory overhead by decreasing the depth of backpropagation, and ensures that all layers can still be effectively updated by adaptively selecting different layer segments for updates.
- **Complementary Hardware Scheduling Strategy**: Handles the irregular computation patterns introduced by LUC and adaptive layer tuning, optimizing computational efficiency and data movement.

Experimental results show that, compared to vanilla tuning methods, Edge-LLM achieves a 2.92x speedup and a 4x reduction in memory overhead while maintaining comparable task accuracy.
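To make the first two components more concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not say how sensitivity is measured or how layer segments are chosen, so the bucketed policy assignment, the round-robin segment schedule, and the names `LayerPolicy`, `assign_policies`, and `tuning_window` are all hypothetical stand-ins. The voting step is not covered.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LayerPolicy:
    layer: int
    bits: int        # quantization bit-width for this layer
    sparsity: float  # fraction of this layer's weights that are pruned


def assign_policies(sensitivity: Sequence[float],
                    bit_choices: Sequence[int] = (2, 4, 8),
                    sparsity_choices: Sequence[float] = (0.5, 0.25, 0.0),
                    ) -> List[LayerPolicy]:
    """Assumed heuristic: rank layers by sensitivity and split them into
    equal-sized buckets; less sensitive layers get more aggressive
    compression (fewer bits, higher sparsity)."""
    n = len(sensitivity)
    order = sorted(range(n), key=lambda i: sensitivity[i])  # least sensitive first
    policies = {}
    for rank, layer in enumerate(order):
        bucket = min(rank * len(bit_choices) // n, len(bit_choices) - 1)
        policies[layer] = LayerPolicy(layer, bit_choices[bucket],
                                      sparsity_choices[bucket])
    return [policies[i] for i in range(n)]


def tuning_window(step: int, num_layers: int, window: int) -> range:
    """Assumed schedule: cycle through contiguous layer segments so that
    every layer is eventually updated, while each step only updates
    (and backpropagates through) `window` layers."""
    num_segments = -(-num_layers // window)  # ceiling division
    start = (step % num_segments) * window
    return range(start, min(start + window, num_layers))


if __name__ == "__main__":
    # Made-up sensitivity scores for a 6-layer model.
    for policy in assign_policies([0.9, 0.1, 0.5, 0.3, 0.7, 0.2]):
        print(policy)
    for step in range(4):
        print(step, list(tuning_window(step, num_layers=6, window=2)))
```

In this sketch the two most sensitive layers keep 8-bit weights with no pruning, while the two least sensitive layers drop to 2 bits and 50% sparsity. The window schedule illustrates why memory can shrink: if the loss is attached at the end of the selected segment (an assumption here), activations need to be stored only for the layers in that segment.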
