Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

Yanjie Dong, Haijun Zhang, Chengming Li, Song Guo, Victor C. M. Leung, Xiping Hu
2024-10-01
Abstract: Since the introduction of GPT-2 (1.5B) in 2019, large language models (LLMs) have transitioned from specialized models to versatile foundation models. LLMs exhibit impressive zero-shot ability, but they require fine-tuning on local datasets and significant resources for deployment. Traditional fine-tuning techniques based on first-order optimizers require substantial GPU memory that exceeds the capability of mainstream hardware, which motivates the investigation of memory-efficient methods. Model compression techniques can reduce energy consumption, operational costs, and environmental impact, thereby supporting sustainable advances in artificial intelligence. Additionally, large-scale foundation models have expanded to generate images, audio, video, and multi-modal content, further emphasizing the need for efficient deployment. We are therefore motivated to present a comprehensive overview of the prevalent memory-efficient fine-tuning methods over the network edge. We also review the state-of-the-art literature on model compression to provide a vision for deploying LLMs over the network edge.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problems that large language models (LLMs) face in fine-tuning and deployment on edge devices. Specifically, it focuses on the following key issues:

1. **High memory requirements**: Traditional fine-tuning methods need more memory than mainstream hardware provides. For example, fine-tuning an LLM with the Adam optimizer requires far more memory than running inference on the same model, which makes it very difficult to fine-tune LLMs on resource-constrained edge devices.
2. **High computational resource requirements**: LLMs usually contain billions or even hundreds of billions of parameters, resulting in huge computational requirements. Traditional first-order optimizers (such as Adam, AdaGrad, and SGD) need back-propagation to obtain the gradients of the loss function, which further increases the computational cost.
3. **Environmental impact**: Training and deploying large-scale models consume a large amount of energy, increasing operating costs and environmental impact. Researchers therefore need energy-efficient methods to support sustainable artificial intelligence development.

To address these problems, the paper reviews several approaches:

- **Parameter-Efficient Fine-Tuning (PEFT)**: Reduces memory and computational requirements by updating only a small part of the model's weights instead of the entire model. PEFT methods can be divided into parallel PEFT, serial PEFT, and selective PEFT.
- **Memory-Efficient Full Fine-Tuning (MEF2T)**: Uses optimizers that do not require back-propagation, such as zeroth-order (ZO) optimizers, to reduce memory usage. These methods can be implemented on a single client or across a distributed network.
- **Model compression**: Reduces the size of the model through techniques such as pruning, knowledge distillation, and quantization, thereby reducing storage and computational requirements. The paper also discusses how to maintain model performance without retraining.

The main contributions of the paper include:

- **Review of PEFT and MEF2T methods**: It details the applications of these methods in distributed edge networks and verifies their effectiveness through numerical experiments.
- **Classification of model compression methods**: It classifies existing model compression methods into post-compression training, recompression training, and one-time compression, and discusses the advantages and disadvantages of each.
- **Future directions**: It proposes future research directions, especially the challenges and opportunities in deploying LLMs on edge devices.

Through these methods, the paper aims to provide a comprehensive solution for efficiently fine-tuning and deploying large language models on resource-constrained edge devices.
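
To make the back-propagation-free idea behind MEF2T more concrete, the following is a minimal sketch of one zeroth-order update. The summary does not say which estimator the paper uses, so this assumes a common two-point (SPSA-style) estimate in which the random perturbation is regenerated from a seed instead of being stored. The names `zo_sgd_step`, `quadratic_loss`, and `run_demo`, the toy target, and the hyperparameters are all illustrative assumptions, not taken from the paper.

```python
import random


def zo_sgd_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order SGD step, updating `params` in place.

    Only forward evaluations of `loss_fn` are used. The perturbation z is
    never stored: it is regenerated from `seed` whenever it is needed.
    """
    def perturb(scale):
        rng = random.Random(seed)  # same seed -> same z on every call
        for i in range(len(params)):
            params[i] += scale * rng.gauss(0.0, 1.0)

    perturb(+eps)                 # params = theta + eps * z
    loss_plus = loss_fn(params)
    perturb(-2.0 * eps)           # params = theta - eps * z
    loss_minus = loss_fn(params)
    perturb(+eps)                 # back to theta (up to float rounding)

    # Scalar finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD update along z, regenerating z once more from the same seed.
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] -= lr * projected_grad * rng.gauss(0.0, 1.0)
    return projected_grad


def quadratic_loss(params):
    """Toy stand-in for a model loss (hypothetical target values)."""
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))


def run_demo(steps=500):
    """Run ZO-SGD on the toy loss; return (initial_loss, final_loss)."""
    params = [0.0, 0.0, 0.0]
    initial = quadratic_loss(params)
    for step in range(steps):
        zo_sgd_step(params, quadratic_loss, seed=step)
    return initial, quadratic_loss(params)


if __name__ == "__main__":
    initial, final = run_demo()
    print(f"loss before: {initial:.4f}, loss after: {final:.3e}")
```

Each step needs two forward passes and a handful of scalars beyond the parameters themselves, with no stored gradients or optimizer moments. That is the source of the memory saving, and the price is a noisier update than a first-order optimizer would produce.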