Abstract: Since the introduction of GPT2-1.5B in 2019, large language models (LLMs) have transitioned from specialized models to versatile foundation models. LLMs exhibit impressive zero-shot ability, but they require fine-tuning on local datasets and significant resources for deployment. Traditional fine-tuning techniques based on first-order optimizers require substantial GPU memory that exceeds the capability of mainstream hardware, which motivates the investigation of memory-efficient methods. Model compression techniques can reduce energy consumption, operational costs, and environmental impact, thereby supporting sustainable artificial intelligence advancements. Additionally, large-scale foundation models have expanded to create images, audio, videos, and multi-modal content, further emphasizing the need for efficient deployment. Therefore, we present a comprehensive overview of the prevalent memory-efficient fine-tuning methods over the network edge. We also review the state-of-the-art literature on model compression to provide a vision for deploying LLMs over the network edge.

What problem does this paper attempt to address?

This paper addresses the problems that large language models (LLMs) face in fine-tuning and deployment on edge devices. Specifically, it focuses on the following key issues:

1. **High memory requirements**: Traditional fine-tuning methods require a large amount of GPU memory, which exceeds the capabilities of mainstream hardware. For example, when the Adam optimizer is used to fine-tune LLMs, the required memory far exceeds that required for inference. This makes it very difficult to fine-tune and deploy LLMs on resource-constrained edge devices.
2. **High computational resource requirements**: LLMs usually contain billions or even hundreds of billions of parameters, resulting in huge computational requirements. Traditional first-order optimizers (such as Adam, AdaGrad, and SGD) need backpropagation to obtain the gradients of the loss function, which further increases the computational cost.
3. **Environmental impact**: The training and deployment of large-scale models consume a large amount of energy, increasing operating costs and environmental impact. Researchers therefore need to develop energy-efficient methods to support sustainable artificial intelligence development.

To solve these problems, the paper presents several methods:

- **Parameter-Efficient Fine-Tuning (PEFT)**: Reduces storage and computational requirements by updating only a small part of the model's weights instead of the entire model. PEFT methods can be divided into parallel PEFT, serial PEFT, and selective PEFT.
- **Memory-Efficient Full Fine-Tuning (MEF2T)**: Develops optimizers that do not require backpropagation, such as zeroth-order (ZO) optimizers, to reduce memory usage. These methods can be implemented on a single client or in a distributed network.
- **Model compression**: Reduces the size of the model through techniques such as pruning, knowledge distillation, and quantization, thereby reducing storage and computational requirements. The paper also discusses how to maintain model performance without retraining.

The main contributions of the paper include:

- **Review of PEFT and MEF2T methods**: Details the applications of these methods in distributed edge networks and verifies their effectiveness through numerical experiments.
- **Classification of model compression methods**: Classifies existing model compression methods into post-compression training, recompression training, and one-time compression, and discusses the advantages and disadvantages of each.
- **Future directions**: Proposes future research directions, especially the challenges and opportunities in deploying LLMs on edge devices.

Through these methods, the paper aims to provide a comprehensive solution for efficiently fine-tuning and deploying large language models on resource-constrained edge devices.
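
To make the idea of a backpropagation-free (ZO) optimizer concrete, here is a minimal sketch of a generic two-point, SPSA-style gradient estimate on a toy problem. It is an illustration under stated assumptions, not the paper's algorithm: the summary does not say which ZO variant the paper uses, and the names `loss`, `zo_sgd_step`, and `run`, the quadratic objective, and the hyperparameters are all hypothetical.

```python
import random

TARGET = [1.0, -2.0, 3.0]  # hypothetical optimum of the toy objective


def loss(w):
    # Toy quadratic that stands in for a model's loss (assumption).
    return sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


def zo_sgd_step(w, loss_fn, rng, lr=0.05, eps=1e-3):
    """One zeroth-order SGD step using two forward evaluations."""
    # Sample a random Gaussian perturbation direction z.
    z = [rng.gauss(0.0, 1.0) for _ in w]
    # Evaluate the loss at w + eps*z and w - eps*z; no backward pass,
    # so no per-parameter gradients or optimizer moments are stored.
    loss_plus = loss_fn([wi + eps * zi for wi, zi in zip(w, z)])
    loss_minus = loss_fn([wi - eps * zi for wi, zi in zip(w, z)])
    # Scalar estimate of the directional derivative along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)
    # Move against the estimated gradient g * z.
    return [wi - lr * g * zi for wi, zi in zip(w, z)]


def run(steps=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        w = zo_sgd_step(w, loss, rng)
    return w


if __name__ == "__main__":
    w = run()
    print("final weights:", w)
    print("final loss:", loss(w))
```

Each step needs only two loss evaluations and one random direction, which is why this family of methods can run on inference-style memory budgets. The trade-off is a noisier update than a true gradient, so more steps are typically needed.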

Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

An Empirical Analysis and Resource Footprint Study of Deploying Large Language Models on Edge Devices

Large Language Models (LLMs): Deployment, Tokenomics and Sustainability

A Review on Edge Large Language Models: Design, Execution, and Applications

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Optimizing Microservice Deployment in Edge Computing with Large Language Models: Integrating Retrieval Augmented Generation and Chain of Thought Techniques

Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks

Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities

Large Language Models Empowered Autonomous Edge AI for Connected Intelligence

A Study of Optimizations for Fine-tuning Large Language Models

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Aggressive Post-Training Compression on Extremely Large Language Models

Toward Democratized Generative AI in Next-Generation Mobile Edge Networks

Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-Tuning

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines