Abstract: Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. However, directly applying ZO methods on edge devices is impractical due to the high computational cost of multiple model perturbations required to achieve accuracy improvements. Based on these observations, we propose a memory- and computation-efficient LLM fine-tuning method for edge devices. Our approach has three key innovations: (1) We introduce a parallelized randomized gradient estimation (P-RGE) technique that achieves high parallel efficiency by leveraging outer-loop and inner-loop parallelization. This enables multiple function queries and forward passes to be executed in parallel, reducing training time. (2) We integrate P-RGE with parameter-efficient fine-tuning methods (e.g., LoRA) to further reduce computational and memory overhead. (3) We implement a P-RGE LoRA-FA module that fully supports fine-tuning with ExecuTorch. Our approach requires no modifications to ExecuTorch's runtime code, as it can be implemented with server-side code changes only. Experiments demonstrate that P-RGE achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications.

What problem does this paper attempt to address?

The main problem this paper addresses is: **how to efficiently fine-tune large language models (LLMs) on resource-constrained edge devices, enabling personalized applications while protecting user privacy**. LLMs are currently pre-trained and fine-tuned on large cloud servers. To provide a personalized user experience while protecting sensitive data, the ideal solution is to fine-tune directly on edge devices (such as smartphones, wearables, and other Internet of Things devices). This raises two challenges:

1. **Limited memory and computing resources**: Edge devices have far less memory and compute than cloud servers, which makes conventional fine-tuning methods difficult to support.
2. **Lack of an effective local training framework**: Existing inference engines (such as ExecuTorch) are built mainly for inference and do not support the functions training requires, such as automatic differentiation and back-propagation.

To address these problems, the authors propose a method based on zeroth-order (ZO) optimization, which approximates gradients using forward passes only, and introduce three key innovations:

1. **Parallelized randomized gradient estimation (P-RGE)**: Outer-loop and inner-loop parallelization lets multiple function queries and forward passes execute in parallel, reducing training time.
2. **Integration with parameter-efficient fine-tuning methods (such as LoRA)**: This further reduces computational and memory overhead, making fine-tuning feasible on resource-constrained devices.
3. **A P-RGE LoRA-FA module**: The module fully supports fine-tuning with ExecuTorch and requires no modification to ExecuTorch's runtime code; it can be implemented with server-side code changes only.
These innovations make it possible to fine-tune LLMs on edge devices. According to the paper's experiments, the method yields runtime speedups and memory savings while also improving fine-tuning accuracy, paving the way for real-time, on-device LLM applications.

### Key Formulas

1. **Zeroth-order gradient estimation formula**:

   \[ \hat{\nabla}L(\theta; B)=\frac{1}{q} \sum_{i = 1}^{q}\left[\frac{L(\theta+\epsilon z_{i}; B)-L(\theta-\epsilon z_{i}; B)}{2\epsilon}\, z_{i}\right] \]

   where \(z_{i}\sim N(0, I_{d})\) is a random perturbation direction, \(q\) is the number of function queries, \(\epsilon>0\) is the perturbation scale, and \(B\) denotes the batch of data on which the loss \(L\) is evaluated. The scalar finite difference is multiplied by \(z_{i}\), not divided by it, so each query contributes a vector along its perturbation direction.

2. **Parameter update formula**:

   \[ \theta_{l}\leftarrow\theta_{l}-\eta\left(\frac{1}{q} \sum_{i = 1}^{q}g_{i}z_{i}\right) \]

   where \(\theta_{l}\) is the parameter being updated, \(\eta\) is the learning rate, and \(g_{i}=\frac{L(\theta+\epsilon z_{i}; B)-L(\theta-\epsilon z_{i}; B)}{2\epsilon}\) is the projected gradient of the \(i\)-th query.

With these methods, the paper addresses the main technical obstacles to fine-tuning LLMs on edge devices and offers a practical route to on-device personalization.
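The two formulas above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy quadratic loss, the function names (`rge_gradient`, `zo_sgd`, `toy_loss`), and all hyperparameter values are assumptions chosen for illustration, and the `q` queries run in a plain sequential loop, whereas P-RGE is described as executing them in parallel.

```python
import numpy as np


def rge_gradient(loss_fn, theta, q=4, eps=1e-3, rng=None):
    """Randomized gradient estimate built from 2*q loss evaluations.

    Averages g_i * z_i over q queries, where
    g_i = (L(theta + eps*z_i) - L(theta - eps*z_i)) / (2*eps).
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(q):
        z = rng.standard_normal(theta.shape)  # z_i ~ N(0, I_d)
        # Projected gradient: a scalar central finite difference along z.
        g = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
        grad += g * z
    return grad / q


def zo_sgd(loss_fn, theta, lr=0.05, steps=300, q=4, eps=1e-3, seed=0):
    """Plain SGD that uses only loss evaluations (no back-propagation)."""
    rng = np.random.default_rng(seed)
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * rge_gradient(loss_fn, theta, q=q, eps=eps, rng=rng)
    return theta


# Hypothetical toy problem standing in for an LLM loss on a batch B.
target = np.array([1.0, -2.0, 0.5])


def toy_loss(theta):
    return float(np.sum((theta - target) ** 2))


theta0 = np.zeros(3)
theta_final = zo_sgd(toy_loss, theta0)
print("initial loss:", toy_loss(theta0))
print("final loss:  ", toy_loss(theta_final))
```

Each update needs only forward evaluations of the loss, which is why an inference-only engine can in principle run it. Because the `q` queries are independent of one another, they are natural candidates for batching into a single forward pass.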

Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines

PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs

Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-Tuning

EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices

Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity

Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing

Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

Efficient Fine-Tuning of BERT Models on the Edge

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Resource Allocation for Stable LLM Training in Mobile Edge Computing

mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs

LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

Understanding the Performance and Estimating the Cost of LLM Fine-Tuning