Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines

Lei Gao, Amir Ziashahabi, Yue Niu, Salman Avestimehr, Murali Annavaram
2024-11-07
Abstract: Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. However, directly applying ZO methods on edge devices is impractical due to the high computational cost of multiple model perturbations required to achieve accuracy improvements. Based on these observations, we propose a memory- and computation-efficient LLM fine-tuning method for edge devices. Our approach has three key innovations: (1) We introduce a parallelized randomized gradient estimation (P-RGE) technique that achieves high parallel efficiency by leveraging outer-loop and inner-loop parallelization. This enables multiple function queries and forward passes to be executed in parallel, reducing training time. (2) We integrate P-RGE with parameter-efficient fine-tuning methods (e.g., LoRA) to further reduce computational and memory overhead. (3) We implement a P-RGE LoRA-FA module that fully supports fine-tuning with ExecuTorch. Our approach requires no modifications to ExecuTorch's runtime code, as it can be implemented with server-side code changes only. Experiments demonstrate that P-RGE achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications.
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper asks **how to fine-tune large language models (LLMs) efficiently on resource-constrained edge devices, enabling personalization while protecting user privacy**. LLMs are currently pre-trained and fine-tuned on large cloud servers. Personalization, however, depends on sensitive user data, so it is preferable to fine-tune directly on edge devices (such as smartphones, wearables, and other Internet of Things devices). Doing so faces two challenges:

1. **Limited memory and computing resources**: Edge devices have far less memory and computing power than cloud servers, making it difficult to support traditional fine-tuning methods.
2. **Lack of an on-device training framework**: Existing inference engines (such as ExecuTorch) are built for inference and do not support the functions training requires, such as automatic differentiation and backpropagation.

To address these challenges, the authors build on zeroth-order (ZO) optimization, which approximates gradients using forward passes only, and introduce three key innovations:

1. **Parallelized randomized gradient estimation (P-RGE)**: Outer-loop and inner-loop parallelization allow multiple function queries and forward passes to be executed in parallel, reducing training time.
2. **Integration with parameter-efficient fine-tuning methods (such as LoRA)**: This further reduces computational and memory overhead, making fine-tuning feasible on resource-constrained devices.
3. **A P-RGE LoRA-FA module**: The module fully supports fine-tuning with ExecuTorch. It requires no modifications to ExecuTorch's runtime code and can be implemented with server-side code changes only.
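One way to picture how the parallelization and LoRA-FA fit together is the NumPy sketch below. It is an illustration under stated assumptions, not the authors' implementation: it assumes a single linear layer, a mean-squared-error loss standing in for the language-modeling loss, and the usual LoRA-FA convention that the down-projection `A` is frozen while only `B` is trained. The function names `perturbed_losses_batched` and `p_rge_step` are hypothetical. All `q` queries (outer loop) and both the `+eps` and `-eps` passes (inner loop) are stacked along a leading axis and evaluated in one batched forward pass.

```python
import numpy as np

def perturbed_losses_batched(x, y, W, A, B, Z, eps):
    """Loss for all 2q perturbed copies of B, computed in one batched pass.

    x: (n, d_in) inputs          y: (n, d_out) targets
    W: (d_in, d_out) frozen base weight
    A: (d_in, r) frozen LoRA projection (assumed LoRA-FA convention)
    B: (r, d_out) trainable LoRA matrix
    Z: (q, r, d_out) Gaussian perturbation directions
    """
    base = x @ W   # shared by every perturbed copy, computed once
    h = x @ A      # shared low-rank activations, computed once
    # Stack the +eps and -eps copies (inner loop) for every query (outer loop).
    B_all = np.concatenate([B + eps * Z, B - eps * Z], axis=0)  # (2q, r, d_out)
    out = base[None] + np.einsum("nr,qro->qno", h, B_all)       # (2q, n, d_out)
    losses = np.mean((out - y[None]) ** 2, axis=(1, 2))         # (2q,) toy MSE
    q = Z.shape[0]
    return losses[:q], losses[q:]

def p_rge_step(x, y, W, A, B, q, eps, lr, rng):
    """One zeroth-order update of B using q two-point queries."""
    Z = rng.standard_normal((q,) + B.shape)
    loss_plus, loss_minus = perturbed_losses_batched(x, y, W, A, B, Z, eps)
    g = (loss_plus - loss_minus) / (2.0 * eps)   # (q,) projected gradients
    return B - lr * np.mean(g[:, None, None] * Z, axis=0)

# Toy problem with made-up sizes.
rng = np.random.default_rng(0)
n, d_in, d_out, r, q = 8, 16, 4, 2, 3
x = rng.standard_normal((n, d_in))
y = rng.standard_normal((n, d_out))
W = rng.standard_normal((d_in, d_out))
A = rng.standard_normal((d_in, r))
B = np.zeros((r, d_out))
B_new = p_rge_step(x, y, W, A, B, q=q, eps=1e-3, lr=1e-3, rng=rng)
```

Because `W` and `A` never change, `x @ W` and `x @ A` are computed once and reused by every perturbed copy. Only the small `B` matrices differ, which is why a forward-only inference engine can, in principle, evaluate all copies in a single call.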
These innovations make it possible to fine-tune LLMs on edge devices. The paper reports runtime speedups and memory savings together with improved fine-tuning accuracy, paving the way for real-time, on-device LLM applications.

### Key Formulas

1. **Zeroth-order gradient estimate**:

\[ \hat{\nabla}L(\theta; B)=\frac{1}{q} \sum_{i = 1}^{q}\left[\frac{L(\theta+\epsilon z_{i}; B)-L(\theta-\epsilon z_{i}; B)}{2\epsilon}\, z_{i}\right] \]

where \(z_{i}\sim N(0, I_{d})\) is a random perturbation direction, \(B\) is the mini-batch, \(q\) is the number of function queries, and \(\epsilon>0\) is the perturbation scale.

2. **Parameter update**:

\[ \theta_{l}\leftarrow\theta_{l}-\eta\left(\frac{1}{q} \sum_{i = 1}^{q}g_{i}z_{i}\right) \]

where \(\eta\) is the learning rate and \(g_{i}=\frac{L(\theta+\epsilon z_{i}; B)-L(\theta-\epsilon z_{i}; B)}{2\epsilon}\) is the projected gradient of the \(i\)-th query, so the update is a gradient-descent step along the estimate above.

Together, these techniques address the memory, compute, and tooling obstacles to fine-tuning LLMs on edge devices.
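The two formulas above can be tried on a toy problem. The following is a minimal sketch, assuming a simple quadratic loss in place of an LLM loss. The names `rge_gradient`, `zo_sgd`, and `toy_loss` are hypothetical, and the hyperparameters are arbitrary. Note that only loss evaluations are used, with no backpropagation.

```python
import numpy as np

def rge_gradient(loss_fn, theta, q, eps, rng):
    """Two-point randomized gradient estimate averaged over q queries."""
    grad = np.zeros_like(theta)
    for _ in range(q):
        z = rng.standard_normal(theta.shape)  # z_i ~ N(0, I_d)
        # Projected gradient g_i: central difference along direction z_i.
        g = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2.0 * eps)
        grad += g * z
    return grad / q

def zo_sgd(loss_fn, theta0, steps=500, q=4, eps=1e-3, lr=0.05, seed=0):
    """Gradient descent driven only by forward evaluations of loss_fn."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= lr * rge_gradient(loss_fn, theta, q, eps, rng)
    return theta

# Toy quadratic loss (an assumption for illustration only).
target = np.array([1.0, -2.0, 0.5])

def toy_loss(theta):
    return float(np.sum((theta - target) ** 2))

theta_final = zo_sgd(toy_loss, np.zeros(3))
print(toy_loss(np.zeros(3)), toy_loss(theta_final))
```

Each query costs two forward passes, so one step costs `2 * q` loss evaluations. This per-step cost is what the parallelization in P-RGE targets.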