Abstract: Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. However, directly applying ZO methods on edge devices is impractical due to the high computational cost of the multiple model perturbations required to achieve accuracy improvements. Based on these observations, we propose a memory- and computation-efficient LLM fine-tuning method for edge devices. Our approach has three key innovations: (1) We introduce a parallelized randomized gradient estimation (P-RGE) technique that achieves high parallel efficiency by leveraging outer-loop and inner-loop parallelization. This enables multiple function queries and forward passes to be executed in parallel, reducing training time. (2) We integrate P-RGE with parameter-efficient fine-tuning methods (e.g., LoRA) to further reduce computational and memory overhead. (3) We implement a P-RGE LoRA-FA module that fully supports fine-tuning with ExecuTorch. Our approach requires no modifications to ExecuTorch's runtime code, as it can be implemented with server-side code changes only. Experiments demonstrate that P-RGE achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications.
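To make the zeroth-order idea in the abstract concrete, the following is a minimal sketch of randomized gradient estimation using only forward evaluations of a loss. It is not the paper's P-RGE implementation: the function names (`rge_gradient`, `zo_sgd`, `quadratic_loss`), the two-point central-difference form, the Gaussian perturbations, and all hyperparameters are assumptions chosen for illustration. The queries are also run sequentially here, whereas the abstract describes executing the function queries and forward passes in parallel.

```python
import random


def rge_gradient(loss_fn, params, num_queries=4, eps=1e-3, seed=0):
    """Estimate a gradient from forward passes only (illustrative two-point form).

    Each query draws a random direction z and evaluates the loss at
    params + eps*z and params - eps*z. No backward pass is needed.
    """
    rng = random.Random(seed)
    grad = [0.0] * len(params)
    for _ in range(num_queries):
        z = [rng.gauss(0.0, 1.0) for _ in params]
        loss_plus = loss_fn([p + eps * zi for p, zi in zip(params, z)])
        loss_minus = loss_fn([p - eps * zi for p, zi in zip(params, z)])
        # Finite-difference estimate of the directional derivative along z.
        scale = (loss_plus - loss_minus) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += scale * zi / num_queries
    return grad


def zo_sgd(loss_fn, params, steps=200, lr=0.05, num_queries=4):
    """Plain SGD driven by the zeroth-order estimate above."""
    params = list(params)
    for step in range(steps):
        grad = rge_gradient(loss_fn, params, num_queries, seed=step)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def quadratic_loss(w):
    """Toy stand-in for a model loss; the target values are arbitrary."""
    target = [1.0, -2.0, 0.5]
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


fitted = zo_sgd(quadratic_loss, [0.0, 0.0, 0.0])
print("final loss:", quadratic_loss(fitted))
```

Each training step in this sketch costs `2 * num_queries` forward evaluations, which illustrates why the abstract calls naive ZO fine-tuning computationally expensive and why batching those evaluations matters on edge hardware.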

Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models

Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

Fine Tuning LLM for Enterprise: Practical Guidelines and Recommendations

Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning

Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes

Rethinking the Instruction Quality: LIFT is What You Need

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures

Federated Data-Efficient Instruction Tuning for Large Language Models

Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines