Abstract: Continuously adapting pre-trained models to local data on resource constrained edge devices is the *last mile* for model deployment. However, as models increase in size and depth, backpropagation requires a large amount of memory, which becomes prohibitive for edge devices. In addition, most existing low power neural processing engines (e.g., NPUs, DSPs, MCUs, etc.) are designed as fixed-point inference accelerators, without training capabilities. Forward gradients, solely based on directional derivatives computed from two forward calls, have been recently used for model training, with substantial savings in computation and memory. However, the performance of quantized training with fixed-point forward gradients remains unclear. In this paper, we investigate the feasibility of on-device training using fixed-point forward gradients, by conducting comprehensive experiments across a variety of deep learning benchmark tasks in both vision and audio domains. We propose a series of algorithm enhancements that further reduce the memory footprint, and the accuracy gap compared to backpropagation. An empirical study on how training with forward gradients navigates in the loss landscape is further explored. Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach.

What problem does this paper attempt to address?

The paper addresses continuous local adaptation and training of models on resource-constrained edge devices. As pre-trained models grow in size and depth, backpropagation needs a large amount of memory to store intermediate activations, which is prohibitive on edge devices. In addition, most existing low-power neural processing engines (such as NPUs, DSPs, and MCUs) are designed as fixed-point inference accelerators and have no training capability. The paper therefore studies the feasibility of on-device training with fixed-point forward gradients, focusing on three issues:

1. **Memory Consumption**: Backpropagation requires a large amount of memory on edge devices, whereas the forward gradient method estimates gradients from only two forward calls and so substantially reduces memory usage.
2. **Computational Efficiency**: Fixed-point operations are more efficient than floating-point operations and are better suited to resource-constrained edge devices.
3. **Model Performance**: The paper evaluates fixed-point forward-gradient training to determine whether it can approach the performance of backpropagation across a variety of deep learning tasks.

To investigate these issues, the authors conducted extensive experiments on benchmark tasks in the vision and audio domains and proposed a series of algorithm enhancements that further reduce the memory footprint and narrow the accuracy gap relative to backpropagation. The results indicate that training with fixed-point forward gradients is a feasible and practical approach to on-device training, reaching performance close to that of backpropagation at low memory and computational cost.
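The two-forward-call idea in item 1 can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the toy quadratic `loss`, the helper name `two_point_estimate`, and all sizes and seeds are assumptions chosen for illustration. The point is that the estimator touches the model only through two loss evaluations, so no intermediate activations have to be kept for a backward pass.

```python
import numpy as np

def loss(w, A, b):
    """Toy quadratic loss 0.5 * ||A w - b||^2, standing in for a model's forward pass."""
    r = A @ w - b
    return 0.5 * float(r @ r)

def two_point_estimate(loss_fn, w, rng, eps=1e-3):
    """Estimate the gradient from two forward calls along a random direction z."""
    z = rng.standard_normal(w.shape)
    # Central difference approximates the directional derivative grad(L) . z
    directional = (loss_fn(w + eps * z) - loss_fn(w - eps * z)) / (2.0 * eps)
    return directional * z, directional, z

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
w = rng.standard_normal(4)
loss_fn = lambda v: loss(v, A, b)

# Analytic gradient, used only to check the estimate (a real device would not have it)
true_grad = A.T @ (A @ w - b)

g_hat, directional, z = two_point_estimate(loss_fn, w, rng)

# A single estimate is noisy; the average over many directions approaches the gradient
avg = np.mean([two_point_estimate(loss_fn, w, rng)[0] for _ in range(2000)], axis=0)
cosine = float(avg @ true_grad / (np.linalg.norm(avg) * np.linalg.norm(true_grad)))
```

A single estimate is only correct along the sampled direction `z`, which is why the variance of the estimator, and not its memory cost, is the main practical concern.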
### Formula Summary

- **Definition of Forward Gradient**:
  \[ g(w)=(\nabla f(w)\cdot z)\,z \]
  where $f$ is the objective function and $z\in\mathbb{R}^n$ is a random perturbation vector drawn from $p(z)$ whose components $z_i$ are independent and identically distributed with mean 0 and variance 1.
- **SPSA Gradient Estimation**:
  \[ \hat{g}(w)=\frac{L(w+\epsilon z)-L(w-\epsilon z)}{2\epsilon}\,z \]
  where $L$ is the loss, $z\sim\mathcal{N}(0, I_n)$, and $\epsilon$ is a small constant (for example, $10^{-3}$). The fraction is a finite-difference approximation of the directional derivative $\nabla L(w)\cdot z$, so the estimator has the same form as the forward gradient defined above.
- **Sign-m-SPSA Gradient Estimation**:
  \[ \hat{g}(w)=\frac{1}{m}\sum_{i=1}^{m}\text{sign}\big(L(w+\epsilon z_i)-L(w-\epsilon z_i)\big)\,z_i \]
  where $z_1,\dots,z_m$ are $m$ independently sampled perturbation vectors.
- **Quantized Weight Update**:
  \[ w_{t+1}=w_t-\eta\,\hat{g}_f \]
  where $\eta$ is the learning rate and $\hat{g}_f$ is the quantized forward gradient estimated by Sign-m-SPSA.

These formulas show how gradient estimation and the weight update can be carried out in fixed-point arithmetic, enabling efficient on-device training.
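The Sign-m-SPSA estimate and the quantized weight update can be sketched as follows. This is a hedged illustration, not the paper's quantization scheme: the int16 weight and int8 gradient formats, the power-of-two scales (`W_FRAC_BITS`, `G_FRAC_BITS`), the learning rate $2^{-6}$ (`LR_SHIFT`), and the toy quadratic loss are all assumptions. For brevity the forward pass runs in floating point on dequantized weights; only the gradient and the update are kept in fixed point.

```python
import numpy as np

W_FRAC_BITS = 12  # assumed weight format: int16 with scale 2**-12
G_FRAC_BITS = 5   # assumed gradient format: int8 with scale 2**-5
LR_SHIFT = 6      # assumed learning rate eta = 2**-6

def dequantize(w_q):
    """Map int16 fixed-point weights back to real values."""
    return w_q.astype(np.float64) / (1 << W_FRAC_BITS)

def sign_m_spsa(loss_fn, w, rng, m=4, eps=1e-3):
    """(1/m) * sum_i sign(L(w + eps z_i) - L(w - eps z_i)) * z_i."""
    g = np.zeros(w.shape)
    for _ in range(m):
        z = rng.standard_normal(w.shape)
        g += np.sign(loss_fn(w + eps * z) - loss_fn(w - eps * z)) * z
    return g / m

def quantize_grad(g):
    """Round the gradient estimate to int8 with scale 2**-G_FRAC_BITS."""
    return np.clip(np.round(g * (1 << G_FRAC_BITS)), -127, 127).astype(np.int8)

def fixed_point_step(w_q, g_q):
    """w_{t+1} = w_t - eta * g_f, computed entirely on integers."""
    # With power-of-two scales, eta * g_f in weight units is an integer multiple of g_q
    shift = W_FRAC_BITS - G_FRAC_BITS - LR_SHIFT  # equals 1 for the values above
    delta = g_q.astype(np.int32) * (1 << shift)   # same effect as a left shift
    return np.clip(w_q.astype(np.int32) - delta, -32768, 32767).astype(np.int16)

# Toy objective: pull the weights toward a fixed target vector
target = np.linspace(-1.0, 1.0, 8)
loss_fn = lambda w: 0.5 * float(np.sum((w - target) ** 2))

rng = np.random.default_rng(0)
w_q = np.zeros(8, dtype=np.int16)
initial_loss = loss_fn(dequantize(w_q))

for _ in range(600):
    g = sign_m_spsa(loss_fn, dequantize(w_q), rng)
    w_q = fixed_point_step(w_q, quantize_grad(g))

final_loss = loss_fn(dequantize(w_q))
```

Because only the sign of the loss difference is used, the step size depends on $\eta$ and $m$ but not on the magnitude of the loss, which keeps the update within a fixed integer range.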

Stepping Forward on the Last Mile

Condense: A Framework for Device and Frequency Adaptive Neural Network Models on the Edge.

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Poor Man's Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach

Efficient On-device Training via Gradient Filtering

Efficient Training Convolutional Neural Networks on Edge Devices with Gradient-pruned Sign-symmetric Feedback Alignment

Exploring the Use of Synthetic Gradients for Distributed Deep Learning Across Cloud and Edge Resources.

EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization

On-Device Training Under 256KB Memory

Gradient-Free Neural Network Training on the Edge

3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

Towards Efficient Compact Network Training on Edge-Devices

AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments

Adaptive Precision Training for Resource Constrained Devices

Enabling Binary Neural Network Training on the Edge

Enabling Incremental Training with Forward Pass for Edge Devices

Eliminating Communication Bottlenecks in Cross-Device Federated Learning with In-Network Processing at the Edge

Accelerating DNN Training in Wireless Federated Edge Learning Systems

Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

Enabling Deep Learning on Edge Devices through Filter Pruning and Knowledge Transfer

Enabling Deep Learning on Edge Devices