Abstract: The gradients used to train neural networks are typically computed using backpropagation. While an efficient way to obtain exact gradients, backpropagation is computationally expensive, hinders parallelization, and is biologically implausible. Forward gradients are an approach to approximate the gradients from directional derivatives along random tangents computed by forward-mode automatic differentiation. So far, research has focused on using a single tangent per step. This paper provides an in-depth analysis of multi-tangent forward gradients and introduces an improved approach to combining the forward gradients from multiple tangents based on orthogonal projections. We demonstrate that increasing the number of tangents improves both approximation quality and optimization performance across various tasks.
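
To make the single-tangent case concrete, here is a minimal sketch of a forward gradient, assuming the common formulation in which the estimate is the directional derivative along a standard-normal tangent `v`, multiplied by `v`. The `Dual` class is a hand-rolled stand-in for forward-mode automatic differentiation, and the objective `f` and all function names are hypothetical illustrations, not taken from the paper.

```python
import random


class Dual:
    """Number carrying a value and its derivative along one tangent direction."""

    def __init__(self, val, tan=0.0):
        self.val = val
        self.tan = tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a * b)' = a' * b + a * b'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def f(x):
    """Toy objective (hypothetical): f(x) = sum_i x_i^2 + x_0 * x_1."""
    return sum(xi * xi for xi in x) + x[0] * x[1]


def directional_derivative(f, x, v):
    """One forward pass yields grad f(x) . v without forming grad f(x)."""
    return f([Dual(xi, vi) for xi, vi in zip(x, v)]).tan


def forward_gradient(f, x, rng):
    """Single-tangent forward gradient (grad f(x) . v) * v with v ~ N(0, I)."""
    v = [rng.gauss(0.0, 1.0) for _ in x]
    d = directional_derivative(f, x, v)
    return [d * vi for vi in v]


rng = random.Random(0)
x = [1.0, 2.0, 3.0]
g = forward_gradient(f, x, rng)  # one noisy estimate of the gradient at x
```

Under this assumption a single estimate is noisy but correct in expectation, which is why averaging many of them approaches the true gradient. No backward pass is involved at any point.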

What problem does this paper attempt to address?

The paper asks: **How can neural network training be optimized with multi-tangent forward gradients, so as to overcome the disadvantages of backpropagation such as high computational cost, difficulty in parallelization, and biological implausibility?**

Although backpropagation obtains exact gradients efficiently, it has the following problems:

1. **High computational cost**: The backward pass takes roughly twice as long as the forward pass, so it accounts for a large share of training time and energy consumption.
2. **Difficult to parallelize**: The dependencies in backpropagation lead to sub-optimal memory access patterns, which hinder parallelization.
3. **Biologically implausible**: Biological neural networks have no comparable reverse path for transmitting update information.

To address these problems, the paper proposes a forward-gradient method based on multiple tangents. Forward gradients approximate the gradient from directional derivatives along random tangent vectors, computed with forward-mode automatic differentiation, so no backward pass is needed. Existing research has mainly used a single tangent vector. This paper analyzes multi-tangent forward gradients in depth and introduces an improved method, based on orthogonal projection, for combining the forward gradients from multiple tangents. The results show that increasing the number of tangent vectors improves both approximation quality and optimization performance.

### Main research questions

The paper aims to answer the following research questions:

1. **RQ1**: Can using multiple tangent vectors improve forward gradients?
2. **RQ2**: How should forward-gradient information from multiple tangent vectors be combined?
3. **RQ3**: Can multi-tangent forward gradients be extended to state-of-the-art architectures?
4. **RQ4**: What are the trade-offs of using multiple tangent vectors?

### Solutions

The methods proposed in the paper include:

- Using multiple random tangent vectors to approximate the gradient, which improves the approximation quality.
- Introducing an orthogonal projection method for combining multiple forward gradients, which reduces the approximation error.
- Verifying the performance of multi-tangent forward gradients experimentally on different tasks, including optimizing closed-form functions and training neural networks.

Through these methods, the paper demonstrates the potential of multi-tangent forward gradients to improve approximation quality and optimization performance, and it points to a new direction for further research.
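
The combination step can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it compares a plain average of the per-tangent forward gradients (an assumed baseline, including its `1/k` scaling) with an orthogonal projection of the gradient onto the span of the tangents, written as `V (V^T V)^{-1} V^T grad`. The vector `grad` is a random stand-in for a true gradient and is used only to produce the directional derivatives `d`, which forward-mode automatic differentiation would supply in practice. All names are hypothetical.

```python
import numpy as np


def averaged_forward_gradient(V, d):
    """Mean of the single-tangent forward gradients d_i * v_i (columns of V)."""
    return V @ d / V.shape[1]


def projected_forward_gradient(V, d):
    """Orthogonal projection of the gradient onto span(V).

    Solves (V^T V) c = d, so that V c = V (V^T V)^{-1} V^T grad.
    Only the directional derivatives d = V^T grad are needed.
    """
    c = np.linalg.solve(V.T @ V, d)
    return V @ c


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
n, k = 64, 16                      # parameter count and number of tangents
grad = rng.standard_normal(n)      # stand-in for the true gradient (assumption)
V = rng.standard_normal((n, k))    # k random tangents, one per column
d = V.T @ grad                     # directional derivatives along each tangent

g_avg = averaged_forward_gradient(V, d)
g_proj = projected_forward_gradient(V, d)

print("cosine similarity, average:   ", cosine(g_avg, grad))
print("cosine similarity, projection:", cosine(g_proj, grad))
```

Both estimates lie in the span of the tangents, and the projection is by construction the closest point of that span to the true gradient, so its direction is at least as well aligned as the average. The price is solving a `k × k` linear system per step, which is one of the trade-offs raised in RQ4.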

Beyond Backpropagation: Optimization with Multi-Tangent Forward Gradients

Gradient Descent: The Ultimate Optimizer

Automatic Differentiation-Based Multi-Start for Gradient-Based Optimization Methods

Feed-Forward Optimization With Delayed Feedback for Neural Networks

Gradient Adversarial Training of Neural Networks

On Training Implicit Models

Gradient Correction Beyond Gradient Descent

Towards Differentiable Multilevel Optimization: A Gradient-Based Approach

BackPACK: Packing more into backprop

Stabilizing Backpropagation Through Time to Learn Complex Physics

Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Gradient Flossing: Improving Gradient Descent through Dynamic Control of Jacobians

The Hessian by blocks for neural network by backward propagation

Adaptive Stochastic Conjugate Gradient Optimization for Backpropagation Neural Networks

Learning Gradient Descent: Better Generalization and Longer Horizons

Learning with Local Gradients at the Edge

Alternating Differentiation for Optimization Layers

DANTE: Deep alternations for training neural networks

Gradient Descent based Optimization Algorithms for Deep Learning Models Training

Accelerated Gradient-free Neural Network Training by Multi-convex Alternating Optimization

A Multi-task Learning Approach by Combining Derivative-Free and Gradient Methods