Abstract: Backpropagation (BP) is the cornerstone of today's deep learning algorithms, but it is inefficient partly because of backward locking: updating the weights of one layer locks the weight updates in the other layers. Consequently, it is challenging to apply parallel computing or a pipeline structure to update the weights in different layers simultaneously. In this paper, we introduce a novel learning structure called associated learning (AL), which modularizes the network into smaller components, each of which has a local objective. Because the objectives are mutually independent, AL can learn the parameters in different layers independently and simultaneously, so it is feasible to apply a pipeline structure to improve the training throughput. Specifically, this pipeline structure reduces the training time complexity from O(nl), the time complexity of training with BP and stochastic gradient descent (SGD), to O(n + l), where n is the number of training instances and l is the number of hidden layers. Surprisingly, even though most of the parameters in AL do not directly interact with the target variable, training deep models with this method yields accuracies comparable to those of models trained with typical BP methods, in which all parameters are used to predict the target variable. Consequently, given the scalability and predictive power demonstrated in the experiments, AL deserves further study to determine better hyperparameter settings, such as activation function selection, learning rate scheduling, and weight initialization, and to accumulate experience, as has been done over the years with the typical BP method. Additionally, our design may inspire new network designs for deep learning. Our implementation is available at https://github.com/SamYWK/Associated_Learning.
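The O(nl) versus O(n + l) claim can be illustrated with a small scheduling sketch. This is not the authors' implementation. It assumes a simplified cost model in which one layer processing one training instance (forward pass, or local update in the pipelined case) costs one time step, and in which each layer runs on its own worker. The function names sequential_steps, pipelined_schedule, and pipelined_steps are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple


def sequential_steps(n: int, l: int) -> int:
    """Time steps under the assumed BP cost model.

    Each instance passes forward through l layers and then backward
    through l layers before the next instance may start, so the total
    grows as O(n * l).
    """
    return n * 2 * l


def pipelined_schedule(n: int, l: int) -> List[List[Tuple[int, int]]]:
    """Pipelined schedule under the assumed cost model (l >= 1).

    Entry t lists the (layer, instance) pairs that run concurrently at
    time step t. Layer k handles instance i at step i + k, which is only
    possible when a layer's update does not wait on later layers
    (the independence of local objectives that the abstract describes).
    """
    schedule = []
    for t in range(n + l - 1):
        active = [(layer, t - layer) for layer in range(l) if 0 <= t - layer < n]
        schedule.append(active)
    return schedule


def pipelined_steps(n: int, l: int) -> int:
    """Number of time steps in the pipelined schedule: n + l - 1, i.e. O(n + l)."""
    return len(pipelined_schedule(n, l))


if __name__ == "__main__":
    n, l = 5, 3
    print(sequential_steps(n, l))  # 30
    print(pipelined_steps(n, l))   # 7
```

Under these assumptions, the gap widens as either n or l grows: the sequential count multiplies the two, while the pipelined count only adds them. Real speedups depend on communication cost and on how evenly the layers are balanced, which this sketch ignores.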

BPPSA: Scaling Back-propagation by Parallel Scan Algorithm

2BP: 2-Stage Backpropagation

Parallelizing non-linear sequential models over the sequence length

FP-MRBP: Fine-grained Parallel MapReduce Back Propagation Algorithm

Pipelined Backpropagation at Scale: Training Large Models without Batches

PaReprop: Fast Parallelized Reversible Backpropagation

Interlocking Backpropagation: Improving depthwise model-parallelism

Unlocking Deep Learning: A BP-Free Approach for Parallel Block-Wise Training of Neural Networks

An Optimized and Energy-Efficient Parallel Implementation of Non-Iteratively Trained Recurrent Neural Networks

A Practical Layer-Parallel Training Algorithm for Residual Networks

DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation

DBP: Discrimination Based Block-Level Pruning for Deep Model Acceleration

Associated Learning: Decomposing End-to-end Backpropagation based on Auto-encoders and Target Propagation

Were RNNs All We Needed?

BP(λ): Online Learning via Synthetic Gradients

Advancing Training Efficiency of Deep Spiking Neural Networks through Rate-based Backpropagation

Towards Scalable and Stable Parallelization of Nonlinear RNNs

Efficient Neural Network Training Via Forward and Backward Propagation Sparsification

Unlocking the Potential of Similarity Matching: Scalability, Supervision and Pre-training

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Going Wider: Recurrent Neural Network with Parallel Cells