Blockwise Self-Supervised Learning at Scale

Shoaib Ahmed Siddiqui, David Krueger, Yann LeCun, Stéphane Deny
2024-08-11
Abstract: Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure that independently trains the 4 main blocks of layers of a ResNet-50, with the Barlow Twins loss function applied at each block, performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1 percentage points below the accuracy of an end-to-end pretrained network (71.57%). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of training deep neural networks on large-scale datasets with local learning rules instead of end-to-end backpropagation. Specifically, it explores whether large deep networks can be trained through blockwise self-supervised learning, aiming to reduce the dependence on long backpropagation paths and thereby improve biological plausibility, reduce energy consumption, and make the training of large-scale networks more efficient.

The main contributions of the paper include:

1. Demonstrating that a ResNet-50 model trained using blockwise self-supervised learning performs almost on par with a model trained with end-to-end backpropagation on ImageNet, with only about a 1.1% difference in top-1 classification accuracy.
2. Finding that training all blocks simultaneously (rather than sequentially) is crucial for achieving this performance level, indicating that the forward path facilitates learning interactions between blocks during training.
3. Exploring the impact of different spatial and feature pooling strategies on the outputs of intermediate blocks, and discovering that expanding the feature dimensions of block outputs is key to successfully training the network.
4. Evaluating various customized training strategies, such as adjusting the trade-off parameters in the objective function and applying different image distortions to different blocks. Although these strategies did not significantly improve performance in the current experiments, they are considered promising directions for future research.

In summary, by introducing the blockwise self-supervised learning method, this paper provides new insights into the effective application of local learning rules on large-scale datasets. This has important implications for understanding the learning mechanisms of the brain, designing more efficient hardware, and developing adaptive computing systems.
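
To make the idea of a Barlow Twins loss applied at each block more concrete, here is a minimal NumPy sketch, not the authors' implementation. It uses the standard Barlow Twins objective (drive the cross-correlation matrix of two views' embeddings toward the identity) and evaluates it once per block on spatially mean-pooled block outputs. The function names, the `lambda_offdiag` default of 5e-3, and the use of plain mean pooling are assumptions made for illustration; projector heads, the feature-expanding pooling variants the summary mentions, and the gradient isolation between blocks are all omitted.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-12):
    """Barlow Twins loss for two (batch, features) embedding matrices.

    lambda_offdiag is the trade-off weight on the redundancy-reduction term;
    the default here is an illustrative assumption.
    """
    n = z_a.shape[0]
    # Standardize each feature over the batch dimension.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (features, features).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy-reduction term
    return float(on_diag + lambda_offdiag * off_diag)


def blockwise_losses(block_outputs_a, block_outputs_b, lambda_offdiag=5e-3):
    """One loss per block, from (batch, channels, height, width) activations.

    Hypothetical helper: each block's output is mean-pooled over its spatial
    dimensions and scored on its own, so no loss depends on a later block.
    """
    losses = []
    for out_a, out_b in zip(block_outputs_a, block_outputs_b):
        pooled_a = out_a.mean(axis=(2, 3))
        pooled_b = out_b.mean(axis=(2, 3))
        losses.append(barlow_twins_loss(pooled_a, pooled_b, lambda_offdiag))
    return losses
```

In a real blockwise training loop, each of these per-block losses would update only its own block's parameters, with the input to each block treated as a constant so that no gradient flows back into earlier blocks. That step requires an automatic differentiation framework and is not shown here.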