Blockwise Self-Supervised Learning at Scale

Shoaib Ahmed Siddiqui, David Krueger, Yann LeCun, Stéphane Deny

2024-08-11

Abstract: Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
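For reference, the Barlow Twins objective named in the abstract has the standard form below. The block superscript (k) is notation assumed here to express "one loss per block"; it is not quoted from the paper.

```latex
\mathcal{L}^{(k)}
  = \sum_{i} \bigl(1 - C^{(k)}_{ii}\bigr)^{2}
  + \lambda \sum_{i} \sum_{j \neq i} \bigl(C^{(k)}_{ij}\bigr)^{2},
\qquad
C^{(k)}_{ij}
  = \frac{\sum_{b} z^{A,(k)}_{b,i} \, z^{B,(k)}_{b,j}}
         {\sqrt{\sum_{b} \bigl(z^{A,(k)}_{b,i}\bigr)^{2}} \,
          \sqrt{\sum_{b} \bigl(z^{B,(k)}_{b,j}\bigr)^{2}}}
```

Here z^{A,(k)} and z^{B,(k)} are the batch-centered embeddings that block k produces for two distorted views of the same images, b indexes the batch, i and j index embedding features, and λ weights the redundancy-reduction term against the invariance term. In the blockwise setting each of the four ResNet-50 blocks (k = 1, ..., 4) minimizes its own loss, presumably with gradients stopped at block boundaries so that no error signal crosses from one block into an earlier one.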

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper asks whether local learning rules can replace end-to-end backpropagation when training deep neural networks on large-scale datasets. Specifically, it explores training large deep networks with blockwise self-supervised learning, aiming to reduce the dependence on long backpropagation paths and thereby improve biological plausibility, reduce energy consumption, and make the training of large networks more efficient. The main contributions of the paper are:

1. Demonstrating that a ResNet-50 trained with blockwise self-supervised learning performs almost on par with a model trained with end-to-end backpropagation on ImageNet: the gap in top-1 linear-probe accuracy is about 1.1 percentage points (70.48% versus 71.57%).
2. Finding that training all blocks simultaneously, rather than sequentially, is crucial for reaching this performance level, which indicates that the forward path lets blocks interact during training even though no gradients flow between them.
3. Exploring the impact of different spatial and feature pooling strategies on the outputs of intermediate blocks, and finding that expanding the feature dimension of block outputs is key to training the network successfully.
4. Evaluating several customized training strategies, such as adjusting the trade-off parameter in the objective function and applying different image distortions to different blocks. These strategies did not significantly improve performance in the current experiments, but they are considered promising directions for future research.

In summary, by introducing a blockwise self-supervised learning method, the paper provides new insights into applying local learning rules effectively on large-scale datasets. This has implications for understanding learning mechanisms in the brain, designing more efficient hardware, and developing adaptive computing systems.
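To make the "one loss per block" idea concrete, here is a minimal pure-Python sketch of a Barlow Twins style loss evaluated separately on each block's embeddings. The function names, the list-of-lists data layout, and the default lam=0.005 are illustrative assumptions, not the authors' code. Gradient computation and the stop-gradient at block boundaries are not modeled; a real implementation would rely on an autodiff framework for those.

```python
import math


def standardize(z):
    """Center and scale each feature (column) across the batch (rows)."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for b in range(n):
            # Small epsilon avoids division by zero for constant features.
            out[b][j] = (col[b] - mean) / (std + 1e-12)
    return out


def barlow_twins_loss(z_a, z_b, lam=0.005):
    """Invariance term plus lam times the redundancy-reduction term.

    z_a, z_b: embeddings of two distorted views, shape (batch, features).
    lam: trade-off weight (illustrative default, an assumption here).
    """
    n, d = len(z_a), len(z_a[0])
    za, zb = standardize(z_a), standardize(z_b)
    on_diag = 0.0
    off_diag = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix between views.
            c_ij = sum(za[b][i] * zb[b][j] for b in range(n)) / n
            if i == j:
                on_diag += (1.0 - c_ij) ** 2
            else:
                off_diag += c_ij ** 2
    return on_diag + lam * off_diag


def blockwise_losses(views_a, views_b, lam=0.005):
    """Return one loss per block; each would update only its own block.

    views_a[k], views_b[k]: embeddings produced by block k for each view.
    """
    return [barlow_twins_loss(a, b, lam) for a, b in zip(views_a, views_b)]


if __name__ == "__main__":
    decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    redundant = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
    # Two hypothetical blocks, each seeing identical views for simplicity.
    losses = blockwise_losses([decorrelated, redundant],
                              [decorrelated, redundant])
    print(losses)
```

In this toy input the first block's features are decorrelated and identical across views, so its loss is essentially zero, while the second block's two features are copies of each other and are penalized only through the off-diagonal term. Returning the losses as a list rather than summing them reflects that each block is optimized against its own objective.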

Deeply Supervised Block-Wise Neural Architecture Search

Unlocking Deep Learning: A BP-Free Approach for Parallel Block-Wise Training of Neural Networks

Block-local learning with probabilistic latent representations

Big Self-Supervised Models are Strong Semi-Supervised Learners

RRR-Net: Reusing, Reducing, and Recycling a Deep Backbone Network

HS-ResNet: Hierarchical-Split Block on Convolutional Neural Network

Scaling and Benchmarking Self-Supervised Visual Representation Learning

Self-supervised Pretraining of Visual Features in the Wild

Block-wise Training of Residual Networks via the Minimizing Movement Scheme

Exploring the Limits of Weakly Supervised Pretraining

Res2Net: A New Multi-Scale Backbone Architecture

BlockQNN: Efficient Block-Wise Neural Network Architecture Generation

High-Performance Large-Scale Image Recognition Without Normalization

Large-scale Self-Normalizing Neural Networks

Image Super-Resolution Via Residual Block Attention Networks

Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning

BlockDrop: Dynamic Inference Paths in Residual Networks

Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones

Supervised and Contrastive Self-Supervised In-Domain Representation Learning for Dense Prediction Problems in Remote Sensing

Learning Deep ResNet Blocks Sequentially using Boosting Theory