Abstract: Neural network-based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones

What problem does this paper attempt to address?

The paper addresses the difficulty of choosing a pretrained model (backbone) for a computer vision system. The choice strongly affects the performance of the resulting system, yet the available backbones differ in pretraining algorithm, pretraining dataset, and architecture, so practitioners have little basis for an informed decision. Through the "Battle of the Backbones (BoB)" benchmark, the paper runs a large-scale comparison of popular pretrained models and offers a reference for selecting a backbone that suits a given task. Its main objectives are:

1. **Benchmarking**: Systematically comparing a diverse suite of pretrained models across computer vision tasks, from classification to object detection to OOD generalization.
2. **Guiding Research Directions**: Exposing the strengths and weaknesses of existing pretraining methods and architectures to point out promising directions for future research.
3. **Performance Analysis**: Analyzing how pretraining method, pretraining dataset size, and architecture affect performance, including apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets.
4. **Resource Provision**: Releasing the raw experimental results and code so that other researchers can reproduce the study and evaluate their own backbones.

Together, these objectives aim to reduce the uncertainty practitioners face when selecting a pretrained model and to support further progress in computer vision.
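To make the benchmarking objective concrete, here is a minimal sketch of one way to turn per-task results into a single ranking: standardize each task's scores (z-scores) and average them per backbone. The text above does not say how BoB aggregates its results, so this aggregation is an assumption. The function name `zscore_ranking`, the backbone names, and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, pstdev


def zscore_ranking(scores):
    """Rank backbones by their mean per-task z-score.

    scores maps task -> {backbone: metric}, where a higher metric is
    better and every task reports the same set of backbones.
    (Assumed aggregation scheme; not taken from the paper.)
    """
    per_backbone_z = {}
    for task, results in scores.items():
        values = list(results.values())
        mu = mean(values)
        sigma = pstdev(values)  # population std over the backbones in this task
        for name, value in results.items():
            # If all backbones tie on a task, the task carries no ranking signal.
            z = 0.0 if sigma == 0 else (value - mu) / sigma
            per_backbone_z.setdefault(name, []).append(z)
    averaged = {name: mean(zs) for name, zs in per_backbone_z.items()}
    # Best backbone first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only; NOT results from the paper.
example_scores = {
    "classification": {"backbone_a": 80.0, "backbone_b": 70.0, "backbone_c": 60.0},
    "detection": {"backbone_a": 42.0, "backbone_b": 45.0, "backbone_c": 33.0},
}

ranking = zscore_ranking(example_scores)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

Standardizing per task keeps a metric with a large numeric range (for example accuracy in percent) from dominating one with a small range (for example mAP), which is the main reason to prefer it over averaging raw scores.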

Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
