Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein
2023-11-20
Abstract: Neural network-based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the difficulty of choosing a pretrained model (backbone) for a computer vision system. The choice strongly affects a system's efficiency and performance, yet the many available backbones differ in pretraining algorithm, dataset, and architecture, so practitioners have little basis for comparing them. Through the "Battle of the Backbones (BoB)" project, the paper compares a wide range of popular pretrained models at scale, giving practitioners a reference for selecting a backbone that suits their task. Specifically, the main objectives of the paper are:

1. **Benchmarking**: Providing a systematic comparison framework by benchmarking a large number of pretrained models across different computer vision tasks.
2. **Guiding Research Directions**: Revealing the strengths and limitations of existing pretraining methods and architectures to suggest directions for future research.
3. **Performance Analysis**: Analyzing how pretraining method, dataset scale, and model architecture affect performance, especially when the other factors are held equal.
4. **Resource Provision**: Publishing experimental results and code so that other researchers can reproduce and extend the study.

Through these objectives, the paper aims to reduce the uncertainty practitioners face when selecting pretrained models and to promote further progress in computer vision.
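To make the benchmarking objective above concrete, here is a minimal sketch of one way per-task results could be combined into a single ranking. It assumes z-score normalization within each task followed by averaging. That aggregation rule is an illustrative assumption, not something this summary confirms as the paper's procedure. The backbone names, the `rank_backbones` helper, and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, pstdev

# Hypothetical placeholder scores (higher is better). These are NOT results
# from the paper; they only illustrate the shape of the data.
scores = {
    "classification": {"backbone_a": 80.0, "backbone_b": 70.0, "backbone_c": 75.0},
    "detection": {"backbone_a": 40.0, "backbone_b": 44.0, "backbone_c": 42.0},
    "ood": {"backbone_a": 55.0, "backbone_b": 45.0, "backbone_c": 50.0},
}


def z_scores(task_scores):
    """Normalize one task's scores to zero mean and unit variance."""
    values = list(task_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All backbones tie on this task, so it carries no ranking signal.
        return {name: 0.0 for name in task_scores}
    return {name: (s - mu) / sigma for name, s in task_scores.items()}


def rank_backbones(all_scores):
    """Average per-task z-scores and sort backbones from best to worst."""
    per_task = {task: z_scores(ts) for task, ts in all_scores.items()}
    # Only rank backbones that were evaluated on every task.
    names = set.intersection(*(set(ts) for ts in all_scores.values()))
    avg = {n: mean(per_task[t][n] for t in per_task) for n in names}
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_backbones(scores):
        print(f"{name}: {score:+.3f}")
```

Normalizing within each task keeps a task with a wide score range (for example, accuracy in percent) from dominating one with a narrow range. Other aggregation rules, such as average rank, would be equally reasonable under these assumptions.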