SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners

Feng Liang, Yangguang Li, Diana Marculescu
2024-01-21
Abstract: Recently, self-supervised Masked Autoencoders (MAE) have attracted unprecedented attention for their impressive representation learning ability. However, the pretext task, Masked Image Modeling (MIM), reconstructs the missing local patches and therefore lacks a global understanding of the image. This paper extends MAE to a fully supervised setting by adding a supervised classification branch, thereby enabling MAE to learn global features from golden labels effectively. The proposed Supervised MAE (SupMAE) exploits only a visible subset of image patches for classification, unlike standard supervised pre-training, where all image patches are used. Through experiments, we demonstrate that SupMAE is not only more training-efficient but also learns more robust and transferable features. Specifically, SupMAE achieves performance comparable to MAE using only 30% of the compute when evaluated on ImageNet with the ViT-B/16 model. SupMAE also outperforms MAE and standard supervised pre-training counterparts in robustness on ImageNet variants and in transfer learning. Code is available at https://github.com/enyac-group/supmae.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
Existing self-supervised masked autoencoders (MAE) learn local features by reconstructing masked image patches, but they lack an understanding of the image's global context. To overcome this limitation, the authors propose the Supervised Masked Autoencoder (SupMAE). By adding a supervised classification branch to MAE, the model can use gold-standard labels during pre-training to learn global features, which improves training efficiency, robustness, and transfer learning capability.

The main contributions of SupMAE are:

1. **First study on the impact of supervised pre-training on MAE**: This is the first work to explore how supervised labels can enhance MAE performance, since gold-standard labels can inform the MAE about the concept it is reconstructing.
2. **Innovative supervised classification design**: Unlike traditional supervised pre-training, SupMAE classifies from only the visible subset of image patches rather than from all patches. This design improves sample efficiency and allows the labels of all inputs to be used during training, not just the masked parts.
3. **Experimental validation of SupMAE's effectiveness**: The authors demonstrate higher training efficiency, stronger robustness, and better transfer learning capability. For example, on ImageNet-1K, SupMAE achieves performance comparable to MAE with only 400 pre-training epochs, whereas MAE requires 1600 epochs. SupMAE also outperforms MAE and traditional supervised pre-training on natural-corruption benchmarks and other ImageNet variants.

In summary, the paper aims to improve self-supervised masked autoencoders by introducing a supervised classification branch, enabling them to learn global features more efficiently and to perform better across a range of visual tasks.
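
The two-branch objective described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). The 75% mask ratio, the mean pooling of visible tokens, the single linear classification head, the identity "encoder", the constant "decoder", and the `cls_weight` loss weight are all assumptions chosen to keep the example short; every function name here is hypothetical.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible, masked) lists."""
    num_visible = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[:num_visible]), sorted(order[num_visible:])


def mean_pool(tokens):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (log-sum-exp stabilised)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def masked_mse(pred, target, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


def supmae_style_loss(patches, label, head_weights,
                      mask_ratio=0.75, cls_weight=1.0, seed=0):
    """Reconstruction loss on masked patches plus a classification loss
    computed from the visible patches only (all choices are assumptions)."""
    rng = random.Random(seed)
    visible, masked = random_mask(len(patches), mask_ratio, rng)

    # Stand-in encoder: identity over the visible patches (assumption).
    encoded = [patches[i] for i in visible]

    # Classification branch: pool visible tokens, apply a linear head.
    pooled = mean_pool(encoded)
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in head_weights]
    cls_loss = cross_entropy(logits, label)

    # Stand-in decoder: predict the pooled vector for every patch (assumption).
    reconstruction = [pooled for _ in patches]
    rec_loss = masked_mse(reconstruction, patches, masked)

    return rec_loss + cls_weight * cls_loss, visible, masked


if __name__ == "__main__":
    data_rng = random.Random(1)
    # 196 patches corresponds to a 14 x 14 grid; 4-dim features are a toy choice.
    patches = [[data_rng.random() for _ in range(4)] for _ in range(196)]
    head = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
    loss, visible, masked = supmae_style_loss(patches, label=2, head_weights=head)
    print(len(visible), len(masked), loss)
```

In this sketch the classifier only ever sees the visible patches, which is the design point the summary highlights: the classification branch adds supervision without forcing the encoder to process the full image.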