Rethink Long-tailed Recognition with Vision Transformers

Zhengzhuo Xu, Shuo Yang, Xingjun Wang, Chun Yuan
DOI: https://doi.org/10.48550/arXiv.2302.14284
2023-04-17
Abstract: In the real world, data tends to follow long-tailed distributions w.r.t. class or attribution, motivating the challenging Long-Tailed Recognition (LTR) problem. In this paper, we revisit recent LTR methods with promising Vision Transformers (ViT). We figure out that 1) ViT is hard to train with long-tailed data. 2) ViT learns generalized features in an unsupervised manner, like mask generative training, either on long-tailed or balanced datasets. Hence, we propose to adopt unsupervised learning to utilize long-tailed data. Furthermore, we propose the Predictive Distribution Calibration (PDC) as a novel metric for LTR, where the model tends to simply classify inputs into common classes. Our PDC can measure the model calibration of predictive preferences quantitatively. On this basis, we find many LTR approaches alleviate it slightly, despite the accuracy improvement. Extensive experiments on benchmark datasets validate that PDC reflects the model's predictive preference precisely, which is consistent with the visualization.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses Long-Tailed Recognition (LTR), the problem of training visual recognition models on data that follows a long-tailed distribution. In real-world data, a few categories (head categories) have very many samples, while most categories (tail categories) have very few. This imbalance leads a model to overfit to the head categories and generalize poorly to the tail categories, so training a well-performing model on long-tailed data has become an important research topic. The main contributions of the paper are:
1. It finds that Vision Transformers (ViT) are hard to train on long-tailed data, and that unsupervised pre-training substantially mitigates this difficulty.
2. It proposes Predictive Distribution Calibration (PDC), a new metric that quantifies a model's predictive preference on long-tailed data. PDC measures the distance between the model's prediction distribution and the target distribution, and so shows more directly whether the model is biased towards predicting the common head categories.
3. It analyzes a range of LTR methods on ViT through extensive experiments and verifies the effectiveness of PDC: the metric reflects the model's predictive bias accurately and is consistent with the visualization results.
These contributions help in understanding and addressing the challenges that long-tailed data poses for visual recognition.
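To make the idea of a distance between the prediction distribution and the target distribution concrete, here is a minimal Python sketch. It is not the paper's PDC formula, which this summary does not state. It assumes KL divergence as the distance, and the names `class_distribution`, `kl_divergence`, and `prediction_bias` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter


def class_distribution(labels, num_classes, smoothing=1e-6):
    """Empirical class frequencies, lightly smoothed so no entry is zero."""
    counts = Counter(labels)
    total = len(labels) + smoothing * num_classes
    return [(counts.get(c, 0) + smoothing) / total for c in range(num_classes)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def prediction_bias(predictions, targets, num_classes):
    """Assumed stand-in for a PDC-style score (not the paper's definition):
    KL divergence from the target label distribution to the distribution
    of predicted labels. Zero means the model predicts each class as often
    as it occurs; larger values mean a stronger preference for some classes.
    """
    target_dist = class_distribution(targets, num_classes)
    pred_dist = class_distribution(predictions, num_classes)
    return kl_divergence(target_dist, pred_dist)


# Balanced evaluation set with three classes (class 0 plays the head class).
targets = [0, 0, 1, 1, 2, 2]
head_biased = [0, 0, 0, 0, 1, 2]   # over-predicts class 0
matched = [0, 1, 1, 0, 2, 2]       # same class frequencies as the targets

biased_score = prediction_bias(head_biased, targets, num_classes=3)
matched_score = prediction_bias(matched, targets, num_classes=3)
print(biased_score > matched_score)
```

Note that the `matched` predictions contain per-sample errors yet score zero, because this sketch only compares how often each class is predicted. That is the sense in which such a score complements accuracy rather than replacing it.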