Rethink Long-tailed Recognition with Vision Transformers

Zhengzhuo Xu, Shuo Yang, Xingjun Wang, Chun Yuan

DOI: https://doi.org/10.48550/arXiv.2302.14284

2023-04-17

Abstract: In the real world, data tends to follow long-tailed distributions w.r.t. class or attribution, motivating the challenging Long-Tailed Recognition (LTR) problem. In this paper, we revisit recent LTR methods with promising Vision Transformers (ViT). We figure out that 1) ViT is hard to train with long-tailed data. 2) ViT learns generalized features in an unsupervised manner, like mask generative training, either on long-tailed or balanced datasets. Hence, we propose to adopt unsupervised learning to utilize long-tailed data. Furthermore, we propose the Predictive Distribution Calibration (PDC) as a novel metric for LTR, where the model tends to simply classify inputs into common classes. Our PDC can measure the model calibration of predictive preferences quantitatively. On this basis, we find many LTR approaches alleviate it slightly, despite the accuracy improvement. Extensive experiments on benchmark datasets validate that PDC reflects the model's predictive preference precisely, which is consistent with the visualization.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses Long-Tailed Recognition (LTR): visual recognition when the data follows a long-tailed distribution. Real-world data usually has this shape: some categories (tail categories) have very few samples, while others (head categories) have very many. This imbalance causes a model to overfit to the head categories and generalize poorly to the tail categories, so how to train a well-performing model on long-tailed data has become an important research topic. The main contributions of the paper are:
1. It finds that Vision Transformers (ViT) are difficult to train on long-tailed data, and that unsupervised pre-training significantly mitigates this problem.
2. It proposes Predictive Distribution Calibration (PDC), a new metric that quantitatively evaluates a model's predictive preference on long-tailed data. PDC measures the distance between the model's prediction distribution and the target distribution, and so reflects more directly whether the model is biased towards predicting the common head categories.
3. It analyzes different LTR methods on ViT through extensive experiments, which verify the effectiveness of PDC: the metric accurately reflects the model's predictive bias and is consistent with the visualization results.
These contributions help in better understanding and addressing the challenges of long-tailed data in visual recognition tasks.
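The idea of comparing a model's prediction distribution with the target distribution can be illustrated with a minimal sketch. It assumes hard (argmax) predictions and uses KL divergence between class histograms as the distance. Both choices, and the function names `class_distribution` and `prediction_bias`, are assumptions made for illustration; this is not the paper's exact PDC formula.

```python
import math
from collections import Counter
from typing import List, Sequence


def class_distribution(labels: Sequence[int], num_classes: int) -> List[float]:
    """Empirical distribution of class indices over a set of samples."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def prediction_bias(
    predictions: Sequence[int],
    targets: Sequence[int],
    num_classes: int,
    eps: float = 1e-12,
) -> float:
    """KL(target || predicted) between class histograms.

    Illustrative stand-in for a prediction-vs-target distance;
    NOT the paper's PDC definition. 0.0 means the histograms match.
    """
    p = class_distribution(targets, num_classes)
    q = class_distribution(predictions, num_classes)
    # Classes absent from the targets contribute nothing to the sum;
    # eps guards against classes the model never predicts.
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


# Hypothetical balanced test set with three classes (0 = head class).
targets = [0, 0, 1, 1, 2, 2]
# A head-biased model sends most inputs to class 0.
biased_predictions = [0, 0, 0, 0, 1, 2]

matched = prediction_bias(targets, targets, num_classes=3)
biased = prediction_bias(biased_predictions, targets, num_classes=3)
print(matched, biased)
```

Under these assumptions, a model whose predicted-class histogram matches the target histogram scores zero, and the score grows as predictions concentrate on head classes. Because only histograms are compared, the score says nothing about per-sample correctness, so it complements accuracy rather than replacing it.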

Learning Imbalanced Data with Vision Transformers

DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Long-tailed Visual Recognition with Deep Models: A Methodological Survey and Evaluation

Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

Data Level Lottery Ticket Hypothesis for Vision Transformers

ViT-Calibrator: Decision Stream Calibration for Vision Transformer

Boosting Vanilla Lightweight Vision Transformers Via Re-parameterization

LTRL: Boosting Long-tail Recognition via Reflective Learning

Exploring Efficient Few-shot Adaptation for Vision Transformers

Make A Long Image Short: Adaptive Token Length for Vision Transformers

Towards Calibrated Model for Long-Tailed Visual Recognition from Prior Perspective

A Deep Learning Model for Long-Tail Visual Recognition

Effective Vision Transformer Training: A Data-Centric Perspective

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Multi-Tailed Vision Transformer for Efficient Inference

Lite Vision Transformer with Enhanced Self-Attention

Lessons Learned from a Unifying Empirical Study of Parameter-Efficient Transfer Learning (PETL) in Visual Recognition