Text-Guided Mixup Towards Long-Tailed Image Categorization

Richard Franklin, Jiawei Yao, Deyang Zhong, Qi Qian, Juhua Hu

2024-09-05

Abstract: In many real-world applications, the frequency distribution of class labels for training data can exhibit a long-tailed distribution, which challenges traditional approaches of training deep neural networks that require heavy amounts of balanced data. Gathering and labeling data to balance out the class label distribution can be both costly and time-consuming. Many existing solutions that enable ensemble learning, re-balancing strategies, or fine-tuning applied to deep neural networks are limited by the inert problem of few class samples across a subset of classes. Recently, vision-language models like CLIP have been observed as effective solutions to zero-shot or few-shot learning by grasping a similarity between vision and language features for image and text pairs. Considering that large pre-trained vision-language models may contain valuable side textual information for minor classes, we propose to leverage text supervision to tackle the challenge of long-tailed learning. Concretely, we propose a novel text-guided mixup technique that takes advantage of the semantic relations between classes recognized by the pre-trained text encoder to help alleviate the long-tailed problem. Our empirical study on benchmark long-tailed tasks demonstrates the effectiveness of our proposal with a theoretical guarantee. Our code is available at https://github.com/rsamf/text-guided-mixup.
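
The image-text similarity that the abstract attributes to models like CLIP can be illustrated with a small sketch. It assumes that image and class-text features have already been extracted; random NumPy vectors stand in for real encoder outputs, and the function names and the temperature value are hypothetical choices for illustration, not taken from the paper or from any CLIP library.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_predict(image_features, text_features, temperature=0.01):
    """Score each image against each class text embedding (illustrative only)."""
    img = l2_normalize(image_features)
    txt = l2_normalize(text_features)
    logits = img @ txt.T / temperature           # shape: (num_images, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Stand-in features: each "image" is a slightly perturbed copy of a class text vector.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(3, 8))
image_features = text_features[[2, 0]] + 0.01 * rng.normal(size=(2, 8))
pred, probs = zero_shot_predict(image_features, text_features)
```

Because no labeled images are needed to score a class, a scheme like this can still rank classes that have very few training samples, which is the property the abstract builds on.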

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the performance bias that arises when an image classifier is trained on a long-tailed dataset, where a few head classes have many samples and the tail classes have very few. It proposes a Text-Guided Mixup technique that uses the text encoder of a pre-trained CLIP model: the semantic relations between classes, as recognized by the text encoder, guide how training samples are mixed. This textual information compensates for the shortage of tail-class samples during training and improves overall classification performance. Experiments on multiple long-tailed datasets show that the method outperforms existing approaches.
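
One way to picture a mixup guided by text similarity is sketched below. This is not the authors' implementation; the source only states that semantic relations between classes from the text encoder guide the mixing. The pairing rule (sampling a partner class with probability proportional to a softmax over text-embedding cosine similarity), the temperature, and every function name are assumptions made for illustration, and random vectors stand in for CLIP text features.

```python
import numpy as np

def partner_probabilities(text_features, temperature=0.1):
    # Assumed pairing rule: softmax over cosine similarity between class text embeddings.
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    sim = txt @ txt.T
    np.fill_diagonal(sim, -np.inf)               # never pair a class with itself
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def text_guided_mixup(x, y, text_features, num_classes, alpha=1.0, rng=None):
    """Mix each sample with a sample from a semantically close class (hypothetical sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    probs = partner_probabilities(text_features)
    by_class = {c: np.flatnonzero(y == c) for c in range(num_classes)}
    lam = rng.beta(alpha, alpha)                 # standard mixup coefficient
    mixed_x = np.empty_like(x, dtype=float)
    mixed_y = np.zeros((len(x), num_classes))
    for i, c in enumerate(y):
        partner_class = rng.choice(num_classes, p=probs[c])
        candidates = by_class[partner_class]
        # Fall back to the sample itself if the partner class is absent from the batch.
        j = rng.choice(candidates) if len(candidates) else i
        mixed_x[i] = lam * x[i] + (1.0 - lam) * x[j]
        mixed_y[i, c] += lam                     # soft label: own class
        mixed_y[i, y[j]] += 1.0 - lam            # soft label: partner class
    return mixed_x, mixed_y, lam

# Toy long-tailed batch: class 0 is the head class, classes 1-3 have one sample each.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(4, 16))         # stand-in for CLIP text features
x = rng.normal(size=(6, 5))
y = np.array([0, 0, 0, 1, 2, 3])
mixed_x, mixed_y, lam = text_guided_mixup(x, y, text_features, num_classes=4, rng=rng)
```

In this sketch each soft label sums to one, so the mixed batch can be fed to a standard cross-entropy loss with soft targets.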

Mix from Failure: Confusion-Pairing Mixup for Long-Tailed Recognition

MLTU: mixup long-tail unsupervised zero-shot image classification on vision-language models

Enhanced Long-Tailed Recognition with Contrastive CutMix Augmentation

Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition

The Solution for Language-Enhanced Image New Category Discovery

Mixed Mutual Transfer for Long-Tailed Image Classification

Bt-Vmf Contrastive and Collaborative Learning for Long-Tailed Visual Recognition

ITMix: Image-Text Mix Augmentation for Transferring CLIP to Image Classification

Text as Image: Learning Transferable Adapter for Multi-Label Classification

LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition

Text-Guided Diverse Image Synthesis for Long-Tailed Remote Sensing Object Classification

Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model

Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

DiffCLIP: Few-shot Language-driven Multimodal Classifier

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

NCL++: Nested Collaborative Learning for long-tailed visual recognition

Increasing Oversampling Diversity for Long-Tailed Visual Recognition

Balanced Contrastive Learning for Long-Tailed Visual Recognition

Diverse and Tailored Image Generation for Zero-shot Multi-label Classification