Vision-Language Models are Strong Noisy Label Detectors

Tong Wei,Hao-Tian Li,Chun-Shu Li,Jiang-Xin Shi,Yu-Feng Li,Min-Ling Zhang
2024-09-29
Abstract:Recent research on fine-tuning vision-language models has demonstrated impressive performance in various downstream tasks. However, the challenge of obtaining accurately labeled data in real-world applications poses a significant obstacle during the fine-tuning process. To address this challenge, this paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models. DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels. The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class. The positive prompt seeks to reveal distinctive features of the class, while the negative prompt serves as a learnable threshold for separating clean and noisy samples. We employ parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts. As a general framework, DeFT can seamlessly fine-tune many pre-trained models to downstream tasks by utilizing carefully selected clean samples. Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
Machine Learning,Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve The paper aims to address the challenge of obtaining accurate label data in real-world applications, particularly when fine-tuning vision-language models. Specifically, the paper proposes a framework called DEFT (Denoising Fine-Tuning) to adapt vision-language models to filter out noisy labels. DEFT leverages the strong alignment between pre-trained text and visual features from large-scale auxiliary image-text pairs by learning positive and negative text prompts for each category to detect noisy labels. Positive prompts aim to reveal the unique characteristics of the category, while negative prompts serve as a learnable threshold to separate clean samples from noisy ones. ### Main Contributions 1. **Proposing the DEFT Framework**: DEFT is a simple yet effective framework for handling noisy labels. It has the following advantages: - Instance dependency (no need for information from the entire training data) - Robustness to various types of noisy labels - Applicability to multiple pre-trained models 2. **Extensive Experimental Validation**: The paper conducts experiments on multiple synthetic and real-world noisy datasets, demonstrating DEFT's superior performance in noisy label detection and image classification tasks. 3. **In-depth Empirical Analysis**: Provides detailed empirical analysis to help understand the effectiveness of DEFT and hopes to offer references for future research. ### Solutions 1. **Noisy Label Detection**: - **Dual Prompt Strategy**: Design positive and negative text prompts for each category, identifying noisy labels by calculating the similarity between image features and the positive and negative prompts. - **Optimizing Noisy Label Detector**: Optimize positive and negative prompts by constructing positive and negative samples for a binary classification task, further eliminating the impact of noisy labels on representation. 2. **Model Adaptation**: - **Fine-tuning with Clean Samples**: After identifying clean samples in the first stage, use these samples for full fine-tuning (FFT) to further enhance visual recognition performance. - **Linear Classifier**: Learn a linear classifier, fine-tuning with the selected clean samples to improve the model's performance in downstream tasks. ### Experimental Results The paper conducts experiments on multiple synthetic and real-world noisy datasets, including CIFAR-100, Tiny-ImageNet, Stanford-Cars, CUB-200-2011, etc. The experimental results show that DEFT performs excellently in both noisy label detection and image classification tasks, significantly outperforming other sample selection strategies. ### Conclusion By proposing the DEFT framework, the paper effectively addresses the challenge of obtaining accurate label data in real-world applications, particularly when fine-tuning vision-language models. DEFT leverages pre-trained multimodal information and, through a dual prompt strategy and model adaptation stage, achieves effective detection and handling of noisy labels, providing new ideas and methods for future research.