Abstract: Recent research on fine-tuning vision-language models has demonstrated impressive performance in various downstream tasks. However, the challenge of obtaining accurately labeled data in real-world applications poses a significant obstacle during the fine-tuning process. To address this challenge, this paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models. DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels. The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class. The positive prompt seeks to reveal distinctive features of the class, while the negative prompt serves as a learnable threshold for separating clean and noisy samples. We employ parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts. As a general framework, DeFT can seamlessly fine-tune many pre-trained models to downstream tasks by utilizing carefully selected clean samples. Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the difficulty of obtaining accurately labeled data in real-world applications, particularly when fine-tuning vision-language models. Specifically, it proposes a framework called DeFT (Denoising Fine-Tuning) that adapts vision-language models while filtering out noisy labels. DeFT leverages the strong alignment between text and visual features pre-trained on large-scale auxiliary image-text pairs, and learns positive and negative text prompts for each category to detect noisy labels. Positive prompts aim to reveal the unique characteristics of the category, while negative prompts serve as a learnable threshold that separates clean samples from noisy ones.

### Main Contributions

1. **Proposing the DeFT Framework**: DeFT is a simple yet effective framework for handling noisy labels. It has the following advantages:
   - Instance dependency (no need for information from the entire training data)
   - Robustness to various types of noisy labels
   - Applicability to multiple pre-trained models
2. **Extensive Experimental Validation**: The paper conducts experiments on multiple synthetic and real-world noisy datasets, demonstrating DeFT's superior performance in noisy label detection and image classification tasks.
3. **In-depth Empirical Analysis**: The paper provides a detailed empirical analysis to help explain why DeFT is effective and to serve as a reference for future research.

### Solutions

1. **Noisy Label Detection**:
   - **Dual Prompt Strategy**: Design positive and negative text prompts for each category, and identify noisy labels by comparing the similarity between the image features and the positive prompt with the similarity between the image features and the negative prompt.
   - **Optimizing the Noisy Label Detector**: Optimize the positive and negative prompts by constructing positive and negative samples for a binary classification task, further reducing the impact of noisy labels on the learned representation.
2. **Model Adaptation**:
   - **Fine-tuning with Clean Samples**: After identifying clean samples in the first stage, use these samples for full fine-tuning (FFT) to further enhance visual recognition performance.
   - **Linear Classifier**: Learn a linear classifier, fine-tuned on the selected clean samples, to improve the model's performance in downstream tasks.

### Experimental Results

The paper conducts experiments on multiple synthetic and real-world noisy datasets, including CIFAR-100, Tiny-ImageNet, Stanford-Cars, and CUB-200-2011. The results show that DeFT performs well in both noisy label detection and image classification, significantly outperforming other sample selection strategies.

### Conclusion

By proposing the DeFT framework, the paper addresses the challenge of fine-tuning vision-language models when accurately labeled data is hard to obtain. DeFT leverages pre-trained multimodal information and, through a dual prompt strategy followed by a model adaptation stage, detects and handles noisy labels effectively, offering ideas and methods for future research.
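
As a rough illustration of the dual prompt detection rule described under Solutions, the sketch below flags a sample as clean when its image feature is more similar to the positive prompt of its given label than to that label's negative prompt. This is a minimal sketch, not the authors' implementation: the function name `detect_clean`, the use of cosine similarity, and the toy two-dimensional features are assumptions made for illustration, and in practice the features would come from the pre-trained encoders and the prompts would be learned.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def detect_clean(image_feats, pos_prompts, neg_prompts, labels):
    """Hypothetical helper: return a boolean mask, True where the label looks clean.

    image_feats: (n_samples, dim) image features
    pos_prompts: (n_classes, dim) positive prompt features, one per class
    neg_prompts: (n_classes, dim) negative prompt features, one per class
    labels:      (n_samples,) given (possibly noisy) integer labels
    """
    img = l2_normalize(image_feats)
    pos = l2_normalize(pos_prompts)
    neg = l2_normalize(neg_prompts)
    # Similarity to the positive and negative prompt of each sample's given label.
    sim_pos = np.sum(img * pos[labels], axis=1)
    sim_neg = np.sum(img * neg[labels], axis=1)
    # Assumed decision rule: the negative prompt acts as a per-class threshold.
    return sim_pos > sim_neg


# Toy data (made up): two classes in a 2-D feature space.
pos_prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
neg_prompts = np.array([[1.0, 1.0], [1.0, 1.0]])
image_feats = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 0.1]])
labels = np.array([0, 1, 1])  # the third label is deliberately wrong

clean_mask = detect_clean(image_feats, pos_prompts, neg_prompts, labels)
print(clean_mask)  # the third sample is flagged as noisy
```

The samples where `clean_mask` is true would then form the subset used in the model adaptation stage.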

Vision-Language Models are Strong Noisy Label Detectors

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

Learning with Noisy Labels Via Self-supervised Adversarial Noisy Masking

VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness

Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?

Pre-Trained Vision-Language Models as Partial Annotators

Tuning Vision-Language Models with Candidate Labels by Prompt Alignment

Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise

Noise-Robust Fine-Tuning of Pretrained Language Models via External Guidance

Lipsum-FT: Robust Fine-Tuning of Zero-Shot Models Using Random Text Guidance

Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Light-weight Fine-tuning Method for Defending Adversarial Noise in Pre-trained Medical Vision-Language Models

A survey of efficient fine-tuning methods for Vision-Language Models — Prompt and Adapter

Improved Visual Fine-tuning with Natural Language Supervision

Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

Anchor-based Robust Finetuning of Vision-Language Models

Learning to Decompose Visual Features with Latent Textual Prompts

Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks