Abstract: Recent advances in fine-tuning large-scale vision-language pre-trained models (VL-PTMs) have shown promising results in rapid adaptation to downstream tasks. However, prior research often lacks a comprehensive investigation of out-of-distribution (OOD) generalization. Fine-tuning risks overfitting, especially on few-shot OOD datasets where significant distribution shifts occur between the few-shot training examples and the test sets. Previous research on the robustness of fine-tuning to distribution shifts does not account for the different characteristics of those shifts and may not effectively handle noisy data with spurious correlations. To address these challenges, we propose Vision-Language Alignment Learning under Affinity and Divergence Principles (VLAD), which adapts VL-PTMs for robust few-shot OOD generalization with theoretical guarantees. Building on the large-scale pre-trained vision-language foundation model CLIP, we leverage frozen language embeddings as invariant anchors to protect against distribution shifts, while using adapter layers to fine-tune pre-trained visual features for improved vision-language alignment. In addition, we introduce affinity and divergence principles to further mitigate overfitting during the vision-language alignment process by increasing class discrimination and suppressing non-causal features. More importantly, we offer theoretical evidence highlighting the superiority of general language knowledge in achieving more robust OOD generalization performance. Our theoretical analysis also shows that the proposed regularization loss yields a tighter upper bound on the OOD generalization error. Extensive experiments and ablation studies on diverse datasets substantiate our approach and validate our theoretical findings. The code is available at https://github.com/LinLLLL/VLAD.
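
The anchoring idea in the abstract (frozen language embeddings as fixed class anchors, with only an adapter on the visual features being trained) can be illustrated with a minimal sketch. This is not the VLAD implementation: the residual-adapter form, the blend ratio, the temperature, and all function names (`l2_normalize`, `adapt`, `class_probs`) are assumptions chosen for illustration, and the affinity and divergence regularizers are not shown.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def adapt(visual, weight, ratio=0.2):
    """Hypothetical residual adapter: ReLU(W x) blended with the original feature.

    `weight` stands in for the only trainable parameters; `ratio` is an assumed
    blend factor, not a value taken from the paper.
    """
    transformed = [max(0.0, sum(w * x for w, x in zip(row, visual))) for row in weight]
    return [ratio * t + (1.0 - ratio) * x for t, x in zip(transformed, visual)]

def class_probs(visual, text_anchors, weight, ratio=0.2, temperature=0.01):
    """Softmax over cosine similarities between the adapted visual feature
    and the frozen text anchors (the anchors are never updated)."""
    feat = l2_normalize(adapt(visual, weight, ratio))
    logits = [
        sum(f * a for f, a in zip(feat, l2_normalize(anchor))) / temperature
        for anchor in text_anchors
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: two frozen class anchors and one visual feature in a 3-d space.
text_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = class_probs([0.9, 0.1, 0.2], text_anchors, identity)
```

In this toy setup the visual feature lies closest to the first anchor, so the first class receives the larger probability. In a real pipeline the anchors would come from a frozen CLIP text encoder and the adapter weights would be learned from the few-shot examples.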

Anchor-based Robust Finetuning of Vision-Language Models

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models

Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization

How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?

Towards Calibrated Robust Fine-Tuning of Vision-Language Models

Robust Fine-Tuning of Vision-Language Models for Domain Generalization

Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification

Towards Compatible Fine-tuning for Vision-Language Model Updates

Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Vision-Language Alignment Learning Under Affinity and Divergence Principles for Few-Shot Out-of-Distribution Generalization

CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models

SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning

Lipsum-FT: Robust Fine-Tuning of Zero-Shot Models Using Random Text Guidance

Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

Context-Aware Robust Fine-Tuning

Towards Realistic Unsupervised Fine-tuning with CLIP

CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

Tuning Vision-Language Models with Multiple Prototypes Clustering

VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness