Abstract: Recent advances in fine-tuning large-scale vision-language pre-trained models (VL-PTMs) have shown promising results in rapid adaptation to downstream tasks. However, prior research often lacks a comprehensive investigation of out-of-distribution (OOD) generalization. Fine-tuning risks overfitting, especially on few-shot OOD datasets where significant distribution shifts occur between the few-shot training examples and the test sets. Prior work on the robustness of fine-tuning to distribution shifts does not account for the differing characteristics of those shifts and may not effectively handle noisy data with spurious correlations. To address these challenges, we propose Vision-Language Alignment Learning under Affinity and Divergence Principles (VLAD), which adapts VL-PTMs for robust few-shot OOD generalization with theoretical guarantees. Building on the large-scale pre-trained vision-language foundation model CLIP, we use frozen language embeddings as invariant anchors to protect against distribution shifts, while using adapter layers to fine-tune the pre-trained visual features for improved vision-language alignment. In addition, we introduce affinity and divergence principles that further mitigate overfitting during vision-language alignment by increasing class discrimination and suppressing non-causal features. More importantly, we offer theoretical evidence that general language knowledge leads to more robust OOD generalization. Our theoretical analysis also shows that the proposed regularization loss yields a tighter upper bound on the OOD generalization error. Extensive experiments and ablation studies on diverse datasets support the approach and validate our theoretical findings. The code is available at https://github.com/LinLLLL/VLAD.
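
The alignment setup described in the abstract (frozen language embeddings as anchors, with an adapter applied to the visual features) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy vectors stand in for CLIP embeddings, the adapter is a simple residual blend in the style of CLIP-Adapter rather than the adapter used in VLAD, and the names `adapt`, `class_probabilities`, `alpha`, and `temperature` are hypothetical. The affinity and divergence losses are not shown, since the abstract does not specify them.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def adapt(visual_feature, W, alpha=0.2):
    """Blend the frozen visual feature with a linear adapter output.

    Assumption: a residual blend; W is the only trainable parameter here.
    """
    adapted = matvec(W, visual_feature)
    return [(1 - alpha) * x + alpha * a for x, a in zip(visual_feature, adapted)]

def class_probabilities(visual_feature, text_anchors, W, alpha=0.2, temperature=0.01):
    """Softmax over cosine similarities between the adapted visual
    feature and the frozen text anchors (one anchor per class)."""
    v = normalize(adapt(visual_feature, W, alpha))
    logits = [
        sum(a * b for a, b in zip(v, normalize(t))) / temperature
        for t in text_anchors
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy data: two class anchors and one image feature in 3-D.
TEXT_ANCHORS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen, never updated
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # adapter init
probs = class_probabilities([0.9, 0.1, 0.3], TEXT_ANCHORS, IDENTITY)
```

In this sketch only the adapter matrix would receive gradient updates during fine-tuning; the text anchors stay fixed, which is the sense in which they act as invariant reference points.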

CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?

Fine-Tuning Deteriorates General Textual Out-of-Distribution Detection by Distorting Task-Agnostic Features

Vision-Language Alignment Learning Under Affinity and Divergence Principles for Few-Shot Out-of-Distribution Generalization

Enhancing Out-of-distribution Detection Via Diversified Multi-Prototype Contrastive Learning

General-Purpose Multi-Modal OOD Detection Framework

Towards Effective Semantic OOD Detection in Unseen Domains: A Domain Generalization Perspective

Towards Few-shot Out-of-Distribution Detection

Topology-aware Robust Optimization for Out-of-distribution Generalization

Enhancing Outlier Knowledge for Few-Shot Out-of-Distribution Detection with Extensible Local Prompts

AutoFT: Learning an Objective for Robust Fine-Tuning

Self-Calibrated Tuning of Vision-Language Models for Out-of-Distribution Detection

Improving Oriented Object Detection by Scene Classification and Task-Aligned Focal Loss

Feed Two Birds with One Scone: Exploiting Wild Data for Both Out-of-Distribution Generalization and Detection

HyperDPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework

Enhancing Few-Shot Out-of-Distribution Detection with Gradient Aligned Context Optimization

Bridging OOD Detection and Generalization: A Graph-Theoretic View

The Best of Both Worlds: On the Dilemma of Out-of-distribution Detection

Visual Out-of-Distribution Detection in Open-Set Noisy Environments

Anchor-based Robust Finetuning of Vision-Language Models