Abstract: Recent advancements in text-to-image generation have inspired researchers to generate datasets tailored for perception models using generative models, which prove particularly valuable in scenarios where real-world data is limited. In this study, our goal is to address the challenges when fine-tuning vision-language models (e.g., CLIP) on generated datasets. Specifically, we aim to fine-tune vision-language models to a specific classification model without access to any real images, also known as name-only transfer. However, despite the high fidelity of generated images, we observed a significant performance degradation when fine-tuning the model using the generated datasets due to the domain gap between real and generated images. To overcome the domain gap, we provide two regularization methods for training and post-training, respectively. First, we leverage the domain-agnostic knowledge from the original pre-trained vision-language model by conducting the weight-space ensemble of the fine-tuned model on the generated dataset with the original pre-trained model at the post-training. Secondly, we reveal that fine-tuned models with high feature diversity score high performance in the real domain, which indicates that increasing feature diversity prevents learning the generated domain-specific knowledge. Thus, we encourage feature diversity by providing additional regularization at training time. Extensive experiments on various classification datasets and various text-to-image generation models demonstrated that our analysis and regularization techniques effectively mitigate the domain gap, which has long been overlooked, and enable us to achieve state-of-the-art performance by training with generated images. Code is available at https://github.com/pmh9960/regft-for-gen

What problem does this paper attempt to address?

The paper addresses the domain gap that arises when vision-language models (such as CLIP) are fine-tuned on generated datasets alone. The objective is to adapt a vision-language model to a specific classification task using only generated images, with no access to real images, a setting known as "name-only transfer". Although the generated images have high fidelity, fine-tuning on them still causes a significant performance drop, mainly because of the domain gap between generated and real images. To close this gap, the paper proposes two regularization methods: one applied during training and one applied after training. The details are as follows:

### 1. Post-training regularization: Weight-space Ensemble

To overcome the domain gap, the authors use the domain-agnostic knowledge of the pre-trained vision-language model (such as CLIP) by forming a weight-space ensemble with the fine-tuned model:

\[ \mathrm{WSE}(\theta_{ZS}, \theta_{FT}) = (1 - \alpha)\cdot\theta_{ZS} + \alpha\cdot\theta_{FT} \]

where:

- \(\theta_{ZS}\) denotes the parameters of the zero-shot CLIP classifier,
- \(\theta_{FT}\) denotes the parameters of the classifier fine-tuned on the generated dataset,
- \(\alpha\) is the mixing coefficient that sets the ratio between the two classifiers; \(\alpha = 0\) recovers the zero-shot model and \(\alpha = 1\) the fine-tuned model.

Through linear interpolation, this method combines the domain-agnostic characteristics of the zero-shot CLIP classifier with the task-specific knowledge learned from the generated dataset.

### 2. Regularization during training: Variance-Covariance Regularization

The authors find that fine-tuned models with higher feature diversity perform better in the real domain. They therefore introduce variance-covariance regularization to increase feature diversity:

\[ L_{VCR} = \lambda_{Var}\cdot\frac{1}{D}\sum_{i=1}^{D}\max\left(0,\, 1-\sqrt{C_{ii}}\right) + \lambda_{Cov}\cdot\frac{1}{D(D-1)}\sum_{i\neq j}C_{ij}^{2} \]

where:

- \(C\) is the covariance matrix of the features in a mini-batch,
- \(D\) is the dimension of the embedded features,
- \(\lambda_{Var}\) and \(\lambda_{Cov}\) are the strengths of the variance and covariance terms, respectively.

The first term penalizes any feature dimension whose standard deviation \(\sqrt{C_{ii}}\) falls below 1, and the second penalizes correlation between different dimensions. The regularizer thus improves feature diversity by increasing the diagonal elements of the covariance matrix and reducing the off-diagonal elements, which in turn improves performance in the real domain.

### Experimental Results

The paper verifies the effectiveness of these methods through extensive experiments. The experiments cover 11 datasets and use three text-to-image generation models (DALL-E, Stable Diffusion 2.1 and Stable Diffusion XL). The results show that the proposed methods significantly outperform existing name-only transfer baselines on multiple datasets and also perform well in few-shot classification tasks.

### Summary

This paper aims to close the domain gap between generated datasets and real data by introducing two regularization methods, thereby enabling effective fine-tuning of vision-language models in the name-only transfer scenario. These methods improve the model's performance in the real domain and demonstrate the potential of generated datasets in data-scarce scenarios.
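As an illustration of the two formulas in the summary above, the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the use of plain lists instead of framework tensors, the unbiased (N - 1) covariance estimate, and the default values of `lambda_var` and `lambda_cov` are all assumptions made for this example.

```python
import math


def weight_space_ensemble(theta_zs, theta_ft, alpha):
    """Interpolate two parameter sets: (1 - alpha) * theta_zs + alpha * theta_ft.

    Both arguments are dicts mapping a parameter name to a flat list of floats
    (a stand-in for a model state dict; this layout is an assumption).
    """
    if theta_zs.keys() != theta_ft.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [(1 - alpha) * z + alpha * f
               for z, f in zip(theta_zs[name], theta_ft[name])]
        for name in theta_zs
    }


def covariance_matrix(features):
    """Covariance of a mini-batch given as N rows of D features each.

    Uses the unbiased 1 / (N - 1) normalization (an assumption), so N >= 2.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[i] for row in features) / n for i in range(d)]
    centered = [[row[i] - means[i] for i in range(d)] for row in features]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def vcr_loss(features, lambda_var=1.0, lambda_cov=1.0):
    """Variance-covariance regularization as written in the summary (D >= 2).

    lambda_var and lambda_cov default to placeholder values, not tuned ones.
    """
    c = covariance_matrix(features)
    d = len(c)
    # Variance term: hinge on the per-dimension standard deviation sqrt(C_ii).
    var_term = sum(max(0.0, 1.0 - math.sqrt(c[i][i])) for i in range(d)) / d
    # Covariance term: mean squared off-diagonal entry of C.
    cov_term = sum(
        c[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    ) / (d * (d - 1))
    return lambda_var * var_term + lambda_cov * cov_term
```

In this sketch, a batch whose features have collapsed to a single point is penalized only through the variance term, while a batch whose two dimensions are perfectly correlated with large variance is penalized only through the covariance term. In a real training loop the loss would be computed on differentiable tensors and added to the classification loss.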

Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models
