VLLaVO: Mitigating Visual Gap through LLMs

Shuhao Chen, Yulong Zhang, Weisen Jiang, Jiangang Lu, Yu Zhang
2024-03-17
Abstract: Recent advances achieved by deep learning models rely on the independent and identically distributed assumption, hindering their applications in real-world scenarios with domain shifts. To tackle this issue, cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data. However, in visual cross-domain learning, traditional methods concentrate solely on the image modality, disregarding the potential benefits of incorporating the text modality. In this work, we propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners. VLLaVO uses vision-language models to convert images into detailed textual descriptions. A large language model is then finetuned on textual descriptions of the source/target domain generated by a designed instruction template. Extensive experimental results under domain generalization and unsupervised domain adaptation settings demonstrate the effectiveness of the proposed method.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses visual cross-domain learning: how to keep a model accurate when the visual data seen at test time is shifted relative to the training data. It targets two limitations:

1. **Limitations of existing methods**: Traditional cross-domain learning methods focus on the image modality alone and ignore the potential benefits of the text modality. They struggle to bridge the gap between domains, especially when the target-domain distribution differs substantially from the source domain.
2. **Combination of visual and language modalities**: Large-scale vision-language models (VLMs) have significantly improved tasks such as image classification, but their use in cross-domain learning remains limited. Likewise, large language models (LLMs) excel at natural language processing tasks but have seen limited use in purely visual or vision-language tasks.

To address these limitations, the paper proposes **VLLaVO** (combining Vision language models and Large Language models as Visual cross-dOmain learners), which mitigates the visual domain gap by pairing VLMs with LLMs. Its main contributions are:

- **Applying LLMs to visual cross-domain learning**: VLMs convert images into detailed textual descriptions, and a purpose-built instruction template queries the LLM to predict the image category. The paper presents this as the first application of LLMs to visual cross-domain learning.
- **Utilizing the inherent generalization ability of LLMs**: The LLM is fine-tuned to strengthen its instruction-following ability and to reduce interference from irrelevant context in the descriptions, so that it handles domain shift effectively.
- **Experimental verification**: Across multiple benchmark datasets, the paper reports state-of-the-art performance in both domain generalization (DG) and unsupervised domain adaptation (UDA) settings, surpassing existing VLM-based methods.

In summary, the paper proposes a framework that improves visual cross-domain learning under domain shift by combining the strengths of the visual and language modalities.
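
The description-to-instruction step summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the template wording, the `Sample` type, and the helper names `build_prompt` and `to_finetune_records` are all hypothetical, and the textual descriptions are assumed to have already been produced by some VLM. The sketch only shows how image descriptions and candidate class names might be turned into prompt/completion pairs for fine-tuning an LLM.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One image, represented only by text (hypothetical container)."""
    description: str        # description assumed to come from a VLM
    label: Optional[str]    # class name, or None if the image is unlabeled


def build_prompt(description: str, class_names: List[str]) -> str:
    """Format one description as a classification instruction.

    The wording is an illustrative assumption, not the paper's template.
    """
    options = ", ".join(class_names)
    return (
        "Below is a description of an image.\n"
        f"Description: {description}\n"
        f"Choose the category of the image from: {options}.\n"
        "Answer:"
    )


def to_finetune_records(
    samples: List[Sample], class_names: List[str]
) -> List[Dict[str, str]]:
    """Turn labeled samples into prompt/completion pairs for fine-tuning."""
    records = []
    for sample in samples:
        if sample.label is None:
            # Unlabeled samples are skipped in this sketch; how the paper
            # uses unlabeled target-domain data is not covered here.
            continue
        records.append(
            {
                "prompt": build_prompt(sample.description, class_names),
                "completion": " " + sample.label,
            }
        )
    return records
```

At inference time the same `build_prompt` output would be passed to the fine-tuned LLM and its generated answer matched against the class names. Because the model sees only text, a sketch of a dog and a photo of a dog can yield similar descriptions, which is the intuition behind using the text modality to reduce the domain gap.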