Abstract: Recent advances achieved by deep learning models rely on the independent and identically distributed assumption, hindering their applications in real-world scenarios with domain shifts. To tackle this issue, cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data. However, in visual cross-domain learning, traditional methods concentrate solely on the image modality, disregarding the potential benefits of incorporating the text modality. In this work, we propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners. VLLaVO uses vision-language models to convert images into detailed textual descriptions. A large language model is then finetuned on textual descriptions of the source/target domain generated by a designed instruction template. Extensive experimental results under domain generalization and unsupervised domain adaptation settings demonstrate the effectiveness of the proposed method.

What problem does this paper attempt to address?

The main problem this paper addresses is cross-domain learning in vision: how to improve a model's generalization ability when the visual data exhibits a domain shift. Specifically, the paper targets two issues:

1. **Limitations of existing methods**: Traditional cross-domain learning methods focus mainly on the image modality and ignore the possible benefits of the text modality. These methods cope poorly with differences between domains, especially when there is a large distribution difference between the target domain and the source domain.
2. **Combination of visual and language modalities**: Although large-scale vision-language models (VLMs) have shown significant improvement in tasks such as image classification, their application in cross-domain learning is still limited. Likewise, although large language models (LLMs) perform excellently in natural language processing tasks, their application in pure-visual or vision-language tasks is also limited.

To solve these problems, the paper proposes **VLLaVO** (combining Vision language models and Large Language models as Visual cross-dOmain learners), which alleviates the cross-domain problem in vision by combining VLMs and LLMs. The main contributions of VLLaVO are:

- **Applying LLMs to visual cross-domain learning for the first time**: Images are converted into detailed text descriptions by VLMs, and a specially designed instruction template is used to query the LLM to predict image categories.
- **Utilizing the inherent generalization ability of LLMs**: The LLM is fine-tuned to enhance its instruction-following ability and reduce the interference of irrelevant context, so that it can deal effectively with cross-domain problems.
- **Experimental verification**: Extensive experiments on multiple benchmark datasets show that VLLaVO achieves state-of-the-art performance in both domain generalization (DG) and unsupervised domain adaptation (UDA) tasks, surpassing existing VLM-based methods.

In summary, this paper proposes a new framework that improves the performance of visual cross-domain learning under domain shift by combining the advantages of the visual and language modalities.
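
The text side of the pipeline summarized above (image description in, instruction prompt out, category read back from the LLM's reply) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the template wording and the helper names `build_prompt`, `build_training_example`, and `parse_prediction` are hypothetical and are not taken from the paper, and the description is assumed to have already been produced by some vision-language model.

```python
from typing import Dict, Optional, Sequence

# Hypothetical instruction template; the paper's actual wording is not
# given in this summary.
INSTRUCTION_TEMPLATE = (
    "Image description: {description}\n"
    "Candidate categories: {categories}\n"
    "Question: Which category does the image belong to?\n"
    "Answer:"
)


def build_prompt(description: str, categories: Sequence[str]) -> str:
    """Fill the template with a VLM-generated description and the label set."""
    return INSTRUCTION_TEMPLATE.format(
        description=description.strip(),
        categories=", ".join(categories),
    )


def build_training_example(
    description: str, categories: Sequence[str], label: str
) -> Dict[str, str]:
    """Pair a prompt with its ground-truth label for supervised fine-tuning."""
    if label not in categories:
        raise ValueError(f"label {label!r} is not among the candidate categories")
    return {
        "prompt": build_prompt(description, categories),
        "completion": " " + label,
    }


def parse_prediction(response: str, categories: Sequence[str]) -> Optional[str]:
    """Map a free-form LLM reply back to one of the candidate categories."""
    text = response.lower()
    # Check longer names first so that "polar bear" is preferred over "bear".
    for name in sorted(categories, key=len, reverse=True):
        if name.lower() in text:
            return name
    return None
```

In this sketch, `build_training_example` would be applied to labeled descriptions when fine-tuning, while `build_prompt` and `parse_prediction` would wrap the LLM call at inference time. Substring matching is a deliberately simple choice for reading out the answer; a real system may need stricter parsing.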

VLLaVO: Mitigating Visual Gap through LLMs

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bridging Vision and Language Spaces with Assignment Prediction

Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

VoCo-LLaMA: Towards Vision Compression with Large Language Models

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

An Introduction to Vision-Language Modeling

A-VL: Adaptive Attention for Large Vision-Language Models

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Unified Lexical Representation for Interpretable Visual-Language Alignment

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

Enhancing Advanced Visual Reasoning Ability of Large Language Models

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Visually-Augmented Language Modeling