Abstract: Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform domain-specific transfer for domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM. Together, these constitute the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and that our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by $3\%\sim20\%$ in Zero-shot Classification (ZSC), $3\%\sim6\%$ in Remote Sensing Cross-Modal Text-Image Retrieval (RSCTIR), and $4\%\sim5\%$ in Semantic Localization (SeLo) tasks. The dataset and models have been released at \url{https://github.com/om-ai-lab/RS5M}.

Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

HyperCLIP: Adapting Vision-Language models with Hypernetworks

How Much Can CLIP Benefit Vision-and-Language Tasks?

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

Towards Zero-shot Cross-lingual Image Retrieval and Tagging

African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification

Scalable Performance Analysis for Vision-Language Models

Multilingual Vision-Language Pre-training for the Remote Sensing Domain

Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

Multilingual Diversity Improves Vision-Language Representations

Large-scale Bilingual Language-Image Contrastive Learning

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation

Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning

BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Image As a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

Learning the Visualness of Text Using Large Vision-Language Models

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing