Abstract: General-purpose foundation models have led to recent breakthroughs in artificial intelligence (AI). In remote sensing, self-supervised learning (SSL) and masked image modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable to retrieval and zero-shot applications because they lack language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing, which aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pretraining data, we leverage data scaling, which converts heterogeneous annotations into a unified image-caption data format based on box-to-caption (B2C) and mask-to-box (M2B) conversion. By further incorporating unmanned aerial vehicle (UAV) imagery, we produce a larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark that tests object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art (SOTA) method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, RemoteCLIP outperforms the contrastive language-image pretraining (CLIP) baseline by up to 6.39% average accuracy on 12 downstream datasets.
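The data-scaling idea in the abstract (M2B followed by B2C) can be illustrated with a minimal sketch. The abstract does not give the conversion rules, so everything below is an assumption made for illustration: the function names `mask_to_box` and `boxes_to_caption`, the caption template, and the naive pluralization are hypothetical and are not the authors' implementation.

```python
from collections import Counter

def mask_to_box(mask):
    """M2B sketch: tight box (x_min, y_min, x_max, y_max) around the
    nonzero pixels of a 2D mask given as a list of rows; None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(cols), rows[0], max(cols), rows[-1])

def boxes_to_caption(labels):
    """B2C sketch: build a rule-based caption from the class labels of the
    boxes in one image. The template is a hypothetical stand-in."""
    counts = Counter(labels)
    if not counts:
        return "An aerial image."
    # Naive pluralization (append "s"); assumed good enough for a sketch.
    parts = [f"{n} {name}{'s' if n > 1 else ''}"
             for name, n in sorted(counts.items())]
    return "An aerial image containing " + ", ".join(parts) + "."

# Usage: a segmentation mask becomes a box, and box labels become a caption.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
example_box = mask_to_box(example_mask)
example_caption = boxes_to_caption(["ship", "airplane", "ship"])
```

Chaining the two steps lets segmentation and detection datasets feed the same image-caption format, which is the unification the abstract refers to.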

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

How Much Can CLIP Benefit Vision-and-Language Tasks?

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision