Abstract: Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To our best knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the limitations of current Remote Sensing Foundation Models (RSFMs) in handling multi-modal time-series data. Existing RSFMs mainly focus on a single modality and do not model temporal or geographical context, which restricts their usefulness across diverse tasks. To overcome these limitations, the paper proposes SkySense, a large-scale Multi-Modal Remote Sensing Foundation Model (MM-RSFM).

### Main contributions

1. **Large-scale multi-modal model**: SkySense has 2.06 billion parameters and is pre-trained on a multi-modal dataset containing 21.5 million temporal sequences of remote-sensing images. The dataset includes High-Spatial-Resolution Optical Images (HSROIs), Temporal Multispectral Imagery (TMsI) and Temporal SAR Imagery (TSARI).
2. **Modular design**: The modules of SkySense can be flexibly combined or used individually to suit tasks ranging from single-modal to multi-modal, static to temporal, and classification to localization.
3. **Multi-modal spatio-temporal encoder**: SkySense adopts a factorized multi-modal spatio-temporal encoder, which extracts spatial features for each modality independently and then fuses the multi-modal time-series data. This design significantly reduces the number of parameters while retaining strong RSI sequence-modeling capability.
4. **Multi-granularity contrastive learning**: SkySense introduces a multi-granularity contrastive learning method that learns representations across different modalities and spatial granularities, supporting diverse task requirements.
5. **Geographical context prototype learning**: To enrich RSI representations with geographical context, SkySense proposes a geo-context prototype learning method, which extracts region-aware prototypes from RSI features in an unsupervised manner.

### Experimental results

SkySense is evaluated on semantic segmentation, object detection, change detection and scene classification. Specifically:

- **Semantic segmentation**: On four datasets (Dyna.-Pla., iSAID, Potsdam, Dyna.-S2), the average mIoU of SkySense is 1.86% higher than that of the previous best model.
- **Object detection**: On three datasets (DIOR, DIOR-R, FAIR1M), the mAP of SkySense reaches 78.73%, 74.27% and 54.57% respectively, exceeding the compared models.
- **Change detection**: On the LEVIR-CD, OSCD and Dyna.-S2 datasets, the F1 scores of SkySense are 92.58%, 60.06% and 15.4/18.0 respectively.
- **Scene classification**: On four datasets (AID, RESISC-45, BEN-S2, fMoW-S2), SkySense achieves the best results on every dataset. In particular, on the AID dataset it reaches 97.68% OA using only 1% of the training data, which is 4.17% higher than the second-best model.

### Summary

SkySense improves the interpretation of remote-sensing images through large-scale multi-modal pre-training, a modular design, multi-granularity contrastive learning and geo-context prototype learning, providing strong support for Earth-observation tasks.
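
To make the idea of contrastive learning at several spatial granularities more concrete, here is a minimal NumPy sketch. It is not the SkySense loss, which this summary does not spell out. It assumes a plain InfoNCE objective and uses average pooling as a stand-in for coarser granularities. The function names (`info_nce`, `pool_tokens`, `multi_granularity_loss`) and the `factors` and `temperature` parameters are hypothetical, chosen only for illustration.

```python
import numpy as np


def info_nce(a, b, temperature=0.1):
    """InfoNCE loss: row i of `a` is the positive for row i of `b`;
    every other row of `b` acts as a negative."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def pool_tokens(tokens, factor):
    """Average-pool a (B, H, W, C) feature map by `factor` along H and W.
    H and W are assumed to be divisible by `factor`."""
    b, h, w, c = tokens.shape
    blocks = tokens.reshape(b, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))


def multi_granularity_loss(feat_u, feat_v, factors=(1, 2, 4)):
    """Average InfoNCE over several pooling factors (illustrative assumption).
    factor=1 compares individual locations; the largest factor compares
    whole-image summaries when it equals the feature-map size."""
    total = 0.0
    for f in factors:
        u = pool_tokens(feat_u, f)
        v = pool_tokens(feat_v, f)
        c = u.shape[-1]
        total += info_nce(u.reshape(-1, c), v.reshape(-1, c))
    return total / len(factors)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two "views" of the same scenes, e.g. features from two modalities (toy data).
    view_u = rng.normal(size=(2, 4, 4, 8))
    view_v = view_u + 0.05 * rng.normal(size=view_u.shape)
    print("loss for aligned views:", multi_granularity_loss(view_u, view_v))
```

In this toy setup, the loss is low when the two feature maps agree location by location and rises when they are unrelated, which is the behaviour a contrastive objective is meant to reward at each granularity.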

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks

CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation

Foundation Models for Remote Sensing and Earth Observation: A Survey

SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model

SpectralGPT: Spectral Remote Sensing Foundation Model

SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery

Generative ConvNet Foundation Model With Sparse Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation

Semantic Relation Model and Dataset for Remote Sensing Scene Understanding

SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

SkyScapes -- Fine-Grained Semantic Understanding of Aerial Scenes

SpectralEarth: Training Hyperspectral Foundation Models at Scale

Bridging Remote Sensors with Multisensor Geospatial Foundation Models

SynRS3D: A Synthetic Dataset for Global 3D Semantic Understanding from Monocular Remote Sensing Imagery

RingMo: A Remote Sensing Foundation Model with Masked Image Modeling

SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

SatensoRF: Fast Satellite Tensorial Radiance Field for Multi-date Satellite Imagery of Large Size

Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery