SeaMo: A Multi-Seasonal and Multimodal Remote Sensing Foundation Model

Xuyang Li, Danfeng Hong, Chenyu Li, Jocelyn Chanussot
2024-12-26
Abstract: Remote Sensing (RS) data contains a wealth of multi-dimensional information crucial for Earth observation. Owing to its vast volume, diverse sources, and temporal properties, RS data is highly suitable for the development of large Visual Foundation Models (VFMs). VFMs act as robust feature extractors, learning from extensive RS data, and are subsequently fine-tuned for deployment in various geoscientific tasks. However, current VFMs in the RS domain are predominantly pretrained and tailored exclusively for specific characteristics of RS imagery, neglecting the potential of utilizing the multi-dimensional properties of RS data. Therefore, in this work, we propose SeaMo, a pioneering visual foundation model that integrates multi-seasonal and multimodal information in the RS field. SeaMo is designed to harness multiple properties of RS data. Within the masked image modeling framework, we employ non-aligned cropping techniques to extract spatial properties, use multi-source inputs for multimodal integration, and incorporate temporal-multimodal fusion blocks for effective assimilation of multi-seasonal data. SeaMo explicitly models the multi-dimensional properties of RS data, making the model more comprehensive, robust, and versatile. We applied SeaMo to several downstream geoscience tasks, which demonstrated exceptional performance. Extensive ablation studies were conducted to validate the model's superiority.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to use multi-source satellite data effectively while explicitly modeling the spatial and temporal characteristics of remote sensing data. Specifically, existing Visual Foundation Models (VFMs) have the following problems when processing remote sensing data:

1. **Lack of Geoscience Attributes**: Some VFMs in the remote sensing field are derived from computer vision techniques and usually apply only a simple conversion of remote sensing data to fit the requirements of those algorithms. As a result, the inherent geoscience characteristics of remote sensing data are insufficiently explored, which limits feature extraction ability.
2. **Single-Dimensional Data Modeling**: Most VFMs, although they acknowledge the unique characteristics of remote sensing data, focus on a single dimension only, such as spatial or temporal attributes. For example, SeCo and CaCo use Contrastive Learning (CL) to capture time-invariant features at specific geographical locations, while RingMo, SatMAE, and others perform Masked Image Modeling (MIM) for small objects or images with different resolutions.
3. **Non-Explicit Multi-Dimensional Data Attribute Modeling**: Some VFMs consider multi-dimensional attributes, but during modeling and pre-training they stack and assemble these attributes rather than modeling them explicitly from a remote sensing and geoscience perspective. This often leads to shallow integration of multi-dimensional data and fails to exploit the synergy among spatial, temporal, and spectral characteristics.

To solve these problems, the paper proposes SeaMo, a foundation model that aims to effectively integrate multi-seasonal and multimodal information in remote sensing data.
By introducing unaligned cropping techniques, multi-source inputs, and spatio-temporal fusion blocks, SeaMo can handle the multi-dimensional characteristics of remote sensing data more comprehensively, robustly, and flexibly.

### Specific Problems and Solutions

1. **Integration of Multi-Source Data**:
   - **Problem**: Different sensing methods capture different interactions between electromagnetic radiation and Earth-surface materials, and each provides a distinct dataset. Optical data provide detailed spectral information, while SAR data characterize the geometry, roughness, and electrical properties of objects. Because these data sources are highly heterogeneous, simple data conversion and combination cannot fully exploit this information.
   - **Solution**: SeaMo uses a unified encoder to process multi-source data and promotes information fusion through the self-attention mechanism, thereby achieving deeper multimodal feature extraction.
2. **Effective Modeling of Spatio-Temporal Data**:
   - **Problem**: The spatio-temporal structure of remote sensing data makes it possible to analyze physical geographical changes and land use, but because of low spatial resolution and long observation intervals, simple techniques struggle to capture temporal dynamics.
   - **Solution**: SeaMo designs a spatio-temporal fusion block with a cross-attention mechanism, which integrates multiple spatio-temporal data streams and enhances the learning of time-invariant representations.
3. **Explicit Modeling of Multi-Dimensional Data Attributes**:
   - **Problem**: Existing models mostly stack and assemble attributes during modeling and pre-training, and fail to exploit the synergy among spatial, temporal, and spectral characteristics.
   - **Solution**: SeaMo explicitly models the characteristics of multi-seasonal and multimodal data through the partially overlapping cropping technique and the masked image modeling framework, so that the model learns the data along multiple dimensions.

Through these improvements, SeaMo performs well on multiple downstream tasks, and the ablation experiments support its design.
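The partially overlapping cropping and masked image modeling steps described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`overlapping_crops`, `overlap_area`, `random_patch_mask`), the 120-pixel image size, the 96-pixel crop, the 24-pixel maximum shift, the 16-pixel patch size, and the 0.75 mask ratio are all assumptions chosen for illustration. The sketch only shows the idea of drawing two crops of the same scene that overlap without being pixel-aligned, then hiding a random subset of patches for reconstruction.

```python
import random


def overlapping_crops(height, width, crop, max_shift, rng):
    """Pick two equal-sized crop boxes (top, left, bottom, right) that
    overlap partially but are not pixel-aligned. Hypothetical helper."""
    if not 0 < max_shift < crop:
        raise ValueError("max_shift must lie in (0, crop) so the crops overlap")
    if crop + max_shift > min(height, width):
        raise ValueError("image too small for this crop size and shift")
    # Place the first crop so that the shifted second crop stays in bounds.
    top = rng.randint(0, height - crop - max_shift)
    left = rng.randint(0, width - crop - max_shift)
    # A vertical shift of at least one pixel guarantees misalignment.
    dy = rng.randint(1, max_shift)
    dx = rng.randint(0, max_shift)
    box_a = (top, left, top + crop, left + crop)
    box_b = (top + dy, left + dx, top + dy + crop, left + dx + crop)
    return box_a, box_b


def overlap_area(box_a, box_b):
    """Area in pixels shared by two (top, left, bottom, right) boxes."""
    rows = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    cols = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(0, rows) * max(0, cols)


def random_patch_mask(crop, patch, mask_ratio, rng):
    """Return sorted indices of the patches hidden from the encoder when a
    crop x crop image is split into patch x patch tiles (row-major order)."""
    n_patches = (crop // patch) ** 2
    n_masked = int(n_patches * mask_ratio)
    return sorted(rng.sample(range(n_patches), n_masked))


# Assumed setup: two acquisitions of the same location (for example two
# seasons or two sensors), each 120 x 120 pixels, cropped independently.
rng = random.Random(0)
box_a, box_b = overlapping_crops(height=120, width=120, crop=96,
                                 max_shift=24, rng=rng)
masked_a = random_patch_mask(crop=96, patch=16, mask_ratio=0.75, rng=rng)
masked_b = random_patch_mask(crop=96, patch=16, mask_ratio=0.75, rng=rng)

print("crop A:", box_a, "crop B:", box_b)
print("shared pixels:", overlap_area(box_a, box_b))
print("masked patches in A:", len(masked_a), "of", (96 // 16) ** 2)
```

Because the two crops share only part of their footprint, a model trained to reconstruct the masked patches cannot rely on a one-to-one pixel correspondence between the inputs, which is one plausible reading of why non-aligned cropping encourages spatial reasoning.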