Abstract: Vast amounts of remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing critical spatial, temporal, and spectral information that is essential for addressing global-scale challenges such as land use monitoring, disaster prevention, and environmental change mitigation. Despite various pre-training methods tailored to the characteristics of RS data, a key limitation persists: the inability to effectively integrate spatial, temporal, and spectral information within a single unified model. To unlock the potential of RS data, we construct a Spatial-Temporal-Spectral Structured Dataset (STSSD) characterized by the incorporation of multiple RS sources, diverse coverage, unified locations within image sets, and heterogeneity within images. Building upon this structured dataset, we propose an Anchor-Aware Masked AutoEncoder method (A$^{2}$-MAE), which leverages the intrinsic complementary information from different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase. A$^{2}$-MAE integrates an anchor-aware masking strategy and a geographic encoding module to comprehensively exploit the properties of RS images. Specifically, the proposed anchor-aware masking strategy dynamically adapts the masking process based on the meta-information of a pre-selected anchor image, thereby facilitating training on images captured by diverse types of RS sources within one model. Furthermore, we propose a geographic encoding method to leverage accurate spatial patterns, enhancing the model's generalization capabilities for downstream applications that are generally location-related. Extensive experiments demonstrate that our method achieves comprehensive improvements over existing RS pre-training methods across various downstream tasks, including image classification, semantic segmentation, and change detection.

SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation

Masked Autoencoders for Point Cloud Self-supervised Learning

LR-MAE: Locate While Reconstructing with Masked Autoencoders for Point Cloud Self-supervised Learning

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

Exploring Masked Autoencoders for Sensor-Agnostic Image Retrieval in Remote Sensing

CephalFormer: Incorporating Global Structure Constraint into Visual Features for General Cephalometric Landmark Detection

SdAE: Self-distillated Masked Autoencoder

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

Robust and Precise Facial Landmark Detection by Self-Calibrated Pose Attention Network

Ensemble Based Constrained-Optimization Extreme Learning Machine For Landmark Recognition

Teaching Masked Autoencoder With Strong Augmentations

Understanding Masked Autoencoders From a Local Contrastive Perspective

SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification

Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing

SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners

Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture

Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts

PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders

A$^{2}$-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders