Abstract: Intelligent interpretation of remote sensing images using deep learning relies heavily on large datasets, and models trained in one domain often struggle when applied to another. Pretraining the backbone network via masked image modeling can effectively reduce this reliance on extensive sample data and thereby lower the obstacles to cross-domain transfer. However, current masked image models typically employ a pure Transformer architecture, which may not fully exploit low-level features. To address these issues, this article proposes masked feature modeling (MFM), a methodology for the generative self-supervised learning of high-resolution remote sensing images that combines convolutional neural network (CNN) and Transformer architectures. This methodology has several advantages: 1) the hybrid CNN + Transformer architecture retains the local feature representation of the CNN architecture while gaining the global information modeling capabilities of the Transformer architecture; 2) the feature extraction network outputs multiscale features, which makes it easier to add upsampling and skip connections to improve accuracy on downstream dense prediction tasks; and 3) the pretrained MFM can be applied to various downstream tasks through fine-tuning with limited samples. The publicly available WHU and Massachusetts Building Datasets are used to verify the effectiveness of the proposed method. Extensive experiments, covering the main properties of MFM for generative self-supervised learning, fine-tuning of MFM on the downstream semantic segmentation task, and comparisons with other state-of-the-art generative self-supervised learning algorithms, show that, by combining the advantages of the CNN and Transformer architectures, the proposed method achieves better feature extraction capability and higher accuracy on downstream tasks such as semantic segmentation.
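As a rough illustration of the masking idea behind masked image modeling in general, the following minimal sketch hides a random subset of patches and computes a reconstruction loss over the hidden positions only. It is not the MFM implementation: the function names, the 0.75 mask ratio, and the use of one scalar per patch as a stand-in for a feature vector are all assumptions made for this example.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch.

    mask_ratio=0.75 is an illustrative default, not a value from the abstract.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]


def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked positions only.

    Each grid cell holds one scalar standing in for a patch feature.
    """
    total, count = 0.0, 0
    for r, row in enumerate(mask):
        for c, is_masked in enumerate(row):
            if is_masked:
                total += (pred[r][c] - target[r][c]) ** 2
                count += 1
    return total / count if count else 0.0
```

In a full pipeline, the visible patches would be passed to the backbone, a decoder would predict the target at the masked positions, and a loss of this kind would drive pretraining.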

Reconstruction Target Matters in Masked Image Modeling for Cross-Domain Few-Shot Learning

When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation

Point Cloud Domain Adaptation Via Masked Local 3D Structure Prediction

TACDFSL: Task Adaptive Cross Domain Few-Shot Learning

Gradient-Guided Channel Masking for Cross-Domain Few-Shot Learning

How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders

Enhancing Information Maximization with Distance-Aware Contrastive Learning for Source-Free Cross-Domain Few-Shot Learning

Stare at What You See: Masked Image Modeling Without Reconstruction

Understanding Masked Autoencoders From a Local Contrastive Perspective

Lightweight Frequency Masker for Cross-Domain Few-Shot Semantic Segmentation

MFAE: Masked Frequency Autoencoders for Domain Generalization Face Anti-spoofing

CL-MAE: Curriculum-Learned Masked Autoencoders

Spectral Decomposition and Transformation for Cross-domain Few-shot Learning

Masked Autoencoders are Parameter-Efficient Federated Continual Learners

Masked Feature Modeling for Generative Self-Supervised Representation Learning of High-Resolution Remote Sensing Images

Masked Image Modeling with Local Multi-Scale Reconstruction

Contrastive Masked Autoencoders are Stronger Vision Learners

LMD: Faster Image Reconstruction with Latent Masking Diffusion

Revisiting Mid-Level Patterns for Cross-Domain Few-Shot Recognition

PR-MIM: Delving Deeper into Partial Reconstruction in Masked Image Modeling

ME-D2N: Multi-Expert Domain Decompositional Network for Cross-Domain Few-Shot Learning