Abstract: Intelligent interpretation of remote sensing images using deep learning relies heavily on large datasets, and models trained in one domain often struggle with cross-domain application. Pretraining the backbone network via masked image modeling can effectively reduce this reliance on extensive sample data, thereby lowering the obstacles to cross-domain transfer. However, current masked image models typically employ a pure Transformer architecture, which may not fully capitalize on low-level features. To address these issues, this article proposes masked feature modeling (MFM), a methodology for the generative self-supervised learning of high-resolution remote sensing images that combines convolutional neural network (CNN) and Transformer architectures. This methodology has several advantages: 1) the hybrid CNN + Transformer architecture retains the local feature representation of the CNN architecture while also gaining the global context modeling capability of the Transformer architecture; 2) the feature extraction network outputs multiscale features, which makes it easier to add upsampling and skip connections to improve the accuracy of downstream dense prediction tasks; and 3) the pretrained MFM can be applied to various downstream tasks through fine-tuning with limited samples. The publicly available WHU and Massachusetts Building Datasets are used to verify the effectiveness of the proposed method. Extensive experiments, covering the main properties of the MFM for generative self-supervised learning, fine-tuning of the MFM on the downstream semantic segmentation task, and comparisons with other state-of-the-art generative self-supervised learning algorithms, show that, through the combined advantages of the CNN and Transformer architectures, the proposed method has better feature extraction capability and higher accuracy on downstream tasks such as semantic segmentation.
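To make the masked modeling idea in the abstract concrete, the following is a minimal sketch of generic random patch masking with a loss computed only on the masked regions. It is not the MFM implementation: the abstract does not specify the masking strategy, mask ratio, patch size, or reconstruction target, so the 75% ratio, the 16-pixel patches, the pixel-space target, and the function names `random_patch_mask`, `apply_patch_mask`, and `masked_mse` are all assumptions chosen for illustration.

```python
import numpy as np


def random_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Return a boolean (grid_h, grid_w) array; True marks a masked patch."""
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    # Pick num_masked distinct patch indices uniformly at random.
    flat[rng.permutation(num_patches)[:num_masked]] = True
    return flat.reshape(grid_h, grid_w)


def to_pixel_mask(mask, patch_size):
    """Expand a patch-level mask to pixel resolution."""
    return np.repeat(np.repeat(mask, patch_size, axis=0), patch_size, axis=1)


def apply_patch_mask(image, mask, patch_size):
    """Zero out the masked patches of an (H, W, C) image."""
    masked = image.copy()
    masked[to_pixel_mask(mask, patch_size)] = 0.0
    return masked


def masked_mse(pred, target, mask, patch_size):
    """Mean squared error over masked pixels only (assumed pixel-space target)."""
    pixel_mask = to_pixel_mask(mask, patch_size)
    return float(((pred - target) ** 2)[pixel_mask].mean())


# Hypothetical setup: a 64x64 RGB tile split into a 4x4 grid of 16-pixel patches.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
mask = random_patch_mask(4, 4, 0.75, rng)
masked_image = apply_patch_mask(image, mask, 16)
```

In a pretraining loop, `masked_image` would be fed to the backbone and the decoder output compared against the target with `masked_mse`. A method that reconstructs features rather than pixels would swap the target accordingly.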

Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning

Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling

HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training

Point Cloud Domain Adaptation Via Masked Local 3D Structure Prediction

SparseMAE: Sparse Training Meets Masked Autoencoders

Inducing Semi-Structured Sparsity by Masking for Efficient Model Inference in Convolutional Networks

Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation

Effective Sparsification of Neural Networks with Global Sparsity Constraint

SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling

Masked Feature Modeling for Generative Self-Supervised Representation Learning of High-Resolution Remote Sensing Images

Exploring Fine-Grained Sparsity in Convolutional Neural Networks for Efficient Inference

Generative ConvNet Foundation Model With Sparse Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation

Masked Channel Modeling for Bootstrapping Visual Pre-training

Learning with Unmasked Tokens Drives Stronger Vision Learners

Integrating Convolution and Sparse Coding for Learning Low-Dimensional Discriminative Image Representations

A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Masked Graph Modeling with Multi-View Contrast

Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning

SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks

Revisiting Sparse Convolutional Model for Visual Recognition