Abstract: Remote sensing images are widely used in land monitoring, environmental perception, disaster prediction, and urban analysis. Most commercial satellites, such as WorldView-4, QuickBird, and WorldView-2, carry sensors that acquire panchromatic and multispectral images simultaneously. Panchromatic images have high spatial resolution but only a single band, whereas multispectral images have low spatial resolution because of the bandwidth limitations of the equipment. To capture finer detail of the observed scene, the panchromatic and multispectral images can be fused into an image with both high spatial and high spectral resolution. Fusion methods for multispectral and panchromatic images fall into four categories: multi-resolution analysis, component substitution, variational optimization, and deep learning. Deep learning has a stronger feature extraction ability than the traditional methods and is therefore widely used. Recently, the transformer architecture has been introduced into state-of-the-art remote sensing image fusion methods. To address the failure of existing transformer-based methods to fully integrate the multi-scale features of remote sensing images, this paper proposes MSCANet, a multispectral-panchromatic fusion network based on an improved Swin transformer. The model extracts features from the multispectral and panchromatic images separately using two-stream branches. The downsampled feature maps are concatenated and fed into the fusion network. To make feature extraction more robust across complex ground scenes, a Multiscale Swin-transformer with Channel Attention (MSCA) unit is integrated into the fusion part.
The unit replaces the MLP part of the Swin transformer with a cascade module of multi-scale convolution and channel attention, which better fuses the feature information of ground objects of different sizes in remote sensing images and exploits the long-range dependence between regions. The fusion network focuses on predicting the high-frequency details lost in the multispectral image; these details are then added to the original image to restore a high-resolution multispectral image. Simulated-data and real-data experiments were conducted on datasets from three commercial satellites. In the simulated-data experiments, the fusion results were evaluated by computing the difference between the reference image and the fusion result on the simulated dataset. Compared with other methods, MSCANet achieves the best visual quality and the best quantitative metrics. Relative to the second-best method, the ERGAS index of MSCANet on the three datasets decreased by 11.99%, 0.4%, and 3.43%, respectively. In the experiments on the three real datasets, MSCANet again gives the best result when visual effect and quantitative metrics are considered together. Ablation experiments were conducted for the three fusion strategies proposed in this paper. The results show that the detail-injection model used in this paper outperforms the non-injection model. They also show that replacing the MLP module in the MSCA module and adding the attention mechanism both improve fusion performance. In addition, adding a spectral loss and a spatial structure loss to the MAE loss improves spectral fidelity and spatial resolution. In conclusion, the comparison and ablation experiments verify the effectiveness of the proposed method.
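The detail-injection idea described above (predict the missing high-frequency details, then add them to the upsampled multispectral image) can be illustrated with a minimal NumPy sketch. This is not the MSCANet implementation: the function names, the nearest-neighbour upsampling, the 4:1 resolution ratio, and the box-filter high-pass that stands in for the learned fusion network are all assumptions made for illustration.

```python
import numpy as np

def upsample_nearest(ms, scale):
    """Upsample an (H, W, C) image by pixel replication (assumed interpolation)."""
    return np.repeat(np.repeat(ms, scale, axis=0), scale, axis=1)

def box_blur(img, k=5):
    """Mean filter with edge padding, used here as a crude low-pass filter."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def inject_details(ms_lr, pan, scale=4, predict_details=None):
    """Return upsampled MS plus a detail term (hypothetical helper).

    ms_lr: (h, w, C) low-resolution multispectral image.
    pan:   (h*scale, w*scale) panchromatic image.
    predict_details: optional callable standing in for a trained network.
    """
    ms_up = upsample_nearest(np.asarray(ms_lr, dtype=np.float64), scale)
    pan = np.asarray(pan, dtype=np.float64)
    if predict_details is None:
        # Stand-in for the learned network: the high-pass component of PAN,
        # copied to every band (classic high-pass injection, not MSCANet).
        high = pan - box_blur(pan)
        details = np.repeat(high[..., None], ms_up.shape[-1], axis=2)
    else:
        details = predict_details(ms_up, pan)
    # Injection step: the output is the upsampled input plus predicted details.
    return ms_up + details

rng = np.random.default_rng(0)
ms_lr = rng.random((8, 8, 4))
pan = rng.random((32, 32))
fused = inject_details(ms_lr, pan)
```

Because the network only has to model the residual details, a flat panchromatic image contributes nothing here and the output reduces to the upsampled multispectral input.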
In future work, MSCANet is expected to be extended to the fusion of multispectral and hyperspectral images, the fusion of visible and infrared images, and other similar tasks, in order to improve the generalization of the proposed model.
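For readers unfamiliar with the ERGAS index reported above, the following is a minimal sketch of its standard definition, ERGAS = 100 (h/l) sqrt((1/N) Σ_k (RMSE_k / μ_k)^2), where lower is better. The function name and the default resolution ratio of 4 (multispectral pixel size over panchromatic pixel size) are assumptions; the abstract does not state the ratio used in the experiments.

```python
import numpy as np

def ergas(reference, fused, ratio=4):
    """ERGAS between a reference and a fused image, both shaped (H, W, C).

    ratio is the MS-to-PAN pixel size ratio (assumed 4 here), so the
    leading factor 100 * (h / l) becomes 100 / ratio.
    """
    reference = np.asarray(reference, dtype=np.float64)
    fused = np.asarray(fused, dtype=np.float64)
    if reference.shape != fused.shape:
        raise ValueError("reference and fused must have the same shape")
    # Per-band RMSE and per-band mean of the reference image.
    rmse = np.sqrt(np.mean((reference - fused) ** 2, axis=(0, 1)))
    mean = np.mean(reference, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean((rmse / mean) ** 2))

# Toy check: every band has mean 10 and a constant error of 1.
ref = np.full((4, 4, 2), 10.0)
score = ergas(ref, ref + 1.0)  # 100/4 * sqrt(0.1**2) = 2.5
```

An identical pair of images scores 0, and the relative decreases quoted in the abstract are percentage reductions of this value with respect to the second-best method.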

Remote Sensing Image Fusion Method Based on Improved Swin Transformer