A ViT-Based Multiscale Feature Fusion Approach for Remote Sensing Image Segmentation

Wen Wang, Chen Tang, Bin Zheng, Xin Wang
DOI: https://doi.org/10.1109/LGRS.2022.3187135
IF: 5.343
IEEE Geoscience and Remote Sensing Letters
Abstract: Semantic segmentation plays an indispensable role in the automatic analysis of remote sensing image data. However, the abundant semantic information and irregular shape patterns in remote sensing images are difficult to exploit, which makes it hard to segment such images using only convolution and single-scale feature maps. To achieve better segmentation performance, a multiscale feature pyramid decoder (MFPD) is proposed to fuse image features extracted by a vision transformer (ViT). The decoder employs a novel 2-D-to-3-D transform method to obtain multiscale feature maps that contain rich context information, and fuses these maps by channel concatenation. Furthermore, a dimension attention module (DAM) is designed to further aggregate the context information of the extracted remote sensing image features. The approach achieves a mean intersection over union (mIoU) of 60.42% on the Gaofen2-CZ dataset and 68.21% on the GID-5 dataset. Experimental results indicate that its overall performance exceeds that of the compared segmentation methods based on convolutional neural networks (CNNs) and ViT.
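For readers unfamiliar with the reported metric, the following is a minimal sketch of how mIoU is commonly computed from predicted and ground-truth label maps. It is not taken from the paper: the function name `mean_iou`, the toy labels, and the choice to average only over classes that occur in either map are assumptions made for illustration, and the authors' exact evaluation protocol (for example, whether any class is ignored) is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for flattened integer label maps.

    Assumption: classes absent from both pred and target are skipped
    rather than counted as IoU = 0.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel is in both the predicted and true region of class p.
            intersection[p] += 1
            union[p] += 1
        else:
            # Pixel is in the predicted region of p and the true region of t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    # Guard against empty input, where no class is present at all.
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened 2x3 label maps with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. A score such as the reported 60.42% would correspond to the same average taken over all test pixels of a dataset.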
Computer Science, Environmental Science
What problem does this paper attempt to address?