SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation

Min Yao, Yaozu Zhang, Guofeng Liu, Dongdong Pang
DOI: https://doi.org/10.1109/jstars.2024.3349657
IF: 4.715
2024-02-02
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Abstract: Remote sensing semantic segmentation still faces various challenges due to the diversity and complexity of objects. Transformer-based models have significant advantages in capturing global feature dependencies for segmentation. However, they ignore local feature details. Convolutional neural networks (CNNs), whose interaction mechanism differs from that of transformer-based models, capture small-scale local features rather than global features. In this article, a new semantic segmentation framework named SSNet is proposed, which incorporates an encoder–decoder structure and exploits the advantages of both local and global features. In addition, we build a feature fuse module and a feature inject module to fuse these two styles of features. The former module captures the dependencies between different positions and channels to extract multiscale features, which improves segmentation precision on similar objects. The latter module condenses the global information from the transformer and injects it into the CNN to obtain a broad global field of view; within it, depthwise strip convolution improves segmentation accuracy on tiny objects. A CNN-based decoder progressively recovers the feature map size, and an atrous spatial pyramid pooling block is adopted in the decoder to obtain multiscale context. Skip connections between the decoder and the encoder retain important feature information from the shallow layers and help multiscale features flow through the network. To evaluate our model, we compare it with current state-of-the-art models on the WHDLD and Potsdam datasets. The experimental results indicate that our proposed model achieves more precise semantic segmentation.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geography, physical
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address various challenges in semantic segmentation of remote sensing images, specifically including:

1. **Background Complexity**: Complex backgrounds can easily interfere with the recognition of small objects.
2. **Highly Similar Objects**: Objects with highly similar shapes, colors, and textures are difficult to distinguish.
3. **Tiny Objects in High-Resolution Images**: Tiny objects are hard to identify in high-resolution images.

Currently, Transformer-based models have significant advantages in capturing global feature dependencies but overlook local feature details. On the other hand, Convolutional Neural Networks (CNNs), although not as adept as Transformer models in capturing global features, perform better in capturing local small-scale features.

Therefore, this paper proposes a new semantic segmentation network framework, SSNet, which combines an encoder-decoder structure, optimizing the advantages of both global and local features. It achieves effective fusion of these two types of features through the Feature Fusion Module (FFM) and Feature Injection Module (FIM). Experimental results show that SSNet achieves more accurate semantic segmentation on the WHDLD and Potsdam datasets.
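
The abstract credits depthwise strip convolution with improving accuracy on tiny objects but does not spell out the operation. The following is a minimal pure-Python sketch of one common reading of the term, not the authors' implementation: each channel is filtered on its own (depthwise) by a 1 x k horizontal pass followed by a k x 1 vertical pass (strip). The function names, the zero padding, the odd kernel length, and the reuse of one kernel for both directions are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]  # one channel: a list of rows of pixel values


def conv_rows(channel: Grid, kernel: List[float]) -> Grid:
    """Slide a 1 x k kernel along every row; zero padding keeps the width.

    Assumes an odd kernel length so the kernel has a well-defined center.
    """
    half = len(kernel) // 2
    width = len(channel[0])
    out: Grid = []
    for row in channel:
        new_row = []
        for x in range(width):
            total = 0.0
            for j, weight in enumerate(kernel):
                src = x + j - half
                if 0 <= src < width:  # positions outside the image count as zero
                    total += weight * row[src]
            new_row.append(total)
        out.append(new_row)
    return out


def transpose(channel: Grid) -> Grid:
    """Swap rows and columns so a row pass can act as a column pass."""
    return [list(col) for col in zip(*channel)]


def strip_conv(channel: Grid, kernel: List[float]) -> Grid:
    """Horizontal 1 x k pass, then vertical k x 1 pass on the result.

    Reusing the same kernel for both directions is a simplification;
    a trained layer would normally learn separate weights for each pass.
    """
    horizontal = conv_rows(channel, kernel)
    return transpose(conv_rows(transpose(horizontal), kernel))


def depthwise_strip_conv(feature_map: List[Grid],
                         kernels: List[List[float]]) -> List[Grid]:
    """Depthwise: channel c is filtered only with kernels[c]; channels never mix."""
    return [strip_conv(ch, k) for ch, k in zip(feature_map, kernels)]


if __name__ == "__main__":
    ones = [[[1.0, 1.0, 1.0] for _ in range(3)]]  # 1 channel, 3 x 3
    for row in depthwise_strip_conv(ones, [[1.0, 1.0, 1.0]])[0]:
        print(row)
```

Two strips of length k cost 2k weights per channel, against k² for a full k x k kernel, which is the usual motivation for covering a long, thin receptive field this way.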