SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation

Min Yao, Yaozu Zhang, Guofeng Liu, Dongdong Pang

DOI: https://doi.org/10.1109/jstars.2024.3349657

IF: 4.715

2024-02-02

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

Abstract: Remote sensing semantic segmentation still faces various challenges due to the diversity and complexity of objects. Transformer-based models have significant advantages in capturing global feature dependencies for segmentation. However, they ignore local feature details. Convolutional neural networks (CNNs), which use a different interaction mechanism from transformer-based models, capture small-scale local features rather than global features. In this article, a new semantic segmentation framework named SSNet is proposed, which adopts an encoder–decoder structure and exploits the advantages of both local and global features. In addition, we build a feature fuse module and a feature inject module to fuse these two styles of features. The former module captures the dependencies between different positions and channels to extract multiscale features, which improves the segmentation precision on similar objects. The latter module condenses the global information in the transformer and injects it into the CNN to obtain a broad global field of view; within this module, a depthwise strip convolution improves the segmentation accuracy on tiny objects. A CNN-based decoder progressively recovers the feature map size, and an atrous spatial pyramid pooling block is adopted in the decoder to obtain multiscale context. Skip connections are established between the decoder and the encoder, which retain important feature information from the shallow layers of the network and help multiscale features flow through it. To evaluate our model, we compare it with current state-of-the-art models on the WHDLD and Potsdam datasets. The experimental results indicate that our proposed model achieves more precise semantic segmentation.
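The abstract names a depthwise strip convolution but gives no details of the layer. As a rough illustration of what such an operation generally looks like, here is a minimal pure-Python sketch. It assumes a 1×k horizontal pass followed by a k×1 vertical pass applied to each channel independently, with zero padding. The kernel size, the pass order, and the names `conv_strip` and `depthwise_strip_conv` are illustrative assumptions, not taken from the paper.

```python
from typing import List

Grid = List[List[float]]   # one channel, H x W
Tensor = List[Grid]        # C x H x W


def conv_strip(channel: Grid, kernel: List[float], horizontal: bool) -> Grid:
    """Slide a 1 x k (horizontal) or k x 1 (vertical) strip kernel over one channel.

    Zero padding keeps the output the same size as the input. The kernel
    length is assumed to be odd. As in most deep learning frameworks, this
    is cross-correlation (the kernel is not flipped).
    """
    h, w = len(channel), len(channel[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i, weight in enumerate(kernel):
                yy, xx = (y, x + i - r) if horizontal else (y + i - r, x)
                if 0 <= yy < h and 0 <= xx < w:  # outside the image counts as zero
                    acc += weight * channel[yy][xx]
            out[y][x] = acc
    return out


def depthwise_strip_conv(
    x: Tensor,
    row_kernels: List[List[float]],
    col_kernels: List[List[float]],
) -> Tensor:
    """Depthwise: channel c is filtered only with its own pair of strip kernels."""
    return [
        conv_strip(conv_strip(ch, row_kernels[c], True), col_kernels[c], False)
        for c, ch in enumerate(x)
    ]


# Usage: a single bright pixel in a 5 x 5 channel spreads into a 3 x 3 block.
image = [[[0.0] * 5 for _ in range(5)]]
image[0][2][2] = 1.0
result = depthwise_strip_conv(image, [[1.0, 1.0, 1.0]], [[1.0, 1.0, 1.0]])
```

A pair of strip kernels covers a k×k neighborhood with 2k weights per channel instead of k², which is one common reason for choosing this form when a wide receptive field is wanted at low cost.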

imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geography, physical

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address various challenges in semantic segmentation of remote sensing images, specifically including:

1. **Background Complexity**: Complex backgrounds can easily interfere with the recognition of small objects.
2. **Highly Similar Objects**: Objects with highly similar shapes, colors, and textures are difficult to distinguish.
3. **Tiny Objects in High-Resolution Images**: Tiny objects are hard to identify in high-resolution images.

Currently, Transformer-based models have significant advantages in capturing global feature dependencies but overlook local feature details. On the other hand, Convolutional Neural Networks (CNNs), although not as adept as Transformer models in capturing global features, perform better in capturing local small-scale features. Therefore, this paper proposes a new semantic segmentation network framework, SSNet, which combines an encoder-decoder structure, optimizing the advantages of both global and local features. It achieves effective fusion of these two types of features through the Feature Fusion Module (FFM) and Feature Injection Module (FIM). Experimental results show that SSNet achieves more accurate semantic segmentation on the WHDLD and Potsdam datasets.

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images

ACTNet: A Dual-Attention Adapter with a CNN-Transformer Network for the Semantic Segmentation of Remote Sensing Imagery

TMNet: A Two-Branch Multi-Scale Semantic Segmentation Network for Remote Sensing Images

PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation

CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation

UNeXt: An Efficient Network for the Semantic Segmentation of High-Resolution Remote Sensing Images

Cascaded CNN and global–local attention transformer network-based semantic segmentation for high-resolution remote sensing image

LACTNet: A Lightweight Real-Time Semantic Segmentation Network Based on an Aggregated Convolutional Neural Network and Transformer

TCUNet: A Lightweight Dual-Branch Parallel Network for Sea-Land Segmentation in Remote Sensing Images

Multiscale Global Context Network for Semantic Segmentation of High-Resolution Remote Sensing Images

A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images

Encoder- and Decoder-Based Networks Using Multiscale Feature Fusion and Nonlocal Block for Remote Sensing Image Semantic Segmentation

Category attention guided network for semantic segmentation of Fine-Resolution remote sensing images

DSHNet: A Semantic Segmentation Model of Remote Sensing Images Based on Dual Stream Hybrid Network

Local-enhanced multi-scale aggregation swin transformer for semantic segmentation of high-resolution remote sensing images

FSegNet: A Semantic Segmentation Network for High-Resolution Remote Sensing Images That Balances Efficiency and Performance