When zero-padding position encoding encounters linear space reduction attention: an efficient semantic segmentation Transformer of remote sensing images

Yi Yan, Jing Zhang, Xinjia Wu, Jiafeng Li, Li Zhuo
DOI: https://doi.org/10.1080/01431161.2023.2299276
IF: 3.531
2024-01-27
International Journal of Remote Sensing
Abstract: Semantic segmentation of remote sensing images (RSIs) is of great significance for obtaining geospatial object information. Transformers achieve promising results, but multi-head self-attention (MSA) is computationally expensive. We propose an efficient semantic segmentation Transformer (ESST) for RSIs that combines zero-padding position encoding with linear space reduction attention (LSRA). First, to capture the coarse-to-fine features of RSIs, we propose a zero-padding position encoding that adds overlapping patch embedding (OPE) layers and convolution feed-forward networks (CFFN) to improve the local continuity of features. Then, we replace the standard attention operation with LSRA to extract multi-level features while reducing the computational cost of the encoder. Finally, we design a lightweight all multi-layer perceptron (all-MLP) head decoder that aggregates multi-level features into multi-scale features for semantic segmentation. Experimental results on the Potsdam and Vaihingen datasets demonstrate that our method strikes a balance between accuracy and speed for semantic segmentation of RSIs.
imaging science & photographic technology, remote sensing
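
The abstract's claim that LSRA lowers the encoder's cost can be illustrated with a rough operation count. The sketch below is not the authors' implementation: it assumes that LSRA pools keys and values to a fixed spatial grid before attention, and the grid size of 7 and all function names are hypothetical choices made for illustration. Under that assumption, the cost grows linearly with the number of tokens, whereas standard MSA grows quadratically.

```python
# Rough cost model only. Function names are hypothetical, and the pooled grid
# size of 7 is an assumption, not a value stated in the abstract.


def attention_macs(num_queries: int, num_keys: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus attn @ V (projections ignored)."""
    return 2 * num_queries * num_keys * dim


def full_msa_macs(height: int, width: int, dim: int) -> int:
    """Standard self-attention: every token attends to every token."""
    n = height * width
    return attention_macs(n, n, dim)


def lsra_macs(height: int, width: int, dim: int, pool: int = 7) -> int:
    """Keys/values pooled to a fixed pool x pool grid (assumed LSRA behavior)."""
    n = height * width
    return attention_macs(n, pool * pool, dim)


if __name__ == "__main__":
    # Compare both cost models as the feature map grows.
    for side in (32, 64, 128):
        full = full_msa_macs(side, side, dim=64)
        reduced = lsra_macs(side, side, dim=64)
        print(f"{side}x{side}: full={full:,} lsra={reduced:,} ratio={full / reduced:.1f}")
```

Doubling the side length of the feature map quadruples the token count, so the pooled variant's cost rises by a factor of 4 while full attention rises by a factor of 16. This is the scaling gap the abstract refers to when it describes MSA as expensive.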