Abstract: Both local and global context information are essential for the semantic segmentation of aerial images. Convolutional neural networks (CNNs) capture local context well but cannot model global dependencies. Vision transformers (ViTs) are good at extracting global information but do not retain spatial details well. To leverage the advantages of both paradigms, we integrate them into one model in this study. However, the global token interaction of ViTs incurs a high computational cost, which makes them difficult to apply to large aerial images. To address this problem, we propose a novel efficient ViT block named long-short-range transformer (LSRFormer). Unlike mainstream ViTs, which are designed as backbones, LSRFormer is a pretraining-free, plug-and-play module that is appended after CNN stages to supplement global information. It is composed of long-range self-attention (LR-SA), short-range self-attention (SR-SA), and a multiscale-convolutional feed-forward network (MSC-FFN). LR-SA establishes long-range dependencies at the junctions of the windows, and SR-SA diffuses the long-range information from the window boundaries to the window interiors. MSC-FFN captures multiscale information inside the ViT block. We append an LSRFormer block after each CNN stage of a pure convolutional network to build a model named ConvLSR-Net. Compared with existing models that combine CNNs and ViTs, our model can learn both local and global representations at all stages. In particular, ConvLSR-Net achieves state-of-the-art (SOTA) results on four challenging aerial image segmentation benchmarks: iSAID, LoveDA, ISPRS Potsdam, and ISPRS Vaihingen. The code has been released at https://github.com/stdcoutzrh/ConvLSR-Net.
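
The cost argument in the abstract can be made concrete with a rough count of how many token pairs an attention layer has to score. The sketch below is not the LSRFormer computation and is not taken from the released code; it only contrasts plain global self-attention with attention restricted to non-overlapping windows. The function names, the 128 x 128 feature map, and the window size of 8 are assumptions chosen for illustration.

```python
def global_attention_pairs(height, width):
    """Token pairs scored by global self-attention over an H x W feature map."""
    tokens = height * width
    return tokens * tokens  # every token attends to every token


def window_attention_pairs(height, width, window):
    """Token pairs scored when attention stays inside non-overlapping
    window x window tiles (assumes H and W are divisible by the window)."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


if __name__ == "__main__":
    # Hypothetical sizes: a 128 x 128 feature map and 8 x 8 windows.
    h = w = 128
    full = global_attention_pairs(h, w)
    local = window_attention_pairs(h, w, window=8)
    print(f"global pairs:   {full}")
    print(f"windowed pairs: {local}")
    print(f"ratio:          {full // local}")
```

Under these assumed sizes the global count grows with the square of the number of tokens, while the windowed count grows linearly with it, which is why window-based designs need some extra mechanism to carry information across windows.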

Context and Apparent Features Aggregation Network for Semantic Segmentation

Attention-guided chained context aggregation for semantic segmentation

Attention Guided Global Enhancement and Local Refinement Network for Semantic Segmentation

Integrating Spatial Details with Long-Range Contexts for Semantic Segmentation of Very High-Resolution Remote-Sensing Images

CAN: Contextual Aggregating Network for Semantic Segmentation

Context Aggregation Network for Remote Sensing Image Semantic Segmentation

Context Aggregation Network For Semantic Labeling In Aerial Images

Cross Aggregation Network for Semantic Segmentation

HCNet: Hierarchical Context Network for Semantic Segmentation

LACTNet: A Lightweight Real-Time Semantic Segmentation Network Based on an Aggregated Convolutional Neural Network and Transformer

DCANet: Dense Context-Aware Network for Semantic Segmentation

Long and short-range relevance context network for semantic segmentation

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

LSRFormer: Efficient Transformer Supply Convolutional Neural Networks With Global Information for Aerial Image Segmentation

Multiscale Global Context Network for Semantic Segmentation of High-Resolution Remote Sensing Images

SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation

Compensating for Local Ambiguity With Encoder-Decoder in Urban Scene Segmentation

Incorporating convolutional and transformer architectures to enhance semantic segmentation of fine-resolution urban images

CCNet: Criss-Cross Attention for Semantic Segmentation