DESformer: A Dual-branch Encoding Strategy for Semantic Segmentation of Very-High-Resolution Remote Sensing Images Based on Feature Interaction and Multi-Scale Context Fusion

Wenshu Liu, Nan Cui, Luo Guo, Shihong Du, Weiyin Wang
DOI: https://doi.org/10.1109/tgrs.2024.3446628
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Global contextual information is crucial for the semantic segmentation of remote sensing (RS) images. However, the majority of current approaches depend on convolutional neural networks (CNNs). Because convolutional operations have inherently local receptive fields, these networks typically capture image features within limited areas and struggle to comprehend broader contextual information in the images. In this study, a dual-branch encoding approach, DESformer, is proposed that integrates transformers with a CNN to effectively capture global multiscale context information and enhance edge feature extraction. In addition, DESformer incorporates a feature interaction module (FIM) that combines local features extracted by the CNN with global representations extracted by the transformers across different resolutions. This approach enhances the capability to capture local features in RS images and improves the understanding of extensive spatial relationships. Subsequently, we employ a novel top-down approach for global supervision of the traditional feature pyramid multilevel visual integration (MVI) module by harnessing the clear visual center information obtained from the deepest internal features. To concentrate on important information and preserve sensitivity to features at various scales, the preceding shallow features are muted. In addition, a loss function, FIAB-Loss, is introduced that combines focal loss with IoU and active boundary loss (ABL). This composite loss function strengthens the model's focus on categories that are difficult to distinguish. Extensive experiments conducted on three datasets, including the semantic segmentation of lakes in the Tibetan Plateau and the ISPRS Vaihingen benchmark, validate the efficacy of the proposed method. The experimental results indicate that the network performs exceptionally well in processing very-high-resolution (VHR) images and accurately extracts edge features.
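The composite loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' FIAB-Loss: the abstract gives no formulas, so the soft IoU formulation, the focusing parameter `gamma`, the weights `w_focal` and `w_iou`, and all function names here are assumptions chosen for illustration. The active boundary loss (ABL) term is omitted entirely.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def focal_loss(probs, onehot, gamma=2.0, eps=1e-7):
    """Mean focal loss; gamma down-weights pixels that are already well classified."""
    # Probability assigned to the true class of each pixel.
    p_t = np.clip((probs * onehot).sum(axis=-1), eps, 1.0)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def soft_iou_loss(probs, onehot, eps=1e-7):
    """One minus the mean per-class soft IoU (a differentiable IoU surrogate)."""
    axes = tuple(range(probs.ndim - 1))  # sum over every axis except the class axis
    inter = (probs * onehot).sum(axis=axes)
    union = (probs + onehot - probs * onehot).sum(axis=axes)
    return float(1.0 - np.mean((inter + eps) / (union + eps)))

def combined_loss(logits, labels, num_classes, w_focal=1.0, w_iou=1.0, gamma=2.0):
    """Hypothetical weighted sum of focal and soft IoU terms (ABL term not included).

    logits: float array of shape (H, W, num_classes)
    labels: integer array of shape (H, W)
    """
    probs = softmax(logits)
    onehot = np.eye(num_classes)[labels]  # shape (H, W, num_classes)
    return w_focal * focal_loss(probs, onehot, gamma) + w_iou * soft_iou_loss(probs, onehot)
```

In this sketch the focal term emphasizes hard pixels while the IoU term rewards region-level overlap per class; a boundary-aware term such as ABL would be added as a third weighted summand.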