CTCFNet: CNN-Transformer Complementary and Fusion Network for High-Resolution Remote Sensing Image Semantic Segmentation

Chen Lu,Xian Zhang,Kaile Du,Han Xu,Guangcan Liu
DOI: https://doi.org/10.1109/tgrs.2024.3458446
2024-01-01
Abstract:Semantic segmentation of high-resolution remote sensing images poses challenges such as scale variability, diverse objects, and obstruction by surface elements. These factors often lead existing methods to suffer from issues like missed and false detections, as well as coarse segmentation boundaries. To tackle these challenges, this article proposes a CNN-transformer complementary and fusion network, termed as CTCFNet. It aims to enhance segmentation accuracy and robustness by extracting and integrating the complementary global and local information from high-resolution remote sensing images. The CTCFNet operates through two primary stages: feature extraction and fusion. In the feature extraction stage, a feature extractor employs convolutional neural network (CNN) and pyramid vision transformer (PVT) blocks to extract both local and global features. A boundary loss is also proposed to improve the segmentation performance for object textures and boundaries. In the feature fusion stage, a feature aggregation module (FAM) is first designed to effectively fuse local and global features at the same scale, facilitating the feature extractor to obtain more comprehensive representations. On this basis, a bi-directional decoder (BiDecoder) reconstructs multiscale features through both top-down and bottom-up directions, resulting in more precise segmentation outputs. Experiments on several high-resolution remote sensing image datasets demonstrate that the proposed method outperforms the state-of-the-art methods in terms of segmentation accuracy and generalization. The code is available at https://github.com/ChenLu0000/CTCFNet.
What problem does this paper attempt to address?