Abstract: Building extraction is important for urban planning, economic evaluation, and the development of driverless technology. However, automatic building extraction from high spatial resolution remote sensing images remains challenging because of the variety of building shapes and colors, varying imaging conditions, and complex background objects. Current building extraction methods are generally based on deep convolutional networks and mostly use an encoder-decoder architecture, in which detailed building features and small buildings are easily lost during successive convolution operations. Moreover, buildings with blurred boundaries are difficult to extract completely. To address these challenges, we propose a multi-task frequency-spatial learning Transformer architecture for extracting buildings from high spatial resolution remote sensing images. Unlike current architectures, we design a frequency-spatial learning module within a multi-task framework to combine the multi-scale spatial features and frequency decomposition features of high-resolution images. We propose spiking convolution to enhance the frequency features of buildings by mimicking neural transmission in the human brain. In this way, multi-scale building features are better preserved and distinguished from background objects. In addition, a masked-attention Transformer is adopted to improve the accuracy of multi-scale building mask prediction by synthesizing successive pixel-wise up-sampled feature maps. We also propose a strategy for evaluating the practical transferability of the proposed method: to mimic practical application cases, we train and evaluate on images with different spatial resolutions from different study areas and datasets.
Experiments on five public building datasets (WHU-Building Satellite Dataset I, WHU-Building Satellite Dataset II, Massachusetts Buildings Dataset, Inria Aerial Image Dataset, xBD Building Dataset) demonstrate the strong potential of the proposed method for practical applications. Our method outperforms five recently proposed state-of-the-art semantic segmentation methods, with a 36.60% accuracy improvement on extracted buildings and an approximately 53.55% improvement in recall for small building instances. The implementation code will be released after the paper is published.
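To make the idea of "frequency decomposition features" concrete, the following is a minimal illustrative sketch only. It assumes a one-level 2D Haar-style decomposition, which the abstract does not specify, and it is not the frequency-spatial learning module or the spiking convolution of the proposed method. The function name `haar_decompose` and the band names are hypothetical. The sketch splits a small grayscale image into one low-frequency band (local averages) and three high-frequency bands (local differences), where edges such as building boundaries appear.

```python
from typing import List, Tuple

Image = List[List[float]]


def haar_decompose(img: Image) -> Tuple[Image, Image, Image, Image]:
    """One-level 2D Haar-style decomposition (averaging variant).

    Assumption for illustration: the image is a list of rows with even
    height and width. Returns four half-size bands:
      approx     - mean of each 2x2 block (low frequency)
      left_right - left column minus right column of each block
      top_bottom - top row minus bottom row of each block
      diagonal   - main diagonal minus anti-diagonal of each block
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")

    approx, left_right, top_bottom, diagonal = [], [], [], []
    for r in range(0, h, 2):
        row_a, row_lr, row_tb, row_dg = [], [], [], []
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            row_a.append((a + b + d + e) / 4.0)
            row_lr.append((a - b + d - e) / 4.0)
            row_tb.append((a + b - d - e) / 4.0)
            row_dg.append((a - b - d + e) / 4.0)
        approx.append(row_a)
        left_right.append(row_lr)
        top_bottom.append(row_tb)
        diagonal.append(row_dg)
    return approx, left_right, top_bottom, diagonal


# A 2x2 patch with a vertical edge: dark on the left, bright on the right.
edge_patch = [[0, 8],
              [0, 8]]
approx, left_right, top_bottom, diagonal = haar_decompose(edge_patch)
```

In this toy patch only the `left_right` band is non-zero, because the intensity changes from left to right and is constant from top to bottom. A flat patch gives zeros in all three difference bands, which is the sense in which high-frequency bands isolate boundary detail from smooth regions.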

STransU2Net: Transformer based hybrid model for building segmentation in detailed satellite imagery

Asymmetric Network Combining CNN and Transformer for Building Extraction from Remote Sensing Images

Cross-level and multiscale CNN-Transformer network for automatic building extraction from remote sensing imagery

A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction

Multiscale Feature Learning by Transformer for Building Extraction From Satellite Images

MSFTrans: a multi-task frequency-spatial learning transformer for building extraction from high spatial resolution remote sensing images

Hybrid transformer-CNN networks using superpixel segmentation for remote sensing building change detection

SDSC-UNet: Dual Skip Connection ViT-Based U-Shaped Model for Building Extraction

LiteST-Net: A Hybrid Model of Lite Swin Transformer and Convolution for Building Extraction from Remote Sensing Image

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

DSAT-Net: Dual Spatial Attention Transformer for Building Extraction From Aerial Images

A scale robust convolutional neural network for automatic building extraction from aerial and satellite imagery

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

UAVformer: A Composite Transformer Network for Urban Scene Segmentation of UAV Images

CD-TransUNet: A Hybrid Transformer Network for the Change Detection of Urban Buildings Using L-Band SAR Images

Transformer-based semantic segmentation for large-scale building footprint extraction from very-high resolution satellite images

HA U-Net: Improved Model for Building Extraction From High Resolution Remote Sensing Imagery

Building Extraction With Vision Transformer

C1 dissociation. Spontaneous generation in human serum of a trimer complex containing C1 inactivator, activated C1r, and zymogen C1s.

STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation

Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images