Abstract: Building extraction is important for urban planning, economic evaluation, and the development of driverless technology. However, automatically extracting buildings from high spatial resolution remote sensing images remains challenging because of the diversity of building shapes and colors, varying imaging conditions, and complex background objects. Current building extraction methods are generally based on deep convolutional networks and mostly adopt an encoder-decoder architecture, in which detailed building features and small buildings are easily lost through successive convolution operations. Moreover, buildings with blurred boundaries are difficult to extract completely. To address these challenges, we propose a multi-task frequency-spatial learning Transformer that extracts buildings from high spatial resolution remote sensing images. Unlike current architectures, it includes a frequency-spatial learning module within a multi-task framework that fuses the multi-scale spatial features and the frequency decomposition features of high-resolution images. We also propose spiking convolution, which enhances the frequency features of buildings by mimicking neural transmission in the human brain. In this way, multi-scale building features are better preserved and more easily distinguished from background objects. In addition, a masked-attention Transformer is adopted to improve the accuracy of multi-scale building mask prediction by fusing successive pixel-wise up-sampled feature maps. Finally, we propose a strategy for evaluating the practical transferability of the proposed method that mimics real application cases by training and evaluating on images with different spatial resolutions from different study areas and datasets.
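The abstract names a masked-attention Transformer but gives no implementation details. The sketch below assumes the masked cross-attention formulation popularized by Mask2Former, in which each mask query attends only to the pixels that its previous-layer mask prediction marks as foreground. It is a single-head NumPy illustration without learned projections; the function name `masked_cross_attention` and the 0.5 threshold are assumptions, and this is not the authors' code.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Single-head cross-attention restricted to each query's predicted mask.

    queries:     (N, d)  one embedding per mask query (e.g. one candidate building)
    keys/values: (HW, d_k), (HW, d_v) flattened pixel features from one decoder scale
    mask_logits: (N, HW) mask logits predicted by the previous decoder layer
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (N, HW) scaled dot products
    probs = 1.0 / (1.0 + np.exp(-mask_logits))        # sigmoid -> foreground probability
    inside = probs >= threshold                       # pixels each query may attend to
    # A query whose mask is empty would get an all -inf row (NaN after softmax),
    # so let it attend to every pixel instead.
    inside[~inside.any(axis=1)] = True
    scores = np.where(inside, scores, -np.inf)        # block background pixels
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)                          # exp(-inf) == 0 for blocked pixels
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights
```

In a full decoder this step would be repeated layer by layer, with the mask logits refreshed after each layer and the key/value features taken from successively finer up-sampled feature maps; that scheduling is likewise an assumption here.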
Experiments on five public building datasets (WHU-Building Satellite Dataset I, WHU-Building Satellite Dataset II, Massachusetts Buildings Dataset, Inria Aerial Image Dataset, and xBD Building Dataset) demonstrate the strong potential of the proposed method for practical applications. Our method outperforms five recently proposed state-of-the-art semantic segmentation methods, with a 36.60% improvement in accuracy on extracted buildings and an improvement of approximately 53.55% in recall for small building instances. The implementation code will be released after the paper is published.
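The abstract does not define how accuracy or small-building recall are computed. As a hedged illustration only, the sketch below computes common pixel-wise metrics for binary building masks, plus an object-level recall for small buildings. The helper names, the 4-connectivity, the `max_area` size cutoff, and the `min_coverage` detection criterion are all hypothetical choices, not the paper's evaluation protocol.

```python
from collections import deque
import numpy as np

def pixel_metrics(pred, gt):
    """Pixel-wise metrics for binary building masks (1 = building)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    tn = int(np.sum(~pred & ~gt))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"accuracy": (tp + tn) / gt.size, "precision": precision,
            "recall": recall, "f1": f1, "iou": iou}

def building_instances(gt):
    """Yield each 4-connected building in a binary mask as a list of (row, col) pixels."""
    gt = gt.astype(bool)
    seen = np.zeros_like(gt)  # boolean "visited" map
    h, w = gt.shape
    for r in range(h):
        for c in range(w):
            if gt[r, c] and not seen[r, c]:
                seen[r, c] = True
                pixels, queue = [], deque([(r, c)])
                while queue:  # breadth-first flood fill of one building
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and gt[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                yield pixels

def small_building_recall(pred, gt, max_area=100, min_coverage=0.5):
    """Share of ground-truth buildings with at most `max_area` pixels whose
    pixels are covered by the prediction at a rate of at least `min_coverage`
    (hypothetical detection criterion)."""
    pred = pred.astype(bool)
    small = detected = 0
    for pixels in building_instances(gt):
        if len(pixels) > max_area:
            continue  # not a "small" building under this cutoff
        small += 1
        rows, cols = zip(*pixels)
        if pred[list(rows), list(cols)].mean() >= min_coverage:
            detected += 1
    return detected / small if small else float("nan")
```

Object-level recall is used here because a pixel-wise score is dominated by large roofs, so missing a handful of small buildings barely changes it. Counting buildings makes such omissions visible.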

MSFTrans: a multi-task frequency-spatial learning transformer for building extraction from high spatial resolution remote sensing images
