Abstract:Building extraction is significant in urban planning, economic evaluation, and driverless technology development. However, automatic building extraction from high spatial resolution remote sensing images has been a challenging task due to the various building shapes and colors, imaging conditions, and complex background objects. Current methods in building extraction are generally based on deep convolution networks, and they mostly use an encoder-decoder architecture, wherein detailed building features and small buildings are easily omitted in continuous convolution operations. Moreover, buildings with blurred boundaries are only completely extracted with difficulty. To meet these challenges, we propose a multi-task architecture of frequency-spatial learning Transformer to extract buildings from high spatial resolution remote sensing images. Different from current architecture, we designed a frequency-spatial learning module in the framework of multi-task to synthesize the multi-scale spatial features and frequency decomposition features of high-resolution image. Spiking convolution is proposed in this study to enhance the frequency features of buildings by mimicking the neural transmission in human brains. In this way, multi-scale building features can be better preserved and distinguished from background objects. Moreover, a masked-attention Transformer is adopted to improve multi-scale building mask prediction accuracy by synthesizing successive pixel-wise up-sampled feature maps. We also propose a strategy to evaluate the practical transferability of the proposed method by mimicking practical application cases through training and evaluating images with different spatial resolutions from different study areas and datasets. Experiments using five public building datasets (WHU-Building Satellite Dataset I, WHU-Building Satellite Dataset II, Massachusetts Buildings Dataset, Inria Aerial Image Dataset, xBD Building Dataset) demonstrate the strong potential applicability of our proposed method for practical application cases. Our method outperforms five recently proposed state-of-the-art semantic segmentation methods with 36.60% accuracy improvement on extracted buildings and approximately 53.55% recall progress in extracting small building instances. The implementation code will be released after the paper is published.

Trident Cooperation Network for Building Extraction and Height Estimation

LIGHT: Joint Individual Building Extraction and Height Estimation from Satellite Images through a Unified Multitask Learning Network

HGDNet: A Height-Hierarchy Guided Dual-Decoder Network for Single View Building Extraction and Height Estimation

Asymmetric Network Combining CNN and Transformer for Building Extraction from Remote Sensing Images

HTC-DC Net: Monocular Height Estimation from Single Remote Sensing Images

Data Fusion for Multi-Task Learning of Building Extraction and Height Estimation

FusionHeightNet: A Multi-Level Cross-Fusion Method from Multi-Source Remote Sensing Images for Urban Building Height Estimation

MF-BHNet: A Hybrid Multimodal Fusion Network for Building Height Estimation Using Sentinel-1 and Sentinel-2 Imagery

MSFTrans: a multi-task frequency-spatial learning transformer for building extraction from high spatial resolution remote sensing images

Extracting Buildings from Remote Sensing Images Using a Multitask Encoder-Decoder Network with Boundary Refinement

A Dual-Branch Fusion Network Based on Reconstructed Transformer for Building Extraction in Remote Sensing Imagery

ACMFNet: Attention-Based Cross-Modal Fusion Network for Building Extraction of Remote Sensing Images

Multi-Modality Task Cascade for 3D Object Detection

TFIENet: Transformer Fusion Information Enhancement Network for Multi-Model 3D Object Detection

DSAT-Net: Dual Spatial Attention Transformer for Building Extraction From Aerial Images

Multi-Task Foreground-Aware Network with Depth Completion for Enhanced RGB-D Fusion Object Detection Based on Transformer

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

BCTNet: Bi-Branch Cross-Fusion Transformer for Building Footprint Extraction

Asymmetric Cascade Fusion Network for Building Extraction

B-FGC-Net: A Building Extraction Network from High Resolution Remote Sensing Imagery

TFIENet: Transformer Fusion Information Enhancement Network for Multimodel 3-D Object Detection.