Abstract: Building extraction is important for urban planning, economic evaluation, and the development of driverless technology. However, automatic building extraction from high spatial resolution remote sensing images remains challenging because of the variety of building shapes and colors, varying imaging conditions, and complex background objects. Current building extraction methods are generally based on deep convolutional networks, and most use an encoder-decoder architecture in which detailed building features and small buildings are easily lost through successive convolution operations. Moreover, buildings with blurred boundaries are difficult to extract completely. To meet these challenges, we propose a multi-task frequency-spatial learning Transformer architecture to extract buildings from high spatial resolution remote sensing images. Unlike current architectures, we design a frequency-spatial learning module within the multi-task framework to synthesize the multi-scale spatial features and frequency decomposition features of high-resolution images. Spiking convolution is proposed in this study to enhance the frequency features of buildings by mimicking neural transmission in the human brain. In this way, multi-scale building features can be better preserved and distinguished from background objects. Moreover, a masked-attention Transformer is adopted to improve the accuracy of multi-scale building mask prediction by synthesizing successive pixel-wise up-sampled feature maps. We also propose a strategy to evaluate the practical transferability of the proposed method: we mimic practical application cases by training and evaluating on images with different spatial resolutions from different study areas and datasets.
Experiments on five public building datasets (WHU-Building Satellite Dataset I, WHU-Building Satellite Dataset II, Massachusetts Buildings Dataset, Inria Aerial Image Dataset, xBD Building Dataset) demonstrate the strong potential of our proposed method for practical applications. Our method outperforms five recently proposed state-of-the-art semantic segmentation methods, with a 36.60% accuracy improvement on extracted buildings and an approximately 53.55% recall improvement in extracting small building instances. The implementation code will be released after the paper is published.

MSFTrans: a multi-task frequency-spatial learning transformer for building extraction from high spatial resolution remote sensing images
