Abstract: Extracting urban buildings from high-spatial-resolution imagery plays a crucial role in urban planning, emergency response, and disaster management. However, efficient and accurate building extraction is hindered by three challenges: missing building contours caused by occlusion (between buildings of different heights, and of buildings by trees), uneven contour extraction caused by building edges mixing with other feature elements (roads, vehicles, and trees), and slow training speed on high-resolution image data. To address these issues, we propose a semantic segmentation model composed of a lightweight backbone, a coordinate attention module, and a pooling fusion module, which achieves lightweight building extraction and adaptive recovery of spatial contours. We conducted comparative experiments on a dataset of typical urban building instances in China and on the Mapchallenge dataset, comparing our method with several classical and mainstream semantic segmentation algorithms. The results demonstrate the effectiveness of our approach, which achieves excellent mean intersection over union (mIoU) and frames per second (FPS) scores on both datasets (China dataset: 85.11% and 110.67 FPS; Mapchallenge dataset: 90.27% and 117.68 FPS). Quantitative evaluations indicate that our model not only significantly improves computational speed but also maintains high accuracy in extracting urban buildings from high-resolution imagery. Specifically, on the China dataset, our model improves accuracy by 0.64% and speed by 70.03 FPS compared to the baseline model. On the Mapchallenge dataset, it improves accuracy by 0.54% and speed by 42.39 FPS compared to the baseline model. Our research indicates that lightweight networks show significant potential in urban building extraction tasks. In future work, segmentation accuracy and prediction speed could be further balanced by adjusting the deep learning model or introducing remote sensing indices, and the approach could be applied to research scenarios such as greenfield extraction or multi-class target extraction.
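To make the accuracy metric concrete, the following is a minimal sketch of how mean intersection over union (mIoU) is commonly computed for a two-class (background/building) segmentation mask. It is an illustration only and not the evaluation code used in this work: the function name `mean_iou`, the nested-list mask format, and the choice to skip classes absent from both masks are all assumptions.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[Sequence[int]],
             truth: Sequence[Sequence[int]],
             num_classes: int = 2) -> float:
    """Average the per-class IoU over classes present in pred or truth.

    Hypothetical helper: masks are nested sequences of integer class labels
    (0 = background, 1 = building) with identical shapes.
    """
    ious: List[float] = []
    for cls in range(num_classes):
        intersection = 0
        union = 0
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                in_pred = (p == cls)
                in_truth = (t == cls)
                # Pixel counts toward the intersection if both masks agree on cls,
                # and toward the union if either mask assigns cls.
                intersection += int(in_pred and in_truth)
                union += int(in_pred or in_truth)
        if union > 0:  # assumption: ignore classes absent from both masks
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class from range(num_classes) appears in either mask")
    return sum(ious) / len(ious)


# Toy 2x4 masks: one building pixel is missed by the prediction.
truth_mask = [[0, 0, 1, 1],
              [0, 0, 1, 1]]
pred_mask = [[0, 0, 0, 1],
             [0, 0, 1, 1]]

# Building IoU = 3/4, background IoU = 4/5, so mIoU = 0.775.
print(round(mean_iou(pred_mask, truth_mask), 3))
```

In this toy case the single missed building pixel lowers the building IoU to 0.75 and the background IoU to 0.80, which shows why contour errors at building edges have a visible effect on mIoU.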

Bounding Boxes Are All We Need: Street View Image Classification via Context Encoding of Detected Buildings

Building instance classification using street view images

Extracting Buildings from Remote Sensing Images Using a Multitask Encoder-Decoder Network with Boundary Refinement

Context-Enhanced Detector For Building Detection From Remote Sensing Images

BOMSC-Net: Boundary Optimization and Multi-Scale Context Awareness Based Building Extraction From High-Resolution Remote Sensing Imagery

UB-FineNet: Urban Building Fine-grained Classification Network for Open-access Satellite Images

Building Extraction From High Spatial Resolution Remote Sensing Images of Complex Scenes by Combining Region-Line Feature Fusion and OCNN

BUILDING CLASSIFICATION OF VHR AIRBORNE STEREO IMAGES USING FULLY CONVOLUTIONAL NETWORKS AND FREE TRAINING SAMPLES

Building Usage Prediction in Complex Urban Scenes By Fusing Text and Facade Features from Street View Images Using Deep Learning

Context Encoding for Semantic Segmentation

A natural language processing-based approach: mapping human perception by understanding deep semantic features in street view images

Attention-Gate-Based Encoder–Decoder Network for Automatical Building Extraction

Automatic Building Extraction from Google Earth Images under Complex Backgrounds Based on Deep Instance Segmentation Network

Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models

Fine-Grained Building Function Recognition from Street-View Images via Geometry-Aware Semi-Supervised Learning

CBF-Net: An Adaptive Context Balancing and Feature Filtering Network for Point Cloud Classification

Using Social Media Images for Building Function Classification

A Lightweight Building Extraction Approach for Contour Recovery in Complex Urban Environments

Building Detection in High-Resolution Remote Sensing Images by Enhancing Superpixel Segmentation and Classification Using Deep Learning Approaches

Scene Classification in Indoor Environments for Robots using Context Based Word Embeddings