Abstract: Urban road network detection and extraction have significant applications in many domains, such as intelligent transportation and navigation, urban planning, and autonomous driving. Although manual annotation can provide accurate road network maps, its low efficiency and high cost make it inadequate for current tasks. Traditional methods based on spectral or geometric information rely on shallow features and often yield low semantic segmentation accuracy against complex remote sensing backgrounds. In recent years, deep convolutional neural networks (CNNs) have provided robust feature representations for distinguishing complex terrain objects. However, these CNNs neglect the fusion of global and local context and often confuse roads with other types of features, especially buildings. In addition, conventional convolution operations aggregate local feature information with a fixed sampling template, whereas road features present complex linear geometric relationships, which hinders feature construction. To address these issues, we propose a hybrid network structure that combines the advantages of CNN and Transformer models. Specifically, a multiscale deformable convolution module is developed to capture local road context information adaptively. A Transformer is introduced into the encoder to enhance semantic information and build the global context, and the CNN features are fused with the Transformer features. Finally, the model outputs a road extraction prediction map at high spatial resolution. Quantitative analysis and visual results confirm that the proposed model can effectively and automatically extract road features from complex remote sensing backgrounds, outperforming state-of-the-art methods with an IoU of 86.5% and an OA of 97.4%.
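
The two reported metrics, intersection over union (IoU) and overall accuracy (OA), are standard for road segmentation, but the abstract does not state the exact evaluation protocol. The following is a minimal sketch of how they are commonly computed, assuming binary masks (1 = road, 0 = background) stored as nested lists. The function names and the convention of returning 1.0 for empty inputs are illustrative assumptions, not taken from the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equally sized binary masks."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # road predicted, road in ground truth
            elif p and not t:
                fp += 1          # road predicted, background in ground truth
            elif t:
                fn += 1          # background predicted, road in ground truth
            else:
                tn += 1          # background in both
    return tp, fp, fn, tn


def road_iou(pred, truth):
    """IoU of the road class: TP / (TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    union = tp + fp + fn
    # Assumption: two masks with no road pixels at all count as a perfect match.
    return tp / union if union else 1.0


def overall_accuracy(pred, truth):
    """OA: fraction of pixels labelled correctly, (TP + TN) / all pixels."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    total = tp + fp + fn + tn
    # Assumption: empty masks are treated as fully correct.
    return (tp + tn) / total if total else 1.0


if __name__ == "__main__":
    # Hypothetical 2x3 masks, for illustration only.
    pred = [[1, 1, 0],
            [0, 1, 0]]
    truth = [[1, 0, 0],
             [0, 1, 1]]
    print(f"IoU = {road_iou(pred, truth):.3f}")
    print(f"OA  = {overall_accuracy(pred, truth):.3f}")
```

Because road pixels are usually a small fraction of a satellite image, OA tends to be dominated by correctly labelled background, which is one reason it is typically reported alongside IoU rather than on its own.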

A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View

Transformer Union Convolution Network for Visual Object Tracking

An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection

DualBEV: CNN is All You Need in View Transformation

DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

DCTNET: HYBRID NETWORK MODEL FUSING WITH MULTISCALE DEFORMABLE CNN AND TRANSFORMER STRUCTURE FOR ROAD EXTRACTION FROM GAOFEN SATELLITE REMOTE SENSING IMAGE

Camera Perspective Transformation to Bird's Eye View via Spatial Transformer Model for Road Intersection Monitoring

BDTNet: Road Extraction by Bi-Direction Transformer From Remote Sensing Images

L2T-BEV: Local Lane Topology Prediction from Onboard Surround-View Cameras in Bird's Eye View Perspective

WidthFormer: Toward Efficient Transformer-based BEV View Transformation

A duplex transform heterogeneous feature fusion network for road segmentation

BEV-CFKT: A LiDAR-camera cross-modality-interaction fusion and knowledge transfer framework with transformer for BEV 3D object detection

DDCTNet: A Deformable and Dynamic Cross-Transformer Network for Road Extraction From High-Resolution Remote Sensing Images

An End-to-End Multi-Task Learning Model for Drivable Road Detection via Edge Refinement and Geometric Deformation

FedBEVT: Federated Learning Bird's Eye View Perception Transformer in Road Traffic Systems

VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection

A dual-level graph attention network and transformer for enhanced trajectory prediction under road network constraints

RoadCT: A Hybrid CNN-Transformer Network for Road Extraction From Satellite Imagery

TEDNet: Twin Encoder Decoder Neural Network for 2D Camera and LiDAR Road Detection