Abstract: Road intersection monitoring and control research often utilizes bird's eye view (BEV) simulators. In real traffic settings, achieving a BEV akin to that in a simulator necessitates the deployment of drones or specific sensor mounting, which is neither feasible nor practical. Consequently, traffic intersection management remains confined to simulation environments. In this paper, we address the gap between simulated environments and real-world implementation by introducing a novel deep-learning model that converts a single camera's perspective of a road intersection into a BEV. We created a simulation environment that closely resembles a real-world traffic junction. The proposed model transforms the vehicles into BEV images, facilitating road intersection monitoring and control model processing. Inspired by image transformation techniques, we propose a Spatial-Transformer Double Decoder-UNet (SDD-UNet) model that aims to eliminate distortions in the transformed image. In addition, the model accurately estimates the vehicles' positions and enables the direct application of simulation-trained models in real-world contexts. The SDD-UNet model achieves an average Dice similarity coefficient (DSC) above 95%, which is 40% better than the original UNet model. The mean absolute error (MAE) is 0.102, and the centroid of the predicted mask is displaced by 0.14 meters on average, indicating high accuracy.

What problem does this paper attempt to address?

The main problem this paper addresses is converting the view of a road intersection captured by a single camera into a bird's-eye view (BEV) for use in real-world traffic monitoring and control. Specifically, the authors aim to bridge the gap between simulated environments and real-world implementation by introducing a new deep-learning model that performs the single-camera-to-BEV conversion.

### Problem Background

1. **Existing Challenges**:
   - In a real traffic environment, obtaining a bird's-eye view usually requires drones or specific sensor mounting, which is neither feasible nor practical.
   - Most existing traffic management and monitoring research relies on bird's-eye views from simulated environments and is difficult to apply directly to the real world.
2. **Objectives**:
   - Propose a method that converts road-intersection images captured by a single camera into a bird's-eye view, enabling more effective traffic monitoring and management.
   - Make models trained in the simulated environment directly applicable to real scenarios, improving their practicality and accuracy.

### Solution

The authors propose a new model named Spatial-Transformer Double Decoder-UNet (SDD-UNet), which can:

- **Change the Viewpoint**: Convert the view of a single camera into a bird's-eye view.
- **Eliminate Distortion**: Reduce the image distortion generated during the conversion and ensure accurate estimation of vehicle positions.
- **High Precision**: The model achieves an average Dice Similarity Coefficient (DSC) of over 95% on the test set, a Mean Absolute Error (MAE) of 0.102, and an average offset of only 0.14 meters between the centroid of the predicted mask and the ground truth.

### Method Overview

1. **Data Collection**:
   - Use a 3D simulation environment to generate training data, including road-intersection images taken from different angles and their corresponding bird's-eye views.
2. **Model Architecture**:
   - SDD-UNet consists of an encoder and two decoder branches. The first decoder branch is responsible for detecting vehicles and determining their positions, and the second decoder branch uses the Spatial Transformer for viewpoint conversion.
3. **Loss Function**:
   - Use the negative Dice Similarity Coefficient as the loss function, so that minimizing the loss maximizes the overlap between predicted and ground-truth masks.
4. **Evaluation Metrics**:
   - Use the Dice Similarity Coefficient, the Mean Absolute Error (MAE), and the pixel distance between the centroid of the predicted mask and the ground truth to evaluate model performance.

### Results

The experimental results show that the SDD-UNet model significantly outperforms the original UNet and other improved variants when converting single-camera road-intersection images into bird's-eye views, with high accuracy and robustness.

With these contributions, the paper addresses the key technical problem of converting a single-camera view into a bird's-eye view, providing a new solution for real-world traffic management and monitoring.
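
The loss and evaluation metrics named in the summary can be illustrated with a small, dependency-free sketch. It is not the authors' code: the function names, the smoothing term `eps`, and the `metres_per_pixel` scale are assumptions made for illustration, and the summary does not state the actual map scale. Masks are plain nested lists of 0/1 values here; in training, the same Dice formula would be applied to soft predictions inside a deep-learning framework so that it stays differentiable.

```python
import math

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient between two same-shaped masks."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # eps (an assumed smoothing term) avoids 0/0 when both masks are empty
    return (2.0 * inter + eps) / (total + eps)

def dice_loss(pred, target):
    """Negative Dice: minimising this maximises mask overlap."""
    return -dice(pred, target)

def mae(pred, target):
    """Mean absolute error over all pixels."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)

def centroid(mask):
    """(row, col) centre of mass of a mask, or None if the mask is empty."""
    mass = sum(map(sum, mask))
    if mass == 0:
        return None
    row = sum(i * v for i, r in enumerate(mask) for v in r) / mass
    col = sum(j * v for r in mask for j, v in enumerate(r)) / mass
    return row, col

def centroid_offset_m(pred, target, metres_per_pixel):
    """Centroid displacement in metres; metres_per_pixel is a hypothetical scale."""
    cp, ct = centroid(pred), centroid(target)
    if cp is None or ct is None:
        return None
    return math.hypot(cp[0] - ct[0], cp[1] - ct[1]) * metres_per_pixel

# Toy example: a 2x2 "vehicle" predicted one pixel to the right of the truth.
target = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0]]

print(dice(pred, target))                    # about 0.5 (half the pixels overlap)
print(mae(pred, target))                     # 0.25 (4 of 16 pixels differ)
print(centroid_offset_m(pred, target, 0.1))  # 1 pixel shift at an assumed 0.1 m/pixel
```

In this toy case a one-pixel shift halves the Dice score while the centroid offset stays small, which is one reason to report both an overlap metric and a positional metric when the goal is vehicle localization.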

Camera Perspective Transformation to Bird's Eye View via Spatial Transformer Model for Road Intersection Monitoring

A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird's Eye View

Deep Perspective Transformation Based Vehicle Localization on Bird's Eye View

Predicting Maps Using In-Vehicle Cameras for Data-Driven Intelligent Transport

FedBEVT: Federated Learning Bird's Eye View Perception Transformer in Road Traffic Systems

RoadBEV: Road Surface Reconstruction in Bird's Eye View

Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

Understanding Bird's-Eye View of Road Semantics using an Onboard Camera

WidthFormer: Toward Efficient Transformer-based BEV View Transformation

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Monocular BEV Perception of Road Scenes Via Front-to-Top View Projection

Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View

Monocular Plan View Networks for Autonomous Driving

Improved Single Camera BEV Perception Using Multi-Camera Training

Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving

UAP-BEV: Uncertainty Aware Planning using Bird's Eye View generated from Surround Monocular Images

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View

DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

UniDrive: Towards Universal Driving Perception Across Camera Configurations