Camera Perspective Transformation to Bird's Eye View via Spatial Transformer Model for Road Intersection Monitoring

Rukesh Prajapati, Amr S. El-Wakeel
2024-08-14
Abstract: Road intersection monitoring and control research often utilizes bird's eye view (BEV) simulators. In real traffic settings, achieving a BEV akin to that in a simulator necessitates the deployment of drones or specific sensor mounting, which is neither feasible nor practical. Consequently, traffic intersection management remains confined to simulation environments given these constraints. In this paper, we address the gap between simulated environments and real-world implementation by introducing a novel deep-learning model that converts a single camera's perspective of a road intersection into a BEV. We created a simulation environment that closely resembles a real-world traffic junction. The proposed model transforms the vehicles into BEV images, facilitating road intersection monitoring and control model processing. Inspired by image transformation techniques, we propose a Spatial-Transformer Double Decoder-UNet (SDD-UNet) model that aims to eliminate the transformed image distortions. In addition, the model accurately estimates the vehicles' positions and enables the direct application of simulation-trained models in real-world contexts. The SDD-UNet model achieves an average dice similarity coefficient (DSC) above 95%, which is 40% better than the original UNet model. The mean absolute error (MAE) is 0.102 and the centroid of the predicted mask is displaced by 0.14 meters on average, indicating high accuracy.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of converting a single camera's view of a road intersection into a bird's eye view (BEV) so that it can be used for real-world traffic monitoring and control. Specifically, the authors aim to bridge the gap between simulated environments and real-world implementation by introducing a new deep-learning model that performs this single-camera-to-BEV conversion.

### Problem Background

1. **Existing Challenges**:
   - In real traffic settings, obtaining a bird's eye view usually requires drones or specific sensor mounting, which is neither feasible nor practical.
   - Most existing traffic management and monitoring research relies on bird's eye views from simulators and is difficult to apply directly to the real world.
2. **Objectives**:
   - Propose a method that converts road-intersection images captured by a single camera into a bird's eye view, enabling more effective traffic monitoring and management.
   - Make models trained in the simulated environment directly applicable to real scenarios, improving their practicality and accuracy.

### Solution

The authors propose a new model named Spatial-Transformer Double Decoder-UNet (SDD-UNet), which can:

- **Change the Viewpoint**: Convert the view of a single camera into a bird's eye view.
- **Eliminate Distortion**: Reduce the image distortion introduced during the conversion and keep the estimated vehicle positions accurate.
- **Achieve High Precision**: Reach an average Dice Similarity Coefficient (DSC) above 95% on the test set, a Mean Absolute Error (MAE) of 0.102, and an average offset of only 0.14 meters between the centroid of the predicted mask and the ground truth.

### Method Overview

1. **Data Collection**:
   - Use a 3D simulation environment to generate training data, including road-intersection images taken from different angles and their corresponding bird's eye views.
2. **Model Architecture**:
   - SDD-UNet consists of an encoder and two decoder branches. The first decoder branch is responsible for detecting vehicles and determining their positions, and the second decoder branch uses the Spatial Transformer for viewpoint conversion.
3. **Loss Function**:
   - Use the negative Dice Similarity Coefficient as the loss function, so that minimizing the loss maximizes the overlap between predicted and ground-truth masks.
4. **Evaluation Metrics**:
   - Use the Dice Similarity Coefficient, the Mean Absolute Error (MAE), and the pixel distance between the centroid of the predicted mask and the ground truth to evaluate model performance.

### Results

The experimental results show that the SDD-UNet model outperforms the original UNet and other improved variants at converting single-camera road-intersection images into bird's eye views. According to the abstract, its average DSC exceeds 95%, which is 40% better than the original UNet model.

With these contributions, the paper offers a way to convert a single-camera view into a bird's eye view and thereby a new option for real-world traffic management and monitoring.
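
The loss function and evaluation metrics named under Method Overview can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes masks are small 2D grids of values in [0, 1], that MAE is computed per pixel, and that a hypothetical `meters_per_pixel` scale converts the centroid offset from pixels to meters. All function names are illustrative.

```python
from math import hypot
from typing import List, Tuple

Mask = List[List[float]]  # rows of pixel values in [0, 1] (assumed layout)


def dice_coefficient(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Soft Dice similarity coefficient: 2|A∩B| / (|A| + |B|)."""
    intersection = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # eps avoids division by zero when both masks are empty.
    return (2.0 * intersection + eps) / (total + eps)


def negative_dice_loss(pred: Mask, target: Mask) -> float:
    """Loss that is minimized when the Dice overlap is maximal."""
    return -dice_coefficient(pred, target)


def mean_absolute_error(pred: Mask, target: Mask) -> float:
    """Per-pixel MAE (an assumption; the paper may define it differently)."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def centroid(mask: Mask) -> Tuple[float, float]:
    """Mass-weighted centroid of a mask as (row, col)."""
    mass = sum(map(sum, mask))
    if mass == 0:
        raise ValueError("an empty mask has no centroid")
    row = sum(r * v for r, line in enumerate(mask) for v in line) / mass
    col = sum(c * v for line in mask for c, v in enumerate(line)) / mass
    return row, col


def centroid_displacement(pred: Mask, target: Mask, meters_per_pixel: float = 1.0) -> float:
    """Euclidean distance between centroids, scaled by a hypothetical pixel size."""
    (pr, pc), (tr, tc) = centroid(pred), centroid(target)
    return hypot(pr - tr, pc - tc) * meters_per_pixel


# Toy example: a 2x2 "vehicle" predicted one pixel to the right of the truth.
target = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]

print(round(dice_coefficient(pred, target), 3))  # 0.5
print(mean_absolute_error(pred, target))         # 0.25
print(centroid_displacement(pred, target, 0.1))  # 0.1
```

In the toy example, half of each mask overlaps, so the Dice score is 0.5 and the negative Dice loss is -0.5; a perfect prediction would give a loss of -1.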