AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection

Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, Liangjun Zhang
DOI: https://doi.org/10.48550/arXiv.2108.11127
2021-08-25
Abstract: Existing deep learning-based approaches for monocular 3D object detection in autonomous driving often model the object as a rotated 3D cuboid while the object's geometric shape has been ignored. In this work, we propose an approach for incorporating the shape-aware 2D/3D constraints into the 3D detection framework. Specifically, we employ the deep neural network to learn distinguished 2D keypoints in the 2D image domain and regress their corresponding 3D coordinates in the local 3D object coordinate first. Then the 2D/3D geometric constraints are built by these correspondences for each object to boost the detection performance. For generating the ground truth of 2D/3D keypoints, an automatic model-fitting approach has been proposed by fitting the deformed 3D object model and the object mask in the 2D image. The proposed framework has been verified on the public KITTI dataset and the experimental results demonstrate that by using additional geometrical constraints the detection performance has been significantly improved as compared to the baseline method. More importantly, the proposed framework achieves state-of-the-art performance with real time. Data and code will be available at https://github.com/zongdai/AutoShape
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to improve the performance of monocular 3D object detection in autonomous driving. Existing deep learning-based monocular methods usually model an object as a rotated 3D cuboid and ignore its geometric shape, which limits detection performance. To address this, the paper introduces shape-aware 2D/3D constraints into the detection framework. Its specific contributions are:

1. **Shape-aware 3D object detection framework**: The framework uses geometric constraints between regressed 2D and 3D keypoints to enhance detection performance.
2. **Automatic model fitting method**: A method for automatically fitting 3D models is proposed. By aligning the deformed 3D object model with the object mask in the 2D image, it generates the ground-truth annotations of the 2D/3D keypoints.
3. **Real-time performance**: The method is verified on the public KITTI dataset, runs in real time (25 fps), and can be integrated into the perception module of an autonomous driving system.

### Specific problem description

- **Challenges in obtaining depth information**: The main challenge for monocular methods is obtaining accurate depth. Estimating depth from a single image is difficult, especially in the absence of prior information.
- **Limitations of existing methods**:
  - **Depth map method**: A pseudo-LiDAR point cloud can be reconstructed from the estimated depth map, but this approach has a heavy computational burden and requires two-stage processing.
  - **Direct regression method**: Methods such as SMOKE and RTM3D directly regress the 3D information of the object. Although they are efficient, they ignore the detailed shape of the object, which leaves the estimated position ambiguous.
  - **CAD model method**: This approach uses shape information, but manually labeling 3D shapes is very difficult and the label quality is hard to guarantee.

### Solutions

- **Shape-aware keypoint constraints**: Multiple prominent keypoints are defined on the 3D model, and the network learns their projected positions in the 2D image. These correspondences are used to construct 2D/3D geometric constraints.
- **Automatic annotation pipeline**: An automatic annotation pipeline generates the ground-truth 2D keypoints and 3D positions by optimizing the 2D and 3D reprojection errors.
- **End-to-end training**: The entire framework can be implemented as a neural network and trained end to end, which improves the robustness and detection performance of the model.

### Experimental results

- **Performance on the KITTI dataset**: With the additional geometric constraints, detection performance improves significantly over the baseline, especially on the easy and moderate difficulty levels.
- **Real-time performance**: The method runs at 25 fps, which is suitable for practical autonomous driving applications.

Through these improvements, the AutoShape method proposed in this paper achieves a significant performance gain on the monocular 3D object detection task while retaining real-time processing capability.
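
To make the idea of a 2D/3D keypoint constraint more concrete, here is a minimal sketch of how such correspondences can pin down an object's position. It is an illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera, a known yaw angle about the camera's vertical axis, and equally weighted keypoints, and it solves for the translation by ordinary linear least squares. The function names (`rotate_y`, `project`, `solve_3x3`, `estimate_translation`), the intrinsics, and the keypoint coordinates are all hypothetical values chosen for the example. Each correspondence contributes two linear equations in the unknown translation, obtained by multiplying the projection equations through by the depth.

```python
import math


def rotate_y(point, yaw):
    """Rotate a 3D point about the vertical (Y) axis by `yaw` radians."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    X, Y, Z = point_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)


def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= factor * a[col][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        residual = a[i][3] - sum(a[i][k] * x[k] for k in range(i + 1, 3))
        x[i] = residual / a[i][i]
    return x


def estimate_translation(kps_2d, kps_3d_local, yaw, fx, fy, cx, cy):
    """Least-squares translation from 2D/3D keypoint correspondences.

    For a rotated local keypoint (xr, yr, zr) and translation (tx, ty, tz):
        fx * tx - (u - cx) * tz = (u - cx) * zr - fx * xr
        fy * ty - (v - cy) * tz = (v - cy) * zr - fy * yr
    The normal equations (A^T A) t = A^T b are accumulated and solved.
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for (u, v), p in zip(kps_2d, kps_3d_local):
        xr, yr, zr = rotate_y(p, yaw)
        rows = [
            ([fx, 0.0, -(u - cx)], (u - cx) * zr - fx * xr),
            ([0.0, fy, -(v - cy)], (v - cy) * zr - fy * yr),
        ]
        for row, b in rows:
            for i in range(3):
                atb[i] += row[i] * b
                for j in range(3):
                    ata[i][j] += row[i] * row[j]
    return solve_3x3(ata, atb)


# Synthetic check with illustrative (assumed) intrinsics and keypoints.
fx, fy, cx, cy = 721.5, 721.5, 609.6, 172.9
yaw = 0.3
true_translation = (1.5, 1.2, 20.0)
keypoints_local = [
    (0.8, 0.0, 2.0),
    (-0.8, 0.0, 2.0),
    (0.8, -1.5, -2.0),
    (-0.8, -1.5, -1.0),
    (0.0, -0.7, 0.3),
]

# Project the keypoints with the true pose to simulate network predictions.
keypoints_2d = []
for p in keypoints_local:
    xr, yr, zr = rotate_y(p, yaw)
    cam = (xr + true_translation[0], yr + true_translation[1], zr + true_translation[2])
    keypoints_2d.append(project(cam, fx, fy, cx, cy))

estimated = estimate_translation(keypoints_2d, keypoints_local, yaw, fx, fy, cx, cy)
print("estimated translation:", estimated)
```

With noise-free correspondences the recovered translation matches the one used to generate the projections. With real network predictions the keypoints would be noisy, so a practical system would likely weight each correspondence (for example by a predicted confidence) and use a linear-algebra library rather than a hand-written solver.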