DST3D: DLA-Swin Transformer for Single-Stage Monocular 3D Object Detection

Zhihong Wu, Xin Jiang, Ruidong Xu, Ke Lu, Yuan Zhu, Mingzhi Wu
DOI: https://doi.org/10.1109/IV51971.2022.9827462
2022-01-01
Abstract: Monocular 3D object detection is an essential task for infrastructure-less autonomous navigation and driving because of its low cost. Most previous state-of-the-art monocular 3D object detection methods depend on Convolutional Neural Networks (CNNs). We show that this reliance on CNNs is not necessary and that a Transformer-based method can also perform very well. In this paper, we present DST3D, a new and general framework based on the DLA-Swin Transformer (DST) that predicts a 3D bounding box for each object in an end-to-end fashion, without any pre-trained network for depth estimation. We propose an object-scale adaptive Gaussian kernel for generating the ground-truth keypoint heatmap, which associates the keypoint of each object with its size and helps to enhance the network’s performance. In addition, while regressing 3D variables, we introduce a double predictions dropout loss, which significantly improves both training consistency and detection accuracy. Together, these make our framework simple yet efficient. Compared with state-of-the-art CNN-based methods, our proposed DST3D network achieves comparable performance on the challenging KITTI benchmark at a faster speed, giving the top results on both 3D object detection and Bird’s Eye View evaluation.
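The abstract does not give the formula for the object-scale adaptive Gaussian kernel, so the following is only a minimal sketch of the general idea: a ground-truth keypoint heatmap whose Gaussian spread depends on the object's 2D box size. The function name `adaptive_gaussian_heatmap`, the linear size-to-sigma rule, and the `scale` and `min_sigma` parameters are all assumptions made for illustration, not the authors' definition.

```python
import math


def adaptive_gaussian_heatmap(height, width, center, box_size,
                              scale=0.1, min_sigma=1.0):
    """Render one keypoint as a 2D Gaussian whose spread follows the box size.

    `scale` and `min_sigma` are hypothetical tuning parameters, not values
    taken from the paper.
    """
    cx, cy = center
    box_w, box_h = box_size
    # Assumption: the standard deviation grows linearly with each box dimension.
    sigma_x = max(min_sigma, scale * box_w)
    sigma_y = max(min_sigma, scale * box_h)

    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            exponent = ((x - cx) ** 2) / (2 * sigma_x ** 2) \
                     + ((y - cy) ** 2) / (2 * sigma_y ** 2)
            row.append(math.exp(-exponent))
        heatmap.append(row)
    return heatmap


# Same keypoint location, two different object sizes (in pixels).
small = adaptive_gaussian_heatmap(32, 32, center=(16, 16), box_size=(10, 10))
large = adaptive_gaussian_heatmap(32, 32, center=(16, 16), box_size=(40, 40))
```

In this sketch both heatmaps peak at the keypoint, but the larger object produces a wider response around it, which is one way a heatmap target can tie a keypoint to the size of its object.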