Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Xuewu Lin, Zixiang Pei, Tianwei Lin, Lichao Huang, Zhizhong Su
2023-11-20
Abstract: In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance ID during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we witnessed enhancements of 3.0%, 2.2%, and 7.6% in mAP, NDS, and AMOTA, achieving 46.9%, 56.1%, and 49.0%, respectively. Our best model achieved 71.9% NDS and 67.7% AMOTA on the nuScenes test set. Code will be released at https://github.com/linxuewu/Sparse4D.
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper aims to improve the performance of 3D detection and tracking in autonomous driving perception systems. Building on the Sparse4D framework, the authors propose the following changes:

1. **Two auxiliary training tasks**:
   - **Temporal Instance Denoising**: Noisy instances are added during training and the model learns to denoise them. This keeps the matching of positive samples stable and increases the number of positive samples, which improves convergence and detection performance.
   - **Quality Estimation**: Centerness and yawness are introduced as quality metrics so that the network can assess the quality of its predicted boxes, which accelerates convergence and improves the ranking of predictions.
2. **Decoupled Attention**: The self-attention and temporal cross-attention modules are restructured to combine features by concatenation instead of addition, which reduces feature interference and makes the attention weights more accurate.
3. **Extension to Multi-Object Tracking**: The Sparse4D framework is extended into an end-to-end tracking model that outputs object trajectories by assigning instance IDs directly during inference, with no additional data association or filtering steps.

These changes improve the performance of Sparse4D on the nuScenes benchmark in mAP, NDS, and AMOTA:

- With ResNet50 as the backbone, mAP, NDS, and AMOTA on the nuScenes validation set increase by 3.0%, 2.2%, and 7.6%, reaching 46.9%, 56.1%, and 49.0%, respectively.
- The best model reaches 71.9% NDS and 67.7% AMOTA on the nuScenes test set.
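
The summary names centerness and yawness as quality metrics but gives no formulas for them. As a rough illustration only, the sketch below assumes centerness is the exponential of the negative L2 distance between predicted and ground-truth box centers, and yawness is the dot product of the `[sin yaw, cos yaw]` heading vectors. The function names and these definitions are assumptions, not taken from the text above.

```python
import math

def centerness(pred_xyz, gt_xyz):
    """Assumed definition: exp(-L2 distance between box centers).

    Equals 1.0 for a perfect center match and decays toward 0 as the
    predicted center moves away from the ground truth.
    """
    return math.exp(-math.dist(pred_xyz, gt_xyz))

def yawness(pred_yaw, gt_yaw):
    """Assumed definition: dot product of [sin(yaw), cos(yaw)] vectors.

    This equals cos(pred_yaw - gt_yaw), so it lies in [-1, 1] and is
    largest when the two headings agree.
    """
    return (math.sin(pred_yaw) * math.sin(gt_yaw)
            + math.cos(pred_yaw) * math.cos(gt_yaw))

if __name__ == "__main__":
    # Hypothetical boxes: centers in meters, yaw in radians.
    print(centerness((1.0, 2.0, 0.5), (1.2, 2.0, 0.5)))
    print(yawness(0.10, 0.15))
```

Under these assumed definitions, both values are highest for well-localized, well-oriented boxes, so they could serve as regression targets that help rank predictions beyond the classification score alone.
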
With these improvements, Sparse4D v3 achieves higher performance in both detection and tracking and shows potential for practical applications.
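
The idea of turning the detector into a tracker by assigning instance IDs during inference can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the `Instance` and `IdAssigner` names, the confidence threshold of 0.25, and the use of a reused Python object to stand in for an instance carried from one frame to the next are all hypothetical.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional

@dataclass
class Instance:
    """Hypothetical stand-in for one query-based detection instance."""
    confidence: float
    track_id: Optional[int] = None  # None until the instance is first confirmed

class IdAssigner:
    """Assigns a fresh ID to each confident instance that has none yet."""

    def __init__(self, threshold: float = 0.25):  # threshold value is an assumption
        self.threshold = threshold
        self._next_id = count()

    def update(self, instances: List[Instance]) -> List[Instance]:
        """Return this frame's confident instances, labeling new ones."""
        tracked = []
        for inst in instances:
            if inst.confidence < self.threshold:
                continue  # too uncertain to report as a track
            if inst.track_id is None:
                inst.track_id = next(self._next_id)
            tracked.append(inst)
        return tracked

if __name__ == "__main__":
    assigner = IdAssigner()
    a, b = Instance(confidence=0.9), Instance(confidence=0.1)

    # Frame 1: only `a` is confident enough to receive an ID.
    print([i.track_id for i in assigner.update([a, b])])

    # Frame 2: both instances are carried over with updated confidences;
    # `a` keeps its ID and `b` receives a new one.
    a.confidence, b.confidence = 0.8, 0.6
    print([i.track_id for i in assigner.update([a, b])])
```

Because an instance keeps its identity as it is carried across frames, the ID travels with it, which is why no separate data association or filtering step appears in this sketch.
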