Abstract: In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance ID during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we witnessed enhancements of 3.0%, 2.2%, and 7.6% in mAP, NDS, and AMOTA, achieving 46.9%, 56.1%, and 49.0%, respectively. Our best model achieved 71.9% NDS and 67.7% AMOTA on the nuScenes test set. Code will be released at https://github.com/linxuewu/Sparse4D.

What problem does this paper attempt to address?

The problem this paper addresses is improving the performance of 3D detection and tracking in autonomous driving perception systems. Specifically, the authors build on the Sparse4D framework and propose the following innovations:

1. **Two auxiliary training tasks**:
   - **Temporal Instance Denoising**: Noisy instances are added during training and the network learns to denoise them. This provides stable positive-sample matching and increases the number of positive samples, improving convergence and detection performance.
   - **Quality Estimation**: Centerness and yawness are introduced as quality metrics so that the network can assess the quality of its predicted boxes, which accelerates convergence and improves the ranking of predictions.
2. **Decoupled Attention**: The self-attention and temporal cross-attention modules are restructured to combine features by concatenation instead of addition, which reduces feature interference and makes the attention-weight computation more accurate.
3. **Extension to Multi-Object Tracking**: The Sparse4D framework is extended into an end-to-end tracking model that outputs object trajectories by directly assigning instance IDs during inference, with no additional data association or filtering steps.

These improvements raise the performance of the Sparse4D framework on the nuScenes benchmark, notably in the key metrics mAP, NDS, and AMOTA:

- With ResNet50 as the backbone, mAP, NDS, and AMOTA on the nuScenes validation set increase by 3.0%, 2.2%, and 7.6%, reaching 46.9%, 56.1%, and 49.0%, respectively.
- The best model reaches 71.9% NDS and 67.7% AMOTA on the nuScenes test set.

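
To make the two quality metrics concrete, here is a minimal NumPy sketch. The summary above does not give formulas, so the definitions used here are assumptions: centerness as the exponential of the negative L2 distance between predicted and ground-truth box centers, and yawness as the dot product of the `[sin, cos]` heading vectors. The function names `centerness` and `yawness` are illustrative and not taken from the released code.

```python
import numpy as np


def centerness(pred_xyz, gt_xyz):
    """Assumed definition: exp(-||pred - gt||_2).

    Equals 1.0 for a perfectly centered box and decays toward 0
    as the predicted center moves away from the ground truth.
    """
    pred_xyz = np.asarray(pred_xyz, dtype=float)
    gt_xyz = np.asarray(gt_xyz, dtype=float)
    return np.exp(-np.linalg.norm(pred_xyz - gt_xyz, axis=-1))


def yawness(pred_yaw, gt_yaw):
    """Assumed definition: [sin, cos](pred) . [sin, cos](gt).

    This equals cos(pred_yaw - gt_yaw), so it lies in [-1, 1]:
    1 for matching headings, -1 for opposite headings.
    """
    pred_yaw = np.asarray(pred_yaw, dtype=float)
    gt_yaw = np.asarray(gt_yaw, dtype=float)
    return np.sin(pred_yaw) * np.sin(gt_yaw) + np.cos(pred_yaw) * np.cos(gt_yaw)
```

Under these assumptions, both values would serve as regression targets for an extra quality head during training, and the predicted values could then be used to re-rank detections at inference.
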
With these improvements, Sparse4D v3 achieves higher performance in both detection and tracking and shows potential for practical applications.
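
The ID-assignment idea behind the tracking extension can be sketched in a few lines of Python. This is a simplified illustration and not the authors' implementation: the `Instance` class, the `assign_ids` helper, and the threshold value of 0.25 are all hypothetical. The sketch assumes that an instance receives a fresh ID the first time its confidence reaches the threshold, and that the ID then travels with the instance as it is propagated to later frames.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for one query/instance carried across frames."""
    confidence: float
    track_id: Optional[int] = None  # None until the instance locks onto a target


def assign_ids(instances: List[Instance],
               id_source: Iterator[int],
               threshold: float = 0.25) -> List[Instance]:
    """Give a new ID to every confident instance that does not have one yet.

    Instances that already carry an ID keep it, so no separate
    data-association step is needed between frames.
    The default threshold is an illustrative value, not one from the paper.
    """
    for inst in instances:
        if inst.track_id is None and inst.confidence >= threshold:
            inst.track_id = next(id_source)
    return instances


# Usage: one shared counter hands out IDs for the whole sequence.
ids = count(0)
frame = [Instance(confidence=0.9), Instance(confidence=0.1)]
assign_ids(frame, ids)  # first instance gets an ID, second stays unassigned
```

Because the ID is stored on the instance itself, the trajectory of an object is simply the sequence of boxes emitted by the same instance over time.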

Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Object tracking with 3D LIDAR via multi-task sparse learning

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

SSF: Sparse Point Cloud Object Detection Based on Self-Adaptive Voxel Encoding and Focal-Sparse Convolution

SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-based 3D Object Detection

SRCN3D: Sparse R-CNN 3D Surround-View Camera Object Detection and Tracking for Autonomous Driving

SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-Based 3-D Object Detection

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Real-Time 3D Object Detection From Point Cloud Through Foreground Segmentation

SRCN3D: Sparse R-CNN 3D for Compact Convolutional Multi-View 3D Object Detection and Tracking

Super Sparse 3D Object Detection

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

Fully Sparse Fusion for 3D Object Detection

SparseDet: Towards End-to-End 3D Object Detection

SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection

SparsePoint: Fully End-to-End Sparse 3D Object Detector

3D Multi-object Detection and Tracking with Sparse Stationary LiDAR

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Sparse2Dense: Learning to Densify 3D Features for 3D Object Detection

Monocular Quasi-Dense 3D Object Tracking