John Willes, Cody Reading, Steven L. Waslander
Abstract: 3D multi-object tracking (MOT) is a key problem for autonomous vehicles, required to perform well-informed motion planning in dynamic environments. Particularly for densely occupied scenes, associating existing tracks to new detections remains challenging as existing systems tend to omit critical contextual information. Our proposed solution, InterTrack, introduces the Interaction Transformer for 3D MOT to generate discriminative object representations for data association. We extract state and shape features for each track and detection, and efficiently aggregate global information via attention. We then perform a learned regression on each track/detection feature pair to estimate affinities, and use a robust two-stage data association and track management approach to produce the final tracks. We validate our approach on the nuScenes 3D MOT benchmark, where we observe significant improvements, particularly on classes with small physical sizes and clustered objects. As of submission, InterTrack ranks 1st in overall AMOTA among methods using CenterPoint detections.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper addresses data association in 3D multi-object tracking (3D MOT), particularly in densely occupied scenes, where existing systems tend to omit contextual information when associating existing tracks with new detections. It proposes InterTrack, which introduces an Interaction Transformer to generate more discriminative object representations and thereby improve the accuracy of data association.
### Background and Motivation
3D multi-object tracking is a key task for autonomous vehicles and robotics, since it lets a system understand dynamic surroundings and plan its motion accordingly. Most approaches follow the tracking-by-detection paradigm, in which the tracker takes independently generated 3D detections as input. In each frame, existing tracks are matched to new detections by estimating a track/detection affinity matrix and pairing the entries with high affinity. An incorrect association can cause an identity switch or a false track initialization, which in turn affects decision-making in subsequent frames.
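The matching step described above can be sketched as follows. This is a minimal illustration rather than InterTrack's two-stage association: it assumes an affinity matrix has already been computed, with tracks as rows and detections as columns, and the function name `greedy_match` and the `min_affinity` threshold are hypothetical.

```python
from typing import List, Tuple


def greedy_match(
    affinity: List[List[float]], min_affinity: float = 0.5
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Pair tracks (rows) with detections (columns) in descending affinity order.

    Returns (matches, unmatched_tracks, unmatched_detections).
    `min_affinity` is an assumed cutoff below which a pair is never matched.
    """
    n_tracks = len(affinity)
    n_dets = len(affinity[0]) if affinity else 0

    # Collect every pair that clears the threshold, best first.
    candidates = sorted(
        (
            (affinity[t][d], t, d)
            for t in range(n_tracks)
            for d in range(n_dets)
            if affinity[t][d] >= min_affinity
        ),
        key=lambda c: -c[0],
    )

    used_tracks, used_dets, matches = set(), set(), []
    for _, t, d in candidates:
        # Each track and each detection may be used at most once.
        if t in used_tracks or d in used_dets:
            continue
        matches.append((t, d))
        used_tracks.add(t)
        used_dets.add(d)

    unmatched_tracks = [t for t in range(n_tracks) if t not in used_tracks]
    unmatched_dets = [d for d in range(n_dets) if d not in used_dets]
    return matches, unmatched_tracks, unmatched_dets


# Two tracks, three detections: the third detection has no good partner.
example = [[0.9, 0.2, 0.1],
           [0.3, 0.8, 0.4]]
print(greedy_match(example))
```

In a tracker of this kind, unmatched detections would typically seed new tracks and unmatched tracks would be aged out. When two nearby objects produce nearly equal affinities, a matcher like this has little to go on, which is the failure mode the paper targets.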
### Limitations of Existing Methods
1. **High Feature Similarity**: Existing methods typically extract features for each track and detection independently, so densely clustered objects end up with very similar features, making it hard to tell correct matching hypotheses from incorrect ones.
2. **Lack of Global Information**: Most existing methods do not aggregate global information, which leaves their features non-discriminative.
3. **Duplicate Track Problem**: Many existing methods cannot filter out duplicate tracks, which adds false positives and degrades performance.
### Solution
To overcome these issues, the paper proposes InterTrack, whose main contributions are:
1. **Interaction Transformer**: A Transformer that models spatiotemporal interactions among all object pairs, using attention to keep the computation efficient. Modeling every interaction makes the features of all object combinations more discriminative, which improves overall tracking performance.
2. **Affinity Estimation Pipeline**: An end-to-end method for estimating track/detection affinities. Using detections and LiDAR point clouds, it extracts full state and shape information for each track and detection, then aggregates contextual information through the Interaction Transformer. An affinity score is regressed from each track/detection feature pair.
3. **Track Rejection Module**: A duplicate track rejection strategy that reduces false positives by removing tracks that overlap beyond a specified 3D Intersection over Union (IoU) threshold.
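The duplicate rejection idea in the last item can be illustrated with a short sketch. It makes two simplifying assumptions that do not come from the paper: boxes are axis-aligned (real 3D boxes also carry a heading angle), and when two tracks overlap too much the one with the higher confidence score is kept. The names `iou_3d_axis_aligned` and `reject_duplicates` and the default threshold are hypothetical.

```python
from typing import List, Sequence, Tuple

# A box is (cx, cy, cz, length, width, height); heading is ignored here.
Box = Tuple[float, float, float, float, float, float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """3D IoU of two axis-aligned boxes given as centre plus size."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        inter *= max(0.0, hi - lo)  # zero overlap on any axis means no intersection
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def reject_duplicates(
    boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5
) -> List[int]:
    """Return indices of tracks to keep, dropping lower-scored overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept: List[int] = []
    for i in order:
        # Keep a track only if it does not overlap any already-kept track too much.
        if all(iou_3d_axis_aligned(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return sorted(kept)


boxes = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),   # track 0
    (0.1, 0.0, 0.0, 2.0, 2.0, 2.0),   # track 1, nearly on top of track 0
    (10.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # track 2, far away
]
scores = [0.9, 0.6, 0.5]
print(reject_duplicates(boxes, scores))  # prints [0, 2]
```

The procedure has the same shape as non-maximum suppression applied to tracks rather than detections. Handling rotated boxes would require a polygon intersection in the ground plane, which is omitted here.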
### Experimental Results
On the nuScenes 3D MOT benchmark, InterTrack shows significant improvements, especially on classes with small physical sizes and clustered objects. As of submission, it ranks first in overall AMOTA among methods using CenterPoint detections, with improvements of 2.20%, 4.70%, and 2.60% in the bicycle, motorcycle, and pedestrian categories, respectively.
### Conclusion
InterTrack improves data association in 3D multi-object tracking, particularly in densely occupied scenes, by combining the Interaction Transformer with a duplicate track rejection module. The nuScenes results support the effectiveness of this approach.