Abstract: 3D multi-object tracking (MOT) is a key problem for autonomous vehicles, required to perform well-informed motion planning in dynamic environments. Particularly for densely occupied scenes, associating existing tracks to new detections remains challenging as existing systems tend to omit critical contextual information. Our proposed solution, InterTrack, introduces the Interaction Transformer for 3D MOT to generate discriminative object representations for data association. We extract state and shape features for each track and detection, and efficiently aggregate global information via attention. We then perform a learned regression on each track/detection feature pair to estimate affinities, and use a robust two-stage data association and track management approach to produce the final tracks. We validate our approach on the nuScenes 3D MOT benchmark, where we observe significant improvements, particularly on classes with small physical sizes and clustered objects. As of submission, InterTrack ranks 1st in overall AMOTA among methods using CenterPoint detections.
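To make the association step concrete, the following is a minimal sketch of turning a track/detection affinity matrix into matches. It is a simple single-stage greedy matcher, swapped in for illustration, and not the two-stage procedure the abstract mentions. The function name `greedy_associate`, the `min_affinity` threshold, and the example affinity values are all assumptions.

```python
from typing import Dict, List, Tuple


def greedy_associate(
    affinity: List[List[float]], min_affinity: float = 0.5
) -> Tuple[Dict[int, int], List[int], List[int]]:
    """Match tracks (rows) to detections (columns), highest affinity first.

    Hypothetical helper: a greedy stand-in for a learned, multi-stage
    association scheme.
    """
    num_tracks = len(affinity)
    num_dets = len(affinity[0]) if affinity else 0

    # Collect every candidate pair whose affinity clears the threshold.
    candidates = [
        (affinity[t][d], t, d)
        for t in range(num_tracks)
        for d in range(num_dets)
        if affinity[t][d] >= min_affinity
    ]
    # Highest affinity first; ties are broken by lower track/detection index.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    matches: Dict[int, int] = {}
    used_dets = set()
    for _, t, d in candidates:
        # Each track and each detection may be matched at most once.
        if t in matches or d in used_dets:
            continue
        matches[t] = d
        used_dets.add(d)

    unmatched_tracks = [t for t in range(num_tracks) if t not in matches]
    unmatched_dets = [d for d in range(num_dets) if d not in used_dets]
    return matches, unmatched_tracks, unmatched_dets


# Made-up affinities for three tracks and three detections.
affinity = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.0],
    [0.1, 0.3, 0.2],
]
matches, lost_tracks, new_dets = greedy_associate(affinity)
```

In this toy case, tracks 0 and 1 both prefer detection 0. The greedy pass gives it to the higher-affinity track and falls back to detection 1 for the other, which illustrates why affinities need to be discriminative when objects are clustered. Unmatched tracks and detections would then go to track management.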

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses data association in 3D Multi-Object Tracking (3D MOT), particularly in dense scenes where existing systems often overlook important contextual information when associating existing tracks with new detections. Specifically, the paper proposes a method called InterTrack, which introduces an Interaction Transformer to generate more discriminative object representations, thereby improving the accuracy of data association.

### Background and Motivation

3D Multi-Object Tracking is a crucial task in autonomous driving and robotics, enabling systems to understand their environment and respond accordingly. Common approaches follow the tracking-by-detection paradigm, where the tracker consumes independently generated 3D detections as input. In each frame, existing tracks are matched with new detections by estimating a track/detection affinity matrix and matching high-affinity pairs. However, incorrect associations can lead to identity switches and false track initialization, affecting decision-making in subsequent frames.

### Limitations of Existing Methods

1. **High Feature Similarity**: Existing methods typically extract features for each track and detection independently, leading to high feature similarity among densely clustered objects and making it difficult to distinguish correct from incorrect matching hypotheses.
2. **Lack of Global Information**: Most existing methods do not aggregate global information, resulting in non-discriminative features.
3. **Duplicate Track Problem**: Many existing methods cannot filter out duplicate tracks, leading to additional false positives and degraded performance.

### Solution

To overcome the above issues, the paper proposes InterTrack, with the following main contributions:

1. **Interaction Transformer**: An Interaction Transformer that uses the Transformer to model spatiotemporal interactions among all object pairs, maintaining high computational efficiency through the attention mechanism. Comprehensive interaction modeling makes features more discriminative across all object combinations, thereby improving overall tracking performance.
2. **Affinity Estimation Pipeline**: An end-to-end method to estimate track/detection affinity. Using detections and LiDAR point clouds, it extracts complete state and shape information for each track and detection, and aggregates contextual information through the Interaction Transformer. Each track/detection feature pair is used to regress an affinity score.
3. **Track Rejection Module**: A duplicate track rejection strategy that reduces false positives by removing tracks that overlap beyond a specified 3D Intersection over Union (IoU) threshold.

### Experimental Results

InterTrack performs strongly on the nuScenes 3D MOT benchmark, with the largest gains on classes with small physical sizes and clustered objects. Among all methods using CenterPoint detections, InterTrack ranks first in overall AMOTA, with improvements of 2.20%, 4.70%, and 2.60% in the bicycle, motorcycle, and pedestrian categories, respectively.

### Conclusion

InterTrack addresses key issues in 3D Multi-Object Tracking by introducing the Interaction Transformer and the track rejection module, particularly improving data association accuracy in dense scenes. The nuScenes results support the effectiveness of the method.
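The duplicate rejection strategy described above can be illustrated with a minimal sketch. It makes two simplifying assumptions that are not taken from the paper: boxes are axis-aligned (real 3D boxes usually carry a heading angle), and when two boxes overlap beyond the threshold, the one with the lower confidence score is dropped. The names `iou_3d`, `reject_duplicates`, and the threshold value are hypothetical.

```python
from typing import List, Tuple

# Axis-aligned box (an assumption): (x_min, y_min, z_min, x_max, y_max, z_max).
Box = Tuple[float, float, float, float, float, float]


def volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have zero volume."""
    return (
        max(0.0, box[3] - box[0])
        * max(0.0, box[4] - box[1])
        * max(0.0, box[5] - box[2])
    )


def iou_3d(a: Box, b: Box) -> float:
    """3D Intersection over Union of two axis-aligned boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # No overlap along this axis, so no intersection.
        inter *= hi - lo
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0.0 else 0.0


def reject_duplicates(
    boxes: List[Box], scores: List[float], iou_threshold: float = 0.5
) -> List[int]:
    """Return indices of tracks to keep.

    Tracks are visited from highest to lowest score (an assumed tie-break
    rule); a track is dropped if it overlaps an already kept track by more
    than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept: List[int] = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return sorted(kept)


# Made-up example: tracks 0 and 1 cover nearly the same object.
boxes = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),
    (0.2, 0.0, 0.0, 2.2, 2.0, 2.0),
    (5.0, 5.0, 0.0, 7.0, 7.0, 2.0),
]
scores = [0.9, 0.6, 0.8]
kept = reject_duplicates(boxes, scores)
```

Here tracks 0 and 1 overlap with an IoU of roughly 0.82, so the lower-scoring track 1 is removed while the distant track 2 is kept.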

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

Exploit the Connectivity: Multi-Object Tracking with TrackletNet

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

TransLink: Transformer-Based Embedding for Tracklets’ Global Link

Object-Level Pseudo-3D Lifting for Distance-Aware Tracking

MAT: Motion-Aware Multi-Object Tracking

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking

Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking from View Aggregation

TLtrack: Combining Transformers and a Linear Model for Robust Multi-Object Tracking

HSTrack: Bootstrap End-to-End Multi-Camera 3D Multi-object Tracking with Hybrid Supervision

TrackFormer: Multi-Object Tracking with Transformers

MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving

PF-MOT: Probability Fusion Based 3D Multi-Object Tracking for Autonomous Vehicles

FastTrackTr: Towards Fast Multi-Object Tracking with Transformers

RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework

Transformer-Based Multiple-Object Tracking via Anchor-Based-Query and Template Matching

DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking

ByteTrackV2: 2D and 3D Multi-Object Tracking by Associating Every Detection Box

ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association

An IMM-Enabled Adaptive 3D Multi-Object Tracker for Autonomous Driving