Abstract: Most 3D single object trackers (SOT) in point clouds follow the two-stream multi-stage 3D Siamese or motion tracking paradigms, which process the template and search area point clouds with two parallel branches built on supervised point cloud backbones. In this work, beyond typical 3D Siamese or motion tracking, we propose a neat and compact one-stream transformer 3D SOT paradigm from a novel perspective, termed \textbf{EasyTrack}, which consists of three special designs: 1) A 3D point cloud tracking feature pre-training module is developed to exploit masked autoencoding for learning 3D point cloud tracking representations. 2) A unified 3D tracking feature learning and fusion network is proposed to simultaneously learn target-aware 3D features and extensively capture mutual correlation through the flexible self-attention mechanism. 3) A target location network in the dense bird's eye view (BEV) feature space is constructed for target classification and regression. Moreover, we develop an enhanced version named EasyTrack++, which designs the center points interaction (CPI) strategy to reduce the ambiguous targets caused by noisy point cloud background information. The proposed EasyTrack and EasyTrack++ set new state-of-the-art performance ($\textbf{18\%}$, $\textbf{40\%}$ and $\textbf{3\%}$ success gains) on KITTI, NuScenes, and Waymo while running at \textbf{52.6 fps} with few parameters (\textbf{1.3M}). The code will be available at
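The third design in the abstract localizes the target in a dense BEV feature space. As a rough illustration of what "dense BEV" means, the sketch below max-pools per-point features into a fixed 2D grid over the x-y plane. It is a generic projection written under assumptions, not the paper's localization network: the function name `points_to_bev`, the grid ranges, and the cell size are all hypothetical.

```python
from typing import List, Sequence, Tuple


def points_to_bev(
    points: Sequence[Tuple[float, float, float]],
    features: Sequence[Sequence[float]],
    x_range: Tuple[float, float] = (-4.0, 4.0),  # assumed search-area extent
    y_range: Tuple[float, float] = (-4.0, 4.0),  # assumed search-area extent
    cell: float = 1.0,                           # assumed cell size in metres
) -> List[List[List[float]]]:
    """Max-pool per-point features into a dense grid indexed as [row][col][channel]."""
    width = int(round((x_range[1] - x_range[0]) / cell))
    height = int(round((y_range[1] - y_range[0]) / cell))
    channels = len(features[0]) if features else 0

    # Empty cells stay at zero, which is what makes the grid "dense".
    grid = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    occupied = set()

    for (x, y, _z), feat in zip(points, features):
        col = int((x - x_range[0]) // cell)
        row = int((y - y_range[0]) // cell)
        if not (0 <= col < width and 0 <= row < height):
            continue  # point falls outside the BEV window
        if (row, col) in occupied:
            # Several points share this cell: keep the channel-wise maximum.
            grid[row][col] = [max(a, b) for a, b in zip(grid[row][col], feat)]
        else:
            grid[row][col] = list(feat)
            occupied.add((row, col))
    return grid


example_points = [(0.5, 0.5, 0.0), (0.7, 0.2, 1.0), (-3.5, 3.5, 0.0), (10.0, 0.0, 0.0)]
example_features = [[1.0, 2.0], [3.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
bev = points_to_bev(example_points, example_features)
```

In this toy input the first two points land in the same cell and are merged channel-wise, and the last point lies outside the window and is dropped. A classification and regression head would then operate on a grid like `bev`.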

What problem does this paper attempt to address?

The paper targets the inefficiency and complex feature fusion of current 3D single-object tracking (SOT) methods on point cloud data. Existing 3D SOT methods usually adopt the two-stream, multi-stage 3D Siamese or motion-tracking paradigm, which processes the template and search-area point clouds in two parallel branches and needs a complex 3D feature fusion module to transfer template information to the search area. Although these methods track well, they face the following challenges on sparse, unordered, and incomplete point clouds:

1. **Limited discrimination ability**: Separating the template and search-area points into parallel feature-learning branches limits discrimination, especially for non-rigid categories such as pedestrians.
2. **High computational cost**: Point cloud feature fusion is usually necessary, but effective feature matching is very difficult in extremely incomplete point clouds, and the complex fusion operations are computationally expensive.
3. **Environmental interference**: Point cloud data may be affected by environmental factors (such as illumination changes, noise, and occlusion), making it hard for the network to learn point-pair relationship patterns.

To address these problems, the paper proposes EasyTrack, a concise one-stream framework for single-object tracking in 3D point clouds. Its main contributions are:

- **Novel one-stream paradigm**: EasyTrack is a concise one-stream paradigm without any auxiliary networks or tricks, running at 52.6 fps with only 1.30M parameters.
- **New point cloud pre-training technique**: A point cloud pre-training technique for 3D SOT is developed, and detailed ablation experiments demonstrate its benefit in the one-stream 3D SOT framework.
- **Unified 3D tracking feature-learning and interaction module**: A purpose-built module generates target-aware point features through a single-branch backbone network.
- **Enhanced version EasyTrack++**: Building on EasyTrack, EasyTrack++ applies the center-point interaction strategy to reduce the noise caused by background points in the global interaction stage and improve tracking performance.

In summary, the paper addresses the efficiency and complexity problems of existing 3D SOT methods on point cloud data with an efficient one-stream framework, and reports strong tracking performance on KITTI, NuScenes, and Waymo.
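The contrast between two-stream and one-stream tracking can be made concrete with a small sketch. Below, template and search tokens are concatenated into one sequence and passed through a single self-attention step, so feature learning and template-to-search fusion happen in the same operation. This is a minimal illustration under assumptions, not the EasyTrack network: it uses identity query, key, and value projections instead of learned weights, a single head, and the hypothetical name `joint_self_attention`.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(
    template: List[Vector], search: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """One scaled dot-product self-attention step over the concatenated tokens.

    Assumption: queries, keys, and values are the raw tokens (no learned
    projections), which keeps the sketch short.
    """
    tokens = template + search  # one stream: a single shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Every token attends to template AND search tokens in the same pass.
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs[: len(template)], outputs[len(template):]


template_tokens = [[1.0, 0.0], [0.0, 1.0]]
search_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
template_out, search_out = joint_self_attention(template_tokens, search_tokens)
```

Because each search token's output is a weighted mix that includes template tokens, no separate fusion module is needed afterwards, which is the property the summary attributes to the one-stream design.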

EasyTrack: Efficient and Compact One-stream 3D Point Clouds Tracker
