Abstract: Single object tracking (SOT) in light detection and ranging (LiDAR) point clouds is a challenging problem in computer vision. Compared to object-level point clouds, scene-level point clouds for tracking are more complex, requiring both long-range semantic awareness and local shape context. However, previous methods directly filter candidates using limited matched features without systematically considering these two factors. Inspired by the ability of transformers to establish long-range dependencies and of convolutions to capture local high-frequency information, we propose a point-tracking inception transformer (PTIT), which efficiently predicts high-quality 3-D tracking results in a coarse-to-fine manner with the support of spatio-temporal point clouds. PTIT consists of three novel designs. 1) We design instance-guided sampling (IGS) to help identify and preserve the relevant points of the given template and the foreground points of the search area. 2) We propose a point inception transformer (PIT), which consists of a multifrequency attention module and a cross-attention module, where the former captures both long-range dependencies and local detail and the latter matches template and search area features. 3) After generating coarse tracking results from cross-attention, we locate the target by motion transformation in the spatio-temporal point cloud to generate a fine-grained 3-D bounding box (BBox). In addition, we perform feature augmentation on the points and boxes to mitigate the negative effects of the missing texture and incompleteness of LiDAR point clouds. PTIT performs significantly better than previous state-of-the-art methods on the KITTI and nuScenes datasets. Our further analysis confirms the effectiveness of each component and shows the great potential of the inception transformer-centric paradigm when combined with spatio-temporal point clouds. Our code is available at https://github.com/ywu0912/TeamCode.git.
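To make the template-to-search matching step concrete, the following is a minimal sketch of scaled dot-product cross-attention, in which each search-area point queries the template features. It is not the PTIT implementation: the function names `softmax` and `cross_attention`, the toy 2-D features, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search_feats, template_feats):
    """Fuse template features into each search-area point.

    Queries come from the search area; keys and values come from the
    template. Learned projections are omitted (identity mapping), which
    is a simplification of this sketch, not the paper's design.
    """
    dim = len(template_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in search_feats:
        # Scaled dot-product similarity between one search point
        # and every template point.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in template_feats
        ]
        weights = softmax(scores)
        # Attention-weighted sum of the template features (the values).
        fused.append([
            sum(w * value[d] for w, value in zip(weights, template_feats))
            for d in range(dim)
        ])
    return fused


# Hypothetical toy features: 3 search points, 2 template points, 2 channels.
search = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
template = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(search, template)
```

In this toy setup, a search point that resembles one template point receives a larger attention weight for it, while a point equally similar to both template points receives an even mix of their features.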

GLT-T: Global-Local Transformer Voting for 3D Single Object Tracking in Point Clouds

GLT-T++: Global-Local Transformer for 3D Siamese Tracking with Ranking Loss

GTT: Visual Tracking with Gaussion Transformer

Real-Time 3D Single Object Tracking With Transformer

Global Tracking Transformers

Global-local feature-mixed network with template update for visual tracking

Instance-Guided Point Cloud Single Object Tracking With Inception Transformer

Accurate 3D Single Object Tracker With Local-to-Global Feature Refinement

TLPG-Tracker: Joint Learning of Target Localization and Proposal Generation for Visual Tracking

PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds

MLGT: multi-local guided tracker for visual object tracking

Integrating Scaling Strategy and Central Guided Voting for 3D Point Cloud Object Tracking

OST: Efficient One-stream Network for 3D Single Object Tracking in Point Clouds

GTTrack: Gaussian Transformer Tracker for Visual Tracking

MLVSNet: Multi-level Voting Siamese Network for 3D Visual Tracking

VTT: Long-term Visual Tracking with Transformers

3D Object Tracking with Transformer

LGTrack: Exploiting Local and Global Properties for Robust Visual Tracking

TH-Net: A Method of Single 3d Object Tracking Based on Transformers and Hausdorff Distance

TGLC: Visual object tracking by fusion of global-local information and channel information