Abstract: Single object tracking (SOT) in light detection and ranging (LiDAR) point clouds is a challenging problem in computer vision. Compared with object-level point clouds, the scene-level point clouds used for tracking are more complex, and handling them requires both long-range semantic awareness and local shape context. However, previous methods filter candidates directly from a limited set of matched features without systematically considering these two factors. Inspired by the transformer's ability to establish long-range dependencies and by convolution's ability to capture local high-frequency information, we propose a point-tracking inception transformer (PTIT), which efficiently predicts high-quality 3-D tracking results in a coarse-to-fine manner with the support of spatio-temporal point clouds. PTIT consists of three novel designs. 1) We design instance-guided sampling (IGS) to help identify and preserve the relevant points of the given template and the foreground points of the search area. 2) We propose a point inception transformer (PIT), which consists of a multifrequency attention module and a cross-attention module: the former captures both long-range dependencies and local detail, and the latter matches template and search-area features. 3) After generating coarse tracking results from cross-attention, we locate the target by motion transformation in the spatio-temporal point cloud to generate a fine-grained 3-D bounding box (BBox). In addition, we perform feature augmentation on the points and boxes to mitigate the negative effects of the lack of texture and the incompleteness of LiDAR point clouds. PTIT performs significantly better than previous state-of-the-art methods on the KITTI and nuScenes datasets. Our further analysis confirms the effectiveness of each component and shows the great potential of the inception transformer-centric paradigm when combined with spatio-temporal point clouds. Our code is available at https://github.com/ywu0912/TeamCode.git.
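
To make the template-to-search matching step more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which each search-area point queries the template features. It is not the authors' implementation: the function names `softmax` and `cross_attention`, the toy 2-D features, and the omission of learned projections and multiple heads are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search_feats, template_feats):
    """Hypothetical helper: attend from search points to template points.

    Queries are the search-area features; keys and values are the
    template features. Learned query/key/value projections are omitted
    (an assumption made to keep the sketch short).
    """
    dim = len(template_feats[0])
    scale = math.sqrt(dim)
    matched = []
    for query in search_feats:
        # Similarity of this search point to every template point.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in template_feats
        ]
        weights = softmax(scores)
        # Weighted average of the template features.
        matched.append([
            sum(w * value[d] for w, value in zip(weights, template_feats))
            for d in range(dim)
        ])
    return matched


# Toy features (assumed values, not taken from any dataset).
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
matched = cross_attention(search, template)
```

In this toy setup, a search point that resembles one template point draws most of its output from that point, while a search point equally similar to both receives their average. A real tracker would apply learned projections and feed the matched features to a box-prediction head.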