Abstract: 3D single object tracking is a key issue for robotics. In this paper, we propose a transformer module called Point-Track-Transformer (PTT) for point cloud-based 3D single object tracking. PTT module contains three blocks for feature embedding, position encoding, and self-attention feature computation. Feature embedding aims to place features closer in the embedding space if they have similar semantic information. Position encoding is used to encode coordinates of point clouds into high dimension distinguishable features. Self-attention generates refined attention features by computing attention weights. Besides, we embed the PTT module into the open-source state-of-the-art method P2B to construct PTT-Net. Experiments on the KITTI dataset reveal that our PTT-Net surpasses the state-of-the-art by a noticeable margin (~10%). Additionally, PTT-Net could achieve real-time performance (~40FPS) on NVIDIA 1080Ti GPU. Our code is open-sourced for the robotics community at https://github.com/shanjiayao/PTT.
What problem does this paper attempt to address?
This paper addresses 3D single-object tracking (3D SOT) in point cloud data. The authors note that existing 3D SOT methods mainly rely on RGB-D cameras and may fail under visual degradation or illumination changes. 3D LiDAR sensors are widely used for object tracking because they are insensitive to illumination changes and capture geometric information directly and more accurately. Even so, performing 3D SOT with point clouds alone still faces several challenges:
1. **Sparse and unordered point clouds**: Because the points have no canonical order, the network must be permutation-invariant.
2. **Higher-dimensional spatial parameters**: 3D object tracking has to estimate more parameters (e.g., x, y, z, w, h, l, ry), which makes it more computationally demanding than 2D visual tracking.
3. **Non-rigid objects**: Objects such as pedestrians are harder to track because stable features are difficult to extract from them.
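The first challenge can be made concrete with a minimal sketch. The per-point function and the sample points below are hypothetical, chosen only for illustration, and are not taken from the paper. The idea shown is a common one: applying the same map to every point and then pooling with a symmetric operation (here, a channel-wise maximum) gives a descriptor that does not depend on point order.

```python
import random


def per_point_feature(point):
    """Hypothetical shared per-point map: (x, y, z) -> three toy features."""
    x, y, z = point
    return (x + y + z, x * x + y * y + z * z, max(x, y, z))


def global_feature(points):
    """Apply the shared map to every point, then max-pool each channel."""
    feats = [per_point_feature(p) for p in points]
    return tuple(max(channel) for channel in zip(*feats))


# Illustrative points only; a real LiDAR scan would hold thousands of points.
points = [(0.0, 1.0, 2.0), (1.5, -0.5, 0.3), (-2.0, 0.4, 0.9)]

shuffled = points[:]
random.Random(0).shuffle(shuffled)

# The pooled descriptor is identical for any ordering of the same points.
assert global_feature(points) == global_feature(shuffled)
```

Max pooling is only one symmetric choice; a sum or mean over points would have the same order-independence property.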
To address these problems, the authors propose a Transformer-based module, called Point-Track-Transformer (PTT), for 3D single-object tracking in point clouds. The PTT module consists of three blocks: feature embedding, position encoding, and self-attention. Feature embedding places semantically similar features close together in the embedding space, position encoding maps point coordinates to high-dimensional distinguishable features, and self-attention computes attention weights to produce refined features. Together, these blocks re-weight the point cloud features so that tracking focuses on deeper cues of the target. The authors also embed the PTT module into the open-source state-of-the-art method P2B to construct a new network, PTT-Net. On the KITTI dataset, PTT-Net surpasses the prior state of the art by a noticeable margin (about 10%) while running in real time (about 40 FPS) on an NVIDIA 1080Ti GPU.
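The three blocks can be sketched in a few lines of plain Python. This is a generic illustration under stated assumptions, not the authors' implementation: the function name `ptt_like_block`, the weight matrices, and the toy inputs are all hypothetical, and the attention used here is ordinary scaled dot-product self-attention with queries, keys, and values all equal to the embedded features (learned projections are omitted). The released PTT code may differ in each of these choices.

```python
import math


def linear(vec, weight):
    """Multiply a vector (length d_in) by a weight matrix (d_in rows, d_out columns)."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ptt_like_block(feats, coords, w_embed, w_pos):
    """Hypothetical block: embed features, add position encoding, self-attend."""
    # 1) Feature embedding: project each per-point feature into the embedding space.
    embedded = [linear(f, w_embed) for f in feats]
    # 2) Position encoding: lift (x, y, z) to the same dimension and add it.
    pos = [linear(c, w_pos) for c in coords]
    x = [[e + p for e, p in zip(ev, pv)] for ev, pv in zip(embedded, pos)]
    # 3) Self-attention (assumption: Q = K = V = x, no learned projections).
    d = len(x[0])
    refined, attn = [], []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        attn.append(w)
        attended = [sum(wi * v[j] for wi, v in zip(w, x)) for j in range(d)]
        # Residual connection keeps the original embedded feature.
        refined.append([xi + ai for xi, ai in zip(q, attended)])
    return refined, attn


# Toy inputs: 3 points, 2 input channels, embedding dimension 4 (all illustrative).
feats = [[0.2, 0.1], [0.5, -0.3], [-0.4, 0.8]]
coords = [[0.0, 1.0, 2.0], [1.5, -0.5, 0.3], [-2.0, 0.4, 0.9]]
w_embed = [[0.1, 0.2, -0.1, 0.3], [0.0, -0.2, 0.4, 0.1]]
w_pos = [[0.05, 0.0, 0.1, -0.1], [0.2, 0.1, 0.0, 0.0], [-0.1, 0.3, 0.05, 0.2]]

refined, attn = ptt_like_block(feats, coords, w_embed, w_pos)
print(len(refined), len(refined[0]))
```

Each row of `attn` sums to one, so every refined feature is the point's own embedding plus a weighted average over all points. Reordering the input points reorders the output rows in the same way, which is consistent with the order-independence requirement noted above.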