GSOT3D: Towards Generic 3D Single Object Tracking in the Wild

Yifan Jiao, Yunhao Li, Junhua Ding, Qing Yang, Song Fu, Heng Fan, Libo Zhang
2024-12-03
Abstract: In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation results show that these models degrade heavily on GSOT3D, and that more effort is required for robust and generic 3D object tracking. Besides, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network and outperforms all current solutions by a large margin. By releasing GSOT3D, we expect to further advance 3D tracking in future research and applications. Our benchmark and model as well as the evaluation results will be publicly released at https://github.com/ailovejinx/GSOT3D.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limitations of current 3D single-object tracking (3D SOT) benchmark datasets in category diversity, scene diversity, and degrees of freedom (DoF). These limitations impede the development of general-purpose 3D SOT. Specifically:

1. **Lack of category diversity**: Existing 3D SOT datasets (such as KITTI and NuScenes) are mainly designed for autonomous driving and contain a very limited number of object categories (for example, 8 categories in KITTI and 23 categories in NuScenes), which makes them unsuitable for training and evaluating general-purpose 3D SOT models.
2. **Limited scenes**: General-purpose 3D SOT models need to locate target objects in various environments, but existing datasets only provide sequences in traffic scenes, which limits the generalization ability of the models.
3. **Limited degrees of freedom**: General-purpose 3D SOT models need to handle objects with arbitrary poses and sizes, which are usually described by 9DoF (6D pose and 3D size). However, the targets in existing datasets only have 7DoF (4D pose, i.e. 3D position plus a single heading angle, and 3D size), which hinders the development of models that can handle objects with arbitrary poses.

To overcome these limitations, the authors propose a new benchmark dataset, GSOT3D, with the following characteristics:

- **Rich object categories**: It contains 54 object categories, covering common targets in daily life.
- **Multi-modal support**: Each sequence provides data in multiple modalities (point clouds, RGB images, and depth maps), supporting different 3D SOT tasks, such as unimodal 3D SOT (based on point clouds) and multimodal 3D SOT (based on RGB-point cloud or RGB-depth).
- **Large-scale data**: It contains 620 sequences and about 123,000 frames, making it, to the authors' knowledge, the largest benchmark dataset dedicated to general-purpose 3D SOT.
- **High-quality annotation**: All sequences are manually annotated with 9DoF 3D bounding boxes and have been checked and corrected over multiple rounds to ensure high-precision annotation.

In addition, the authors evaluate 8 representative point-cloud-based 3D SOT models to understand how existing models perform on GSOT3D and to provide a baseline for future research. The evaluation results show that the performance of existing models drops significantly on GSOT3D, indicating that more effort is still needed to achieve robust and general-purpose 3D SOT.

Finally, the authors propose a simple and effective general-purpose 3D SOT model, PROT3D. Its core is a progressive spatio-temporal architecture: through multi-stage spatio-temporal matching and feature refinement, it progressively refines the features of the search area and learns more discriminative features, enabling more accurate tracking in complex scenes. According to the authors, it outperforms the other evaluated methods.
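
The difference between 7DoF and 9DoF boxes discussed above can be made concrete with a small sketch. The `Box3D` class, its field names, and the yaw-pitch-roll (Z-Y-X) rotation convention below are illustrative assumptions; the source does not specify how GSOT3D stores its annotations. A 7DoF box is treated here as the special case in which roll and pitch are both zero, so that only the heading angle can vary.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Box3D:
    """Hypothetical 9DoF box: 3D center, 3D size, 3 rotation angles (radians)."""
    x: float
    y: float
    z: float
    length: float        # extent along the box's local x axis
    width: float         # extent along the box's local y axis
    height: float        # extent along the box's local z axis
    roll: float = 0.0    # rotation about x
    pitch: float = 0.0   # rotation about y
    yaw: float = 0.0     # rotation about z (the only angle a 7DoF box has)

    def is_7dof(self) -> bool:
        """True when the box is expressible with position, size, and yaw only."""
        return self.roll == 0.0 and self.pitch == 0.0

    def rotation(self):
        """Return R = Rz(yaw) * Ry(pitch) * Rx(roll) as a 3x3 nested list.

        The Z-Y-X order is an assumed convention, not taken from the paper.
        """
        cr, sr = math.cos(self.roll), math.sin(self.roll)
        cp, sp = math.cos(self.pitch), math.sin(self.pitch)
        cw, sw = math.cos(self.yaw), math.sin(self.yaw)
        return [
            [cw * cp, cw * sp * sr - sw * cr, cw * sp * cr + sw * sr],
            [sw * cp, sw * sp * sr + cw * cr, sw * sp * cr - cw * sr],
            [-sp, cp * sr, cp * cr],
        ]

    def corners(self):
        """Return the 8 box corners in world coordinates as (x, y, z) tuples."""
        rot = self.rotation()
        half = (self.length / 2, self.width / 2, self.height / 2)
        center = (self.x, self.y, self.z)
        result = []
        for signs in product((-1, 1), repeat=3):
            # Corner in the box's local frame, then rotate and translate.
            local = [s * h for s, h in zip(signs, half)]
            result.append(tuple(
                center[i] + sum(rot[i][j] * local[j] for j in range(3))
                for i in range(3)
            ))
        return result


# A driving-style 7DoF box: only the heading (yaw) changes.
car = Box3D(0, 0, 0, length=4, width=2, height=1, yaw=math.pi / 2)

# A 9DoF box: the object is tipped forward, which a 7DoF box cannot express.
tipped = Box3D(0, 0, 0, length=4, width=2, height=1, pitch=math.pi / 2)
```

In this sketch, `car.is_7dof()` holds while `tipped.is_7dof()` does not: once an object can pitch or roll (for example, a handheld item being turned over), the yaw-only parameterization used for traffic scenes can no longer describe its box.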