Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

Xin Shuo, Zhang Zhen, Wang Mengmeng, Hou Xiaojun, Guo Yaowei, Kang Xiao, Liu Liang, Liu Yong
DOI: https://doi.org/10.1109/icra57147.2024.10610979
2024-01-01
Abstract: Tracking a specific person in a 3D scene is gaining momentum due to its numerous applications in robotics. Currently, most 3D trackers focus on driving scenarios, where jitter is neglected and the surroundings are uncomplicated. As a result, they degrade severely in complex environments, especially on jolting robot platforms (only 20-60% success rate). To improve accuracy, a Point-Video-based Transformer Tracking model (PVTrack) is presented for robots. It is the first multi-modal 3D human tracking work that combines point clouds with RGB videos to exploit their complementary information. Moreover, PVTrack proposes the Siamese Point-Video Transformer for feature aggregation in dynamic environments, which adaptively captures more target-aware information through a hierarchical attention mechanism. Considering the violent shaking of robots and rugged terrain, a lateral Human-ware Proposal Network is designed together with an Anti-shake Proposal Compensation module. Together they alleviate the disturbance caused by complex scenes and by the particularities of the robot platform. Experiments show that our method achieves state-of-the-art performance on both the KITTI and Waymo datasets and on a quadruped robot in various indoor and outdoor scenes.