ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer

Tianye Ding, Hongyu Li, Huaizu Jiang
2024-10-25
Abstract: Obstacle detection and tracking represent a critical component in robot autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based model to address both obstacle detection and tracking problems. For the detection task, our approach leverages deformable attention to construct a 3D cost volume, which is decoded progressively in the form of voxel occupancy grids. We further track the obstacles by matching the voxels between consecutive frames. The entire model can be optimized in an end-to-end manner. Through extensive experiments on DrivingStereo and KITTI benchmarks, our model achieves state-of-the-art performance in the obstacle detection task. We also report comparable accuracy to state-of-the-art obstacle tracking models while requiring only a fraction of their computation cost, typically ten-fold to twenty-fold less. The code and model weights will be publicly released.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses obstacle detection and tracking for autonomous robot navigation. Specifically:

1. **Obstacle Detection**: To avoid collisions, a robot must detect surrounding obstacles such as pedestrians and poles. Existing stereo camera-based methods typically rely on a depth estimation module and then convert the depth map into a point cloud or voxel grid. This pipeline often forces a trade-off between speed and accuracy.
2. **Obstacle Tracking**: In dynamic environments, obstacles such as pedestrians may move unpredictably, so the robot must also track their motion. Traditional tracking methods (such as the Kalman filter) usually require carefully tuned parameters, which limits their robustness. Scene flow estimation methods can estimate 3D structure and motion simultaneously, but they are computationally expensive and unsuitable for real-time applications.

### Solution

To address these issues, the authors propose **ODTFormer**, a Transformer-based model that handles obstacle detection and tracking together. The main innovations are:

1. **3D Cost Volume Construction**: Unlike existing methods, ODTFormer uses deformable cross-attention to query 3D voxel features from 2D stereo image features and compute matching costs. The cost volume is therefore constructed directly in 3D space, which aligns better with scene geometry and does not depend on dataset-specific parameters, so it generalizes better.
2. **Voxel Tracking**: To handle dynamic environments, the authors introduce an obstacle tracking method that captures scene motion by matching similar voxels between two frames. Each voxel searches for its corresponding voxel in the next frame only within a bounded volume, which improves both accuracy and efficiency.
3. **End-to-End Optimization**: The entire model can be optimized end-to-end, with the detection and tracking modules trained jointly, which enhances overall performance.

### Experimental Results

The authors conducted extensive experiments on the **DrivingStereo** and **KITTI** benchmark datasets, showing that:

- In the obstacle detection task, ODTFormer significantly outperforms existing methods, especially on the IoU and Chamfer Distance metrics.
- In the obstacle tracking task, ODTFormer achieves accuracy comparable to current state-of-the-art methods at only one-tenth to one-twentieth of their computational cost.

### Conclusion

ODTFormer addresses obstacle detection and tracking in autonomous robot navigation through its 3D cost volume construction and voxel tracking methods, offering high accuracy at low computational cost.
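
The voxel tracking idea summarized under Solution (matching each voxel to a similar voxel in the next frame, searching only within a bounded volume) can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine` and `track_voxels`, the dictionary-based voxel representation, the cosine similarity score, and the hard argmax match are all assumptions made here for illustration.

```python
import math
from itertools import product


def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def track_voxels(feats_t, feats_t1, radius=1):
    """Match occupied voxels of frame t to frame t+1 within a bounded neighborhood.

    feats_t, feats_t1: dict mapping an (x, y, z) voxel index to a feature vector.
    radius: hypothetical search bound, in voxels, along each axis.
    Returns a dict mapping each voxel in frame t to the displacement
    (dx, dy, dz) of its best match, or None if no occupied voxel is in range.
    """
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    motion = {}
    for (x, y, z), feat in feats_t.items():
        best_offset, best_score = None, -math.inf
        for dx, dy, dz in offsets:
            candidate = feats_t1.get((x + dx, y + dy, z + dz))
            if candidate is None:
                continue  # that neighbor is not occupied in frame t+1
            score = cosine(feat, candidate)
            if score > best_score:
                best_offset, best_score = (dx, dy, dz), score
        motion[(x, y, z)] = best_offset
    return motion


# Toy example: one voxel moves by +1 along x, one stays put, one has no match.
frame_t = {(0, 0, 0): [1.0, 0.0], (5, 5, 5): [0.0, 1.0], (9, 9, 9): [1.0, 1.0]}
frame_t1 = {(1, 0, 0): [1.0, 0.0], (0, 1, 0): [0.0, 1.0], (5, 5, 5): [0.0, 1.0]}
motion = track_voxels(frame_t, frame_t1, radius=1)
```

Bounding the search to `(2 * radius + 1) ** 3` candidates per voxel keeps the cost linear in the number of occupied voxels, which is one plausible reading of why a bounded search volume helps efficiency.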
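
For readers unfamiliar with the two detection metrics named under Experimental Results, the following is a minimal sketch of IoU and Chamfer Distance on sets of occupied voxels. It uses common textbook definitions, not necessarily the exact variants used in the paper: the symmetric, mean-based, unsquared Chamfer form and the use of voxel indices rather than metric coordinates are assumptions.

```python
import math


def voxel_iou(pred, gt):
    """Intersection over union of two sets of occupied (x, y, z) voxel indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    if not union:
        return 1.0  # convention assumed here: two empty grids agree perfectly
    return len(pred & gt) / len(union)


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    pred, gt = list(pred), list(gt)
    if not pred or not gt:
        return math.inf  # undefined when one set is empty; inf is an assumed convention

    def one_way(src, dst):
        # Average distance from each source voxel to its nearest destination voxel.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, gt) + one_way(gt, pred)


# Toy example: one shared voxel, one predicted voxel off by one cell.
pred = {(0, 0, 0), (1, 0, 0)}
gt = {(0, 0, 0), (2, 0, 0)}
iou = voxel_iou(pred, gt)
cd = chamfer_distance(pred, gt)
```

IoU only rewards exact voxel overlap, while Chamfer Distance also credits predictions that are close but not coincident, which is why the two are usually reported together.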