Abstract: Obstacle detection and tracking represent a critical component in robot autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based model to address both obstacle detection and tracking problems. For the detection task, our approach leverages deformable attention to construct a 3D cost volume, which is decoded progressively in the form of voxel occupancy grids. We further track the obstacles by matching the voxels between consecutive frames. The entire model can be optimized in an end-to-end manner. Through extensive experiments on DrivingStereo and KITTI benchmarks, our model achieves state-of-the-art performance in the obstacle detection task. We also report comparable accuracy to state-of-the-art obstacle tracking models while requiring only a fraction of their computation cost, typically ten-fold to twenty-fold less. The code and model weights will be publicly released.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the issues of obstacle detection and tracking in autonomous robot navigation. Specifically:

1. **Obstacle Detection**: In autonomous navigation, robots need to detect surrounding obstacles (such as pedestrians, poles, etc.) to avoid collisions. Existing stereo camera-based methods typically rely on depth estimation modules, converting depth maps into point clouds or voxel grids. However, this approach often requires a trade-off between speed and accuracy.
2. **Obstacle Tracking**: In dynamic environments, obstacles may be randomly moving pedestrians, thus requiring the ability to track the movement of these obstacles. Traditional tracking methods (such as the Kalman filter) usually require carefully tuned parameters, leading to insufficient robustness. Additionally, scene flow estimation methods, while capable of estimating 3D structure and motion simultaneously, are computationally expensive and unsuitable for real-time applications.

### Solution

To address the above issues, the authors propose **ODTFormer**, a Transformer-based model capable of handling both obstacle detection and tracking tasks simultaneously. The main innovations include:

1. **3D Cost Volume Construction**: Unlike existing methods, ODTFormer uses deformable cross-attention to query 3D voxel features from 2D stereo image features to compute matching costs. This allows the cost volume to be constructed directly in 3D space, better aligning with scene geometry and not relying on specific dataset parameters, thus offering better generalization.
2. **Voxel Tracking**: To handle dynamic environments, the authors introduce a new obstacle tracking method that captures scene motion by matching similar voxels between two frames. By setting the volume boundary of each voxel to search for its corresponding voxel in the next frame, accuracy and efficiency are improved.
3. **End-to-End Optimization**: The entire model can be optimized end-to-end, with detection and tracking modules jointly trained, enhancing overall performance.

### Experimental Results

The authors conducted extensive experiments on the **DrivingStereo** and **KITTI** benchmark datasets, showing that:

- In the obstacle detection task, ODTFormer significantly outperforms existing methods, especially in IoU and Chamfer Distance metrics.
- In the obstacle tracking task, ODTFormer achieves accuracy comparable to current state-of-the-art methods but with only one-tenth to one-twentieth of their computational cost.

### Conclusion

ODTFormer effectively addresses the issues of obstacle detection and tracking in autonomous robot navigation through innovative 3D cost volume construction and voxel tracking methods, offering high accuracy and low computational cost.
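
The voxel tracking idea summarized above (matching each voxel to a similar voxel in the next frame, within a bounded search volume) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `match_voxels`, the `radius` parameter, the `(X, Y, Z, C)` feature layout, and the use of cosine similarity with a brute-force window search are all assumptions made for illustration, whereas ODTFormer learns its voxel features and matching end-to-end.

```python
import itertools

import numpy as np


def match_voxels(feat_t, feat_t1, radius=1):
    """Hypothetical helper: for every voxel in frame t, find the offset
    (within +/- radius voxels per axis) of the most similar voxel in frame t+1.

    feat_t, feat_t1: arrays of shape (X, Y, Z, C) holding per-voxel features.
    Returns (offsets, sims) with shapes (X, Y, Z, 3) and (X, Y, Z).
    """
    X, Y, Z, _ = feat_t.shape

    def normalize(f):
        # L2-normalize features so the dot product is a cosine similarity.
        return f / (np.linalg.norm(f, axis=-1, keepdims=True) + 1e-8)

    a = normalize(feat_t)
    b = normalize(feat_t1)

    # Zero-pad frame t+1 so search windows at the grid border stay in bounds.
    r = radius
    b_pad = np.pad(b, ((r, r), (r, r), (r, r), (0, 0)))

    best_sim = np.full((X, Y, Z), -np.inf)
    best_off = np.zeros((X, Y, Z, 3), dtype=np.int64)

    # Enumerate every candidate displacement inside the bounded search volume.
    for dx, dy, dz in itertools.product(range(-r, r + 1), repeat=3):
        # shifted[x, y, z] equals b[x + dx, y + dy, z + dz] (zeros if outside).
        shifted = b_pad[r + dx:r + dx + X, r + dy:r + dy + Y, r + dz:r + dz + Z]
        sim = np.sum(a * shifted, axis=-1)
        better = sim > best_sim
        best_sim[better] = sim[better]
        best_off[better] = (dx, dy, dz)

    return best_off, best_sim


# Toy example: one distinctive voxel moves by (+1, 0, -1) between frames.
feat_t = np.zeros((5, 5, 5, 4))
feat_t1 = np.zeros((5, 5, 5, 4))
feat_t[2, 2, 2] = [1.0, 0.0, 0.0, 0.0]
feat_t1[3, 2, 1] = [1.0, 0.0, 0.0, 0.0]

offsets, sims = match_voxels(feat_t, feat_t1, radius=1)
```

In this toy case `offsets[2, 2, 2]` recovers the displacement of the moved voxel. Limiting the search to a window of `(2 * radius + 1) ** 3` candidates, rather than comparing against every voxel in the grid, is what keeps the cost of such matching bounded per voxel.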

ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

Transformer-based stereo-aware 3D object detection from binocular images

OcTr: Octree-based Transformer for 3D Object Detection

Target-aware transformer tracking with hard occlusion instance generation

MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots

STDFormer: Spatial-Temporal Motion Transformer for Multiple Object Tracking

Real-Time 3D Single Object Tracking With Transformer

Transformation-Equivariant 3D Object Detection for Autonomous Driving

STT: Stateful Tracking with Transformers for Autonomous Driving

TransCenter: Transformers With Dense Representations for Multiple-Object Tracking

Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking

Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations

MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles