Abstract: Tracking in urban street scenes plays a central role in autonomous systems such as self-driving cars. Most of the current vision-based tracking methods perform tracking in the image domain. Other approaches, e.g., based on LIDAR and radar, track purely in 3D. While some vision-based tracking methods invoke 3D information in parts of their pipeline, and some 3D-based methods utilize image-based information in components of their approach, we propose to use image- and world-space information jointly throughout our method. We present our tracking pipeline as a 3D extension of image-based tracking. From enhancing the detections with 3D measurements to the reported positions of every tracked object, we use world-space 3D information at every stage of processing. We accomplish this by our novel coupled 2D-3D Kalman filter, combined with a conceptually clean and extendable hypothesize-and-select framework. Our approach matches the current state-of-the-art on the official KITTI benchmark, which performs evaluation in the 2D image domain only. Further experiments show significant improvements in 3D localization precision by enabling our coupled 2D-3D tracking.

What problem does this paper attempt to address?

This paper aims to track traffic participants (such as pedestrians, cars and bicycles) in urban street scenes more accurately. Specifically, it proposes a tracking framework that combines image-domain and world-space information to improve the accuracy and robustness of tracking. The main contributions of the paper are:

1. **Propose a new tracking framework**: The framework uses both 2D and 3D measurements. It combines object detections (e.g., cars, pedestrians) with 3D object proposals obtained from 3D point clouds, so that each source contributes what it does best: 2D detections provide category information, while 3D proposals help to locate objects in world coordinates.
2. **Introduce a new 2D-3D Kalman filter**: The filter maintains position and size estimates in the image domain and in world space simultaneously. The two sets of estimates are loosely coupled through projection and back-projection operations, which keeps the trajectory consistent. This coupling makes it possible to track distant objects and to transition smoothly to the more accurate 3D information at close range.
3. **Demonstrate competitiveness on the KITTI benchmark**: In addition to the image-domain evaluation, the paper evaluates accuracy in 3D space to quantify the advantages of the method.

### Specific Problem Description

Most current vision-based tracking methods perform tracking only in the image domain, while in mobile robotics and autonomous driving, accurate 3D localization and trajectory estimation are crucial. To prevent collisions, especially with objects close to the camera, it is very important to know the position and orientation of objects in world space. However, the accuracy of stereo-based 3D measurements drops rapidly with distance. The paper therefore proposes a method that combines 2D and 3D information to improve image-domain tracking accuracy and to achieve a significant improvement in 3D localization.

### Solution

The method proposed in the paper includes the following key steps:

1. **Observation fusion model**: Generate observations by combining 2D detections and 3D object proposals. A conditional random field (CRF) model selects appropriate observations and excludes implausible ones.
2. **Joint 2D-3D Kalman filter**: Extend the traditional Kalman filter to handle 2D and 3D states simultaneously. Projection and back-projection operations keep the 2D and 3D states consistent.
3. **Hypothesis generation and selection**: Generate an overcomplete set of trajectory hypotheses and select the best ones with the CRF model. This ensures that the selected trajectories are physically plausible and do not overlap in space-time.

### Experimental Results

The paper reports experiments on the KITTI benchmark. The results show:

- **Localization error analysis**: Across different distance ranges, the proposed method performs well in terms of depth error and lateral error, especially at long distances.
- **Ablation study**: Turning off parts of the pipeline verifies the influence of each component on overall performance. The results show that scene flow and 3D measurements contribute significantly to tracking accuracy.

In conclusion, by combining 2D and 3D information, the paper proposes a new tracking framework that significantly improves the accuracy and robustness of tracking traffic participants in urban street scenes.
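To make the idea of a loosely coupled 2D-3D Kalman filter more concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes a pinhole camera with hypothetical focal length `f` and principal point `cu`, reduces the image state to a single column coordinate `u`, reduces the world state to a ground-plane position `(X, Z)`, and uses constant-velocity motion models. The class name `Coupled2D3DFilter`, all noise values, and the choice to couple the two filters through pseudo-measurements are illustrative assumptions. The only physical fact relied on is that stereo depth uncertainty grows roughly with the square of the distance.

```python
import numpy as np


def kf_predict(x, P, F, Q):
    """Standard linear Kalman prediction."""
    return F @ x, F @ P @ F.T + Q


def kf_update(x, P, z, H, R):
    """Standard linear Kalman measurement update."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P


class Coupled2D3DFilter:
    """Illustrative sketch: one image-space and one world-space filter,
    loosely coupled by projection and back-projection (all values assumed)."""

    def __init__(self, u0, X0, Z0, f=700.0, cu=600.0, dt=0.1):
        self.f, self.cu = f, cu                    # hypothetical intrinsics
        self.x2 = np.array([u0, 0.0])              # image state: [u, du/dt]
        self.P2 = np.diag([25.0, 100.0])
        self.x3 = np.array([X0, Z0, 0.0, 0.0])     # world state: [X, Z, vX, vZ]
        self.P3 = np.diag([1.0, 4.0, 100.0, 100.0])
        self.F2 = np.array([[1.0, dt], [0.0, 1.0]])
        self.F3 = np.eye(4)
        self.F3[0, 2] = self.F3[1, 3] = dt
        self.Q2 = np.diag([1.0, 25.0])
        self.Q3 = np.diag([0.01, 0.01, 0.1, 0.1])
        self.H2 = np.array([[1.0, 0.0]])           # observe u
        self.H3 = np.array([[1.0, 0.0, 0.0, 0.0],  # observe X and Z
                            [0.0, 1.0, 0.0, 0.0]])
        self.HX = np.array([[1.0, 0.0, 0.0, 0.0]])  # observe X only

    def project(self, X, Z):
        """World ground-plane point -> image column (pinhole model)."""
        return self.f * X / max(Z, 1e-3) + self.cu

    def back_project(self, u, Z):
        """Image column at a given depth -> lateral world position X."""
        return (u - self.cu) * Z / self.f

    def step(self, u_det=None, xz_meas=None):
        # 1. Predict both states independently.
        self.x2, self.P2 = kf_predict(self.x2, self.P2, self.F2, self.Q2)
        self.x3, self.P3 = kf_predict(self.x3, self.P3, self.F3, self.Q3)
        # 2. Image-space update from a 2D detection, if one is available.
        if u_det is not None:
            self.x2, self.P2 = kf_update(self.x2, self.P2, u_det,
                                         self.H2, np.array([[4.0]]))
        # 3. World-space update from a 3D measurement, if one is available.
        #    Depth noise grows with Z^2 (assumed scale factor 0.002).
        if xz_meas is not None:
            Z_m = float(xz_meas[1])
            R3 = np.diag([0.1, (0.002 * Z_m ** 2) ** 2 + 1e-4])
            self.x3, self.P3 = kf_update(self.x3, self.P3, xz_meas,
                                         self.H3, R3)
        # 4. Loose coupling: each state acts as a weak pseudo-measurement
        #    for the other. The large noise values keep the coupling soft.
        u_from_3d = self.project(self.x3[0], self.x3[1])
        X_from_2d = self.back_project(self.x2[0], self.x3[1])
        self.x2, self.P2 = kf_update(self.x2, self.P2, u_from_3d,
                                     self.H2, np.array([[100.0]]))
        self.x3, self.P3 = kf_update(self.x3, self.P3, X_from_2d,
                                     self.HX, np.array([[1.0]]))
        return self.x2.copy(), self.x3.copy()
```

Calling `step(u_det=..., xz_meas=None)` for a distant object updates the track from image evidence alone, while the back-projection step still nudges the world-space estimate. Once 3D measurements become reliable at close range, their smaller noise lets them dominate. Treating one filter's state as a pseudo-measurement for the other is a simplification that reuses information twice; a more careful design would model the joint state and its cross-covariance explicitly.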

Combined Image- and World-Space Tracking in Traffic Scenes

Object tracking with 3D LIDAR via multi-task sparse learning

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

Accurate and Real-Time 3-D Tracking for the Following Robots by Fusing Vision and Ultrasonar Information

Integration of the 3D Environment for UAV Onboard Visual Object Tracking

Urban Traffic Surveillance (UTS): A fully probabilistic 3D tracking approach based on 2D detections

Online Multi-Object Tracking Using Joint Domain Information in Traffic Scenarios

DirectTracker: 3D Multi-Object Tracking Using Direct Image Alignment and Photometric Bundle Adjustment

Towards Multi-Object Detection and Tracking in Urban Scenario under Uncertainties

Monocular Quasi-Dense 3D Object Tracking

Joint Monocular 3D Vehicle Detection and Tracking

Environment Perception Framework Fusing Multi-Object Tracking, Dynamic Occupancy Grid Maps and Digital Maps

Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving

Behavioral Pedestrian Tracking Using a Camera and LiDAR Sensors on a Moving Vehicle

A Tracking-By-Detection Based 3D Multiple Object Tracking for Autonomous Driving

Dynamic Object Tracking for Self-Driving Cars Using Monocular Camera and LIDAR

3D Extended Object Tracking by Fusing Roadside Sparse Radar Point Clouds and Pixel Keypoints

First-person Multiple Object Tracking in Complex Traffic Scenes

3D Multi-Object Tracking with Adaptive Cubature Kalman Filter for Autonomous Driving

Multiple-Kernel Based Vehicle Tracking Using 3D Deformable Model and Camera Self-Calibration

Cross-Modal 3D Object Detection and Tracking for Auto-Driving