DepthMOT: Depth Cues Lead to a Strong Multi-Object Tracker

Jiapeng Wu, Yichen Liu
2024-04-08
Abstract: Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms. However, achieving this goal remains challenging, primarily for two reasons. (i) In crowded scenes with occluded objects, the high overlap of object bounding boxes leads to confusion among closely located objects. Nevertheless, humans naturally perceive the depth of elements in a scene when observing 2D videos. Inspired by this, even though the bounding boxes of objects are close on the camera plane, we can differentiate them in the depth dimension, thereby establishing a 3D perception of the objects. (ii) In videos with rapid, irregular camera motion, abrupt changes in object positions can result in ID switches. However, if the camera pose is known, we can compensate for the errors in linear motion models. In this paper, we propose \textit{DepthMOT}, which achieves: (i) detecting objects and estimating the scene depth map \textit{end-to-end}, (ii) compensating for irregular camera motion through camera pose estimation. Extensive experiments demonstrate the superior performance of DepthMOT on the VisDrone-MOT and UAVDT datasets. The code will be available at \url{
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two main issues in Multi-Object Tracking (MOT):

1. **Distinguishing objects in crowded scenes**: In crowded scenes, occlusion causes the bounding boxes of objects to overlap heavily, which makes closely adjacent objects hard to tell apart. Humans, however, naturally perceive the depth of elements in a scene when watching 2D videos. Objects whose boxes are close on the image plane can therefore still be separated along the depth dimension, which gives a three-dimensional perception of the objects.
2. **ID switching under fast, irregular camera motion**: In videos with fast, irregular camera movement, sudden changes in object positions can lead to ID switches. If the camera pose is known, this effect can be reduced by compensating for the errors of the linear motion model.

To address these issues, the paper proposes **DepthMOT**, which has two components:

- **Detection and estimation of scene depth maps**: The method detects objects and estimates the scene's depth map in an end-to-end manner, and uses the average depth value at the bottom of each object's bounding box as that object's depth.
- **Compensation for irregular camera motion**: The method estimates the camera pose change between adjacent frames and uses it to correct the Kalman filter errors caused by camera motion.

Experimental results show that DepthMOT performs strongly on the VisDrone-MOT and UAVDT datasets, with significant improvements in metrics such as HOTA, MOTA, and IDF1.
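The idea of reading an object's depth from the bottom of its bounding box can be illustrated with a minimal sketch. This is not the authors' code: the function name `box_bottom_depth`, the `band` fraction that defines "the bottom", and the list-of-rows depth map are all assumptions made for illustration.

```python
def box_bottom_depth(depth_map, box, band=0.2):
    """Mean depth over the bottom strip of a bounding box.

    depth_map: list of rows (H x W) of floats.
    box: (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
    band: fraction of the box height treated as "the bottom" (assumed value).
    """
    height = len(depth_map)
    width = len(depth_map[0])
    x1, y1, x2, y2 = box
    # Clip the box to the image so partially visible objects still work.
    x1, x2 = max(0, int(x1)), min(width, int(x2))
    y1, y2 = max(0, int(y1)), min(height, int(y2))
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box lies outside the depth map")
    # Keep at least one row, even for very short boxes.
    rows = max(1, round((y2 - y1) * band))
    strip = [depth_map[y][x] for y in range(y2 - rows, y2) for x in range(x1, x2)]
    return sum(strip) / len(strip)


# Toy 4x4 depth map in which the depth value equals the row index.
toy_depth = [[float(r)] * 4 for r in range(4)]
depth_a = box_bottom_depth(toy_depth, (0, 0, 2, 4), band=0.5)  # averages rows 2 and 3
depth_b = box_bottom_depth(toy_depth, (1, 0, 3, 2), band=0.5)  # averages row 1 only
```

In this toy case the two boxes overlap horizontally but receive different depth values, which is the kind of cue that can help separate objects whose boxes are close on the image plane.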
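One way to picture camera-motion compensation is standard pinhole reprojection: back-project a predicted box center using its depth, apply the relative camera pose, and project it into the current frame. The sketch below makes several assumptions that the summary does not confirm: the function names, the intrinsics tuple `(fx, fy, cx, cy)`, the convention that `(rotation, translation)` maps previous-frame coordinates to current-frame coordinates, and the simplification that box size is left unchanged. It is not the paper's exact Kalman filter correction.

```python
def reproject_point(u, v, depth, intrinsics, rotation, translation):
    """Move pixel (u, v) with known depth from the previous frame to the current one."""
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the previous camera frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Apply the relative camera pose (assumed convention): X' = R X + t.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:
        raise ValueError("point falls behind the camera after the motion")
    # Project the moved point back onto the image plane.
    return fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy


def compensate_box(box, depth, intrinsics, rotation, translation):
    """Shift a predicted box (center_x, center_y, w, h) by the motion of its center."""
    bx, by, w, h = box
    new_x, new_y = reproject_point(bx, by, depth, intrinsics, rotation, translation)
    # Simplification: width and height are kept as predicted.
    return new_x, new_y, w, h


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 50.0)  # hypothetical intrinsics: fx, fy, cx, cy
predicted = (60.0, 40.0, 10.0, 20.0)

unchanged = compensate_box(predicted, 5.0, K, IDENTITY, [0.0, 0.0, 0.0])
near_shift = compensate_box(predicted, 5.0, K, IDENTITY, [1.0, 0.0, 0.0])
far_shift = compensate_box(predicted, 10.0, K, IDENTITY, [1.0, 0.0, 0.0])
```

With the same sideways camera translation, the box at depth 5 moves twice as far in the image as the box at depth 10. This illustrates why depth and camera pose together can correct a linear motion prediction better than a single image-wide shift.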