DepthMOT: Depth Cues Lead to a Strong Multi-Object Tracker

Jiapeng Wu, Yichen Liu

2024-04-08

Abstract: Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms. However, achieving this goal remains challenging, primarily for two reasons. (i) In crowded scenes with occluded objects, the high overlap of bounding boxes leads to confusion among closely located objects. Humans, however, naturally perceive the depth of elements in a scene when observing 2D videos. Inspired by this, even though the bounding boxes of objects are close on the camera plane, we can differentiate them in the depth dimension, thereby establishing a 3D perception of the objects. (ii) In videos with rapid, irregular camera motion, abrupt changes in object positions can result in ID switches. However, if the camera pose is known, we can compensate for the errors in linear motion models. In this paper, we propose DepthMOT, which (i) detects objects and estimates the scene depth map end-to-end, and (ii) compensates for irregular camera motion through camera pose estimation. Extensive experiments demonstrate the superior performance of DepthMOT on the VisDrone-MOT and UAVDT datasets. The code will be available at \url{

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two main issues in Multi-Object Tracking (MOT):

1. **Distinguishing objects in crowded scenes**: In crowded scenes, objects occlude one another and their bounding boxes overlap heavily, which makes closely adjacent objects hard to tell apart. Humans, however, naturally perceive the depth of elements in a scene when watching 2D video, so objects that are close on the image plane can still be separated along the depth dimension, giving a three-dimensional perception of the objects.
2. **ID switching under fast irregular camera motion**: In videos with fast, irregular camera movement, sudden changes in object positions can cause ID switches. If the camera pose is known, this effect can be reduced by compensating for the errors of the linear motion model.

To address these issues, the paper proposes **DepthMOT**, which has two components:

- **Detection and estimation of scene depth maps**: The method detects objects and estimates the scene depth map end-to-end, then takes the average depth value along the bottom of each object's bounding box as that object's depth.
- **Compensation for irregular camera motion**: The method estimates the change in camera pose between adjacent frames and uses it to correct the Kalman filter errors caused by camera motion.

Experimental results show that DepthMOT performs strongly on the VisDrone-MOT and UAVDT datasets, with reported improvements in HOTA, MOTA, and IDF1.
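The two ideas above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `band_frac` parameter, and the use of a 2x3 affine matrix to stand in for inter-frame camera motion are all assumptions made for illustration. The summary only states that object depth is taken as the average depth at the bottom of the bounding box and that the estimated camera pose change is used to correct the Kalman filter.

```python
from statistics import fmean


def object_depth(depth_map, box, band_frac=0.1):
    """Mean depth over a thin strip at the bottom of a bounding box.

    depth_map: H x W list of rows of floats (per-pixel depth).
    box: (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
    band_frac: hypothetical parameter, strip height as a fraction of box height.
    """
    height, width = len(depth_map), len(depth_map[0])
    x1, y1, x2, y2 = box
    # Clip the box to the image so slicing stays in bounds.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x1 >= x2 or y1 >= y2:
        raise ValueError("box lies outside the depth map")
    # Use at least one pixel row at the bottom edge of the box.
    band = max(1, round((y2 - y1) * band_frac))
    rows = depth_map[y2 - band:y2]
    return fmean(value for row in rows for value in row[x1:x2])


def compensate_position(xy, affine):
    """Warp a predicted (x, y) position by a 2x3 affine camera-motion matrix.

    Assumption: camera motion between adjacent frames is approximated by an
    affine transform on the image plane. The source does not specify this form.
    """
    x, y = xy
    (a, b, tx), (c, d, ty) = affine
    return (a * x + b * y + tx, c * x + d * y + ty)


# Toy 6 x 8 depth map: far background (10.0) with a near region (2.0).
depth_map = [[10.0] * 8 for _ in range(6)]
for r in (4, 5):
    for c in (2, 3, 4):
        depth_map[r][c] = 2.0

near_box = (2, 1, 5, 6)  # its bottom edge rests on the near region
far_box = (3, 0, 7, 4)   # overlaps near_box in the image, but lies farther away
near_depth = object_depth(depth_map, near_box)
far_depth = object_depth(depth_map, far_box)

# Pure translation of the camera view: 5 px right, 3 px up.
shift = ((1.0, 0.0, 5.0), (0.0, 1.0, -3.0))
moved = compensate_position((10.0, 20.0), shift)
```

In this toy case the two boxes overlap on the image plane yet receive clearly different depth values, which is the cue the summary describes for separating nearby objects. In a full tracker the warp would be applied to the Kalman filter's predicted state before association; how DepthMOT does this in detail is not stated in the text above.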
