Abstract: RGB-D tracking significantly improves the accuracy of object tracking. However, its dependency on real depth inputs and the complexity involved in multi-modal fusion limit its applicability across various scenarios. The utilization of depth information in RGB-D tracking inspired us to propose a new method, named MDETrack, which trains a tracking network with an additional capability to understand the depth of scenes, through supervised or self-supervised auxiliary Monocular Depth Estimation learning. The outputs of MDETrack's unified feature extractor are fed to the side-by-side tracking head and auxiliary depth estimation head, respectively. The auxiliary module will be discarded in inference, thus keeping the same inference speed. We evaluated our models with various training strategies on multiple datasets, and the results show an improved tracking accuracy even without real depth. Through these findings we highlight the potential of depth estimation in enhancing object tracking performance.
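The two-head layout described in the abstract can be illustrated with a minimal sketch. Everything here is a toy stand-in under stated assumptions: `shared_extractor`, `tracking_head`, `depth_head`, and `TwoHeadTracker` are hypothetical names, and the arithmetic inside them is not the paper's network. The sketch only shows that both heads read the same features and that removing the auxiliary head leaves the tracking output unchanged.

```python
from typing import Callable, Dict, List

Vector = List[float]


def shared_extractor(frame: Vector) -> Vector:
    """Toy stand-in for the unified feature extractor (assumption, not the real backbone)."""
    mean = sum(frame) / len(frame)
    return [pixel - mean for pixel in frame]


def tracking_head(features: Vector) -> Dict[str, float]:
    """Hypothetical tracking head: picks the position with the strongest feature."""
    best = max(range(len(features)), key=lambda i: features[i])
    return {"target_index": float(best), "score": features[best]}


def depth_head(features: Vector) -> Vector:
    """Hypothetical auxiliary head: maps each feature to a positive pseudo-depth."""
    return [1.0 + abs(f) for f in features]


class TwoHeadTracker:
    """Shared features feed a tracking head and an auxiliary depth head."""

    def __init__(self) -> None:
        self.heads: Dict[str, Callable] = {"track": tracking_head, "depth": depth_head}

    def forward(self, frame: Vector) -> Dict[str, object]:
        # Features are computed once and reused by every head that is still attached.
        features = shared_extractor(frame)
        return {name: head(features) for name, head in self.heads.items()}

    def drop_auxiliary(self) -> None:
        # At inference the depth head is removed, so no extra computation remains.
        self.heads.pop("depth", None)


model = TwoHeadTracker()
frame = [0.1, 0.9, 0.2, 0.4]
train_out = model.forward(frame)   # both heads are active during training
model.drop_auxiliary()
infer_out = model.forward(frame)   # only the tracking head runs at inference
```

Because the depth head only consumes the shared features and never feeds back into the tracking head at run time, dropping it changes neither the tracking output nor the cost of producing it.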

What problem does this paper attempt to address?

The problem this paper addresses is how to reduce object tracking's dependence on real depth information and avoid the complexity of multi-modal fusion, so that depth-aware tracking can be applied in a wider range of scenarios. Specifically, the authors propose a new method, MDETrack, which strengthens the tracking network's understanding of scene depth through supervised or self-supervised monocular depth estimation learning. The method aims to improve tracking accuracy without requiring real depth input at inference.

### Background and Problems of the Paper

1. **Limitations of RGB-D Tracking**
   - Although RGB-D tracking significantly improves tracking accuracy, it depends on real depth input and complex multi-modal fusion, which limits its application across scenarios.
   - Large-scale real depth data is very difficult to obtain, which further limits the applicability of RGB-D tracking.
2. **Motivation for Auxiliary Learning**
   - Auxiliary learning improves the main task (here, object tracking) by training it alongside a related task. When the accuracy of the main task is hard to improve, for example because of insufficient data, an auxiliary task can improve the model's generalization.
   - The authors argue that depth estimation can serve as an auxiliary task for object tracking, and that tracking performance can be improved by sharing learnable components between the two tasks.

### Proposed Method

1. **MDETrack Framework**
   - **Unified Feature Extractor**: MDETrack uses a single feature extractor whose output is sent to both the tracking head and the auxiliary depth estimation head.
   - **Auxiliary Depth Estimation Module**: During training, supervised or self-supervised monocular depth estimation learning strengthens the network's understanding of scene depth. At inference, the auxiliary module is discarded, so the inference speed is unchanged.
   - **Lightweight Vision Transformer**: To keep tracking efficient, MDETrack adopts a lightweight vision transformer network.
2. **Data Pre-processing**
   - The pre-processing pipelines for object tracking and depth estimation are unified, and a shared feature extraction module exploits the additional spatial and temporal information.
   - Image padding and cropping are used to keep self-supervised depth estimation effective.
3. **Training Strategy**
   - **Supervised Auxiliary Learning**: A training set with real depth data is used to optimize the depth estimation loss.
   - **Self-supervised Auxiliary Learning**: The depth estimation module is trained through camera pose prediction and image reconstruction, which reduces the dependence on real depth data.

### Experimental Results

1. **Supervised Auxiliary Learning**
   - Experiments on the DepthTrack dataset show that supervised auxiliary learning significantly improves tracking performance.
   - Compared with the baseline, supervised auxiliary learning improves multiple metrics.
2. **Self-supervised Auxiliary Learning**
   - Experiments on the LaSOT, GOT-10K and DepthTrack datasets show that self-supervised auxiliary learning still improves tracking performance without real depth data.
   - Compared with the baseline, self-supervised auxiliary learning also shows significant improvements on multiple metrics.

### Conclusion

MDETrack reduces the dependence on real depth information and improves tracking accuracy by introducing supervised or self-supervised monocular depth estimation learning. The experimental results on multiple datasets indicate that MDETrack generalizes well and has practical application value.
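The supervised and self-supervised training strategies summarized above can be sketched as one auxiliary loss with two modes, added to the tracking loss. This is a minimal sketch under assumptions, not the paper's objective: the mean absolute error, the `aux_weight` value, and the function names `l1`, `auxiliary_depth_loss`, and `total_loss` are all hypothetical. In the self-supervised mode, `reconstructed_frame` is assumed to come from warping a neighbouring frame with the predicted depth and camera pose; that warping step is not implemented here.

```python
from typing import List, Optional

Vector = List[float]


def l1(a: Vector, b: Vector) -> float:
    """Mean absolute error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def auxiliary_depth_loss(
    pred_depth: Vector,
    gt_depth: Optional[Vector] = None,
    target_frame: Optional[Vector] = None,
    reconstructed_frame: Optional[Vector] = None,
) -> float:
    """Supervised loss if real depth is given, otherwise a reconstruction loss."""
    if gt_depth is not None:
        # Supervised mode: compare predicted depth with real depth.
        return l1(pred_depth, gt_depth)
    if target_frame is None or reconstructed_frame is None:
        raise ValueError("self-supervised mode needs target and reconstructed frames")
    # Self-supervised mode: compare the target frame with its reconstruction
    # (assumed to be produced elsewhere from predicted depth and camera pose).
    return l1(target_frame, reconstructed_frame)


def total_loss(track_loss: float, depth_loss: float, aux_weight: float = 0.1) -> float:
    """Tracking loss plus a weighted auxiliary term (the weight is an assumption)."""
    return track_loss + aux_weight * depth_loss


supervised = auxiliary_depth_loss([2.0, 4.0], gt_depth=[1.0, 5.0])
self_supervised = auxiliary_depth_loss(
    [2.0, 4.0], target_frame=[0.5, 0.5], reconstructed_frame=[0.5, 1.0]
)
combined = total_loss(track_loss=2.0, depth_loss=supervised, aux_weight=0.5)
```

The auxiliary term only shapes the shared features during training, so the choice between the two modes depends on whether real depth is available for the training set, not on anything needed at inference.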

Enhanced Object Tracking by Self-Supervised Auxiliary Depth Estimation Learning
