Abstract: In video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable complements to RGB trackers. In practice, most existing RGB trackers learn a single set of parameters that is used across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs (each with modality-specific representations), the scarcity of multi-modal datasets, and the fact that not all modalities are available at all times. In this work, we introduce Un-Track, a Unified Tracker with a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only RGB-X pairs to learn the common latent space. This shared representation binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture. Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while adding only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M), through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific counterparts, validating its effectiveness and practicality. The source code is publicly available at
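To put the efficiency figures quoted in the abstract in relative terms, the short calculation below converts the added GFLOPs and parameters into percentages of the baseline. The numbers are taken directly from the abstract; the function and variable names are illustrative only.

```python
# Figures quoted in the abstract (DepthTrack setting).
BASE_GFLOPS = 21.50     # baseline compute
EXTRA_GFLOPS = 2.14     # compute added by Un-Track
BASE_PARAMS_M = 93.0    # baseline parameters, in millions
EXTRA_PARAMS_M = 6.6    # parameters added by Un-Track, in millions


def relative_overhead(extra: float, base: float) -> float:
    """Return the added cost as a percentage of the baseline cost."""
    return 100.0 * extra / base


flops_overhead = relative_overhead(EXTRA_GFLOPS, BASE_GFLOPS)
params_overhead = relative_overhead(EXTRA_PARAMS_M, BASE_PARAMS_M)

print(f"GFLOPs overhead:    {flops_overhead:.1f}%")
print(f"Parameter overhead: {params_overhead:.1f}%")
```

In other words, the reported gain comes at roughly 10% more compute and about 7% more parameters than the baseline.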

What problem does this paper attempt to address?

This paper addresses multi-modal fusion in video object tracking, especially when auxiliary modalities (such as depth, thermal, or event data) are not always available at the same time. Specifically, the paper focuses on the following points:

1. **Single-model unification across modalities**: Existing RGB trackers usually share a single set of parameters across datasets and applications. Achieving the same unification for multi-modal tracking is harder because of input heterogeneity, the scarcity of multi-modal datasets, and missing modalities. The paper proposes Un-Track, which aims to support tracking with any modality through a single set of parameters.
2. **Modality-specific representations**: Each modality has its own representation, which makes direct fusion difficult. The scarcity of multi-modal datasets and the absence of certain modalities make the problem worse.
3. **Low-rank factorization and reconstruction**: To handle different modalities, Un-Track learns a common latent space through low-rank factorization and reconstruction. It uses only RGB-X pairs to learn this space, which binds all modalities together and keeps the tracker working when certain modalities are missing.
4. **Cross-modal prompts**: To make full use of auxiliary inputs while staying efficient, Un-Track uses cross-modal features as prompts to enhance RGB-X interaction. Specifically, it identifies unreliable feature points and enhances them with multi-modal prompts to improve feature modeling.
5. **Lightweight fine-tuning**: To adapt to scarce downstream multi-modal datasets, Un-Track starts from a Transformer-based RGB tracker and fine-tunes it with LoRA. This improves robustness and performance without significantly increasing the computational cost.

With these components, Un-Track performs strongly on multiple benchmark datasets, surpassing both existing unified multi-modal trackers and modality-specific models. This supports the effectiveness and practicality of Un-Track across modalities.
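The LoRA-style fine-tuning mentioned in the summary can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `LoRALinear`, the layer width of 768, the rank of 4, and the scaling factor `alpha` are all assumptions chosen for illustration. It only shows the general idea of keeping a pretrained weight frozen and learning a small low-rank update beside it.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen pretrained weight, (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, (rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))               # trainable up-projection, zero-initialized
        self.scale = alpha / rank                      # scaling applied to the low-rank path

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (n, d_in): frozen path plus scaled low-rank path.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self) -> int:
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(768, 768))   # hypothetical pretrained projection
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 768))

# Because B starts at zero, the adapted layer initially matches the frozen one.
assert np.allclose(layer(x), x @ W.T)
print(layer.trainable_params(), "trainable vs", W.size, "frozen parameters")
```

Under these assumed sizes, the low-rank factors hold about 1% as many values as the frozen weight, which is why this style of adaptation suits small downstream datasets.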

Single-Model and Any-Modality for Video Object Tracking
