Abstract: Multimodal sensing has proven valuable for visual tracking, as each sensor type offers unique strengths in handling specific challenging scenes where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, its development is hindered by data sparsity: in practice, typically only one modality is available at a time. It is therefore crucial to ensure that knowledge gained from multimodal sensing -- such as identifying relevant features and regions -- is effectively shared, even when certain modalities are unavailable at inference. We start from a simple assumption: similar samples across different modalities have more knowledge to share than dissimilar ones. To implement this, we employ a ``weak'' classifier tasked with distinguishing between modalities. More specifically, if the classifier ``fails'' to accurately identify the modality of a given sample, this signals an opportunity for cross-modal knowledge sharing. Intuitively, knowledge transfer is facilitated whenever a sample from one modality is sufficiently close to, and aligned with, another. Technically, we achieve this by routing samples from one modality to the experts of the others, within a mixture-of-experts framework designed for multimodal video object tracking. During inference, the expert of the respective modality is chosen, which we show to benefit from the multimodal knowledge available during training. Through exhaustive experiments that use only paired RGB-E, RGB-D, and RGB-T data during training, we showcase the benefit of the proposed method for RGB-X trackers during inference, with an average +3\% precision improvement over the current SOTA. Our source code is publicly available at <a class="link-external link-https" href="https://github.com/supertyd/XTrack/tree/main" rel="external noopener nofollow">this https URL</a>.
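
The routing idea in the abstract can be illustrated with a minimal sketch. This is not the released XTrack implementation: the linear classifier, the toy linear experts, the hard argmax routing, and every name below (`MODALITIES`, `route`, `clf_weights`, `experts`) are assumptions made purely for illustration. During training, each sample goes to the expert that a weak modality classifier predicts, so a misclassified sample lands in another modality's expert; during inference, the expert of the sample's own modality is used.

```python
import numpy as np

# Hypothetical labels for the X in RGB-X; the names are illustrative only.
MODALITIES = ("depth", "thermal", "event")


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def route(features, true_modality, clf_weights, experts, training):
    """Pick one expert per sample and apply it.

    features:      (N, D) sample features
    true_modality: (N,) integer index into MODALITIES
    clf_weights:   (D, M) weights of a linear ("weak") modality classifier
    experts:       list of M (D, D) matrices, one toy linear expert per modality
    """
    probs = softmax(features @ clf_weights)
    if training:
        # Follow the classifier: a misclassified sample is sent to the expert
        # of the modality it resembles, which is where sharing can happen.
        chosen = probs.argmax(axis=-1)
    else:
        # At inference, use the expert of the sample's own modality.
        chosen = np.asarray(true_modality)
    out = np.stack([features[i] @ experts[k] for i, k in enumerate(chosen)])
    shared = chosen != true_modality  # True where a sample crossed modalities
    return out, chosen, shared


# Toy usage with random data (assumed shapes, no real tracker features).
rng = np.random.default_rng(0)
N, D, M = 6, 4, len(MODALITIES)
features = rng.normal(size=(N, D))
true_modality = np.array([0, 0, 1, 1, 2, 2])
clf_weights = rng.normal(size=(D, M))
experts = [rng.normal(size=(D, D)) for _ in range(M)]

train_out, train_chosen, train_shared = route(
    features, true_modality, clf_weights, experts, training=True
)
test_out, test_chosen, test_shared = route(
    features, true_modality, clf_weights, experts, training=False
)
```

Hard argmax routing is the simplest choice for a sketch; a soft or top-k gate would be an equally plausible reading of the abstract, which does not specify the gating rule.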

Multi-modal interaction with token division strategy for RGB-T tracking

Multi-modal multi-task feature fusion for RGBT tracking

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Object fusion tracking for RGB-T images via channel swapping and modal mutual attention

RGB-T Tracking with Template-Bridged Search Interaction and Target-Preserved Template Updating

Exploring fusion strategies for accurate RGBT visual object tracking

Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking

RGB-T Tracking Based on Mixed Attention

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling

An epidemiological study of nosocomial infections in the patients admitted in the intensive care unit of Urmia Imam Reza Hospital: An etiological investigation

RGB-T tracking by modality difference reduction and feature re-selection

Multi-features Guided Robust Visual Tracking

Multi-Level Fusion for Robust RGBT Tracking via Enhanced Thermal Representation

MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking

RGBT tracking via cross-modality message passing

MIRNet: A Robust RGBT Tracking Jointly with Multi-Modal Interaction and Refinement

Temporal Adaptive RGBT Tracking with Modality Prompt

QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking

Multi-Stage Fusion for Event-based Multimodal Tracker