Abstract: Visual object tracking, which is primarily based on visible light image sequences, encounters numerous challenges in complicated scenarios, such as low light conditions, high dynamic ranges, and background clutter. To address these challenges, incorporating the advantages of multiple visual modalities is a promising solution for achieving reliable object tracking. However, the existing approaches usually integrate multimodal inputs through adaptive local feature interactions, which cannot leverage the full potential of visual cues, thus resulting in insufficient feature modeling. In this study, we propose a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking. The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from different visual modalities and then uses a unified encoder to align the features across different domains. Moreover, we propose an enhanced transformer-based module to fuse multimodal features using attention mechanisms. With these methods, the MMHT model can effectively construct a multiscale and multidimensional visual feature space and achieve discriminative feature modeling. Extensive experiments demonstrate that the MMHT model exhibits competitive performance in comparison with that of other state-of-the-art methods. Overall, our results highlight the effectiveness of the MMHT model in terms of addressing the challenges faced in visual object tracking tasks.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the challenges of visual object tracking in complex scenarios, particularly under low-light conditions, high dynamic range, and cluttered backgrounds. Existing multimodal fusion methods typically integrate multimodal inputs through adaptive local feature interaction, but this approach fails to fully utilize visual cues, leading to insufficient feature modeling. To overcome these challenges, the authors propose a new Multimodal Hybrid Tracker (MMHT) that uses frame-event data for reliable single-object tracking.

### Specific Problems and Solutions

1. **Object Tracking in Complex Scenarios**:
   - **Problem**: Traditional object tracking based on visible light image sequences performs poorly in complex scenarios such as low-light conditions, high dynamic range, and cluttered backgrounds.
   - **Solution**: Enhance the robustness of feature modeling by introducing multimodal data (e.g., thermal imaging, depth information, and event data).
2. **Effective Utilization of Multimodal Data**:
   - **Problem**: Existing methods typically integrate multimodal inputs through adaptive local feature interaction, but this approach fails to fully utilize visual cues, leading to insufficient feature modeling.
   - **Solution**: Propose a new Multimodal Hybrid Tracker (MMHT) that uses a hybrid backbone network, combining an Artificial Neural Network (ANN) and a Spiking Neural Network (SNN), to extract dominant features from different visual modalities, and employs a unified encoder to align features from different domains. Additionally, an enhanced Transformer-based module is proposed to fuse multimodal features through an attention mechanism.
3. **Challenges in Multimodal Feature Extraction and Fusion**:
   - **Problem**: A hybrid feature extraction network is needed that can effectively explore visual cues in frame-event inputs, along with a new framework for feature alignment and fusion.
   - **Solution**:
     - **Multimodal Feature Extraction (MMFE)**: Combine the advantages of ANN and SNN to construct a multi-scale and multi-dimensional visual representation space.
     - **Transformer-based Feature Fusion (TFF)**: Achieve cross-domain feature alignment and fusion by introducing a cross-attention mechanism.

### Experimental Validation

Extensive experiments on multiple benchmark datasets (FE108, COESOT, and VisEvent) show that the MMHT model performs competitively against other state-of-the-art methods. The results indicate that the MMHT model handles object tracking in complex scenarios effectively.

### Conclusion

This paper addresses the challenges of traditional object tracking in complex scenarios by proposing a new Multimodal Hybrid Tracker (MMHT). By effectively utilizing frame-event data, the model improves the robustness and accuracy of feature modeling.
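
### Illustrative Sketch: Cross-Attention Fusion

The summary says that the fusion module aligns and fuses frame and event features through cross-attention, but it gives no implementation details. The following is a minimal sketch of generic scaled dot-product cross-attention, not the paper's actual module. The function names (`cross_attention`, `fuse_frame_with_events`), the single attention head, the absence of learned projection matrices, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from the other (single head, no learned weights)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value tokens.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


def fuse_frame_with_events(frame_tokens, event_tokens):
    """Hypothetical fusion step: frame tokens attend to event tokens,
    and the attended result is added back to the frame tokens."""
    attended = cross_attention(frame_tokens, event_tokens, event_tokens)
    return [
        [f + a for f, a in zip(frame_tok, att_tok)]
        for frame_tok, att_tok in zip(frame_tokens, attended)
    ]


# Toy inputs: 2 frame tokens and 3 event tokens, each of dimension 2.
frame_tokens = [[1.0, 0.0], [0.0, 1.0]]
event_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_frame_with_events(frame_tokens, event_tokens)
```

The fused output keeps one token per frame token, so it can be passed on in place of the original frame features. A practical module would typically add learned query, key, and value projections, multiple heads, and normalization layers.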

Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion

MM-Tracker: Visual Tracking with A Multi-Task Model Integrating Detection and Differentiating Feature Extraction

Multi-features Guided Robust Visual Tracking.

MATI: Multimodal Adaptive Tracking Integrator for Robust Visual Object Tracking

Visible and Infrared Object Tracking via Convolution-Transformer Network With Joint Multimodal Feature Learning

MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking

CVTrack: Combined Convolutional Neural Network and Vision Transformer Fusion Model for Visual Tracking

Multiple object tracking with appearance feature prediction and similarity fusion

Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking

Multi-object tracking via deep feature fusion and association analysis

Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling

Leveraging temporal-aware fine-grained features for robust multiple object tracking

Robust Multi-Modality Multi-Object Tracking

CMAT: Integrating Convolution Mixer and Self-Attention for Visual Tracking

Multimodal Multiobject Tracking by Fusing Deep Appearance Features and Motion Information

Deep Object Tracking with Multi-Modal Data

Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object Tracking

Sparse Mixed Attention Aggregation Network for Multimodal Images Fusion Tracking

A Dynamic 3D Multi-Object Tracking Method Based on Spatiotemporal Features

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Multi-Object Tracking with Partial-Level Features and Adaptive Threshold Mechanism