MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking

Simiao Lai, Chang Liu, Jiawen Zhu, Ben Kang, Yang Liu, Dong Wang, Huchuan Lu

2024-08-15

Abstract: Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and are hampered by the intrinsic quadratic complexity of the attention mechanism, which constrains their exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long-sequence modeling capabilities and linear computational complexity, this work proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise a long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict subsequent target states from local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the limited use of spatio-temporal information in existing RGB-T tracking methods. Specifically:

1. **Limitations of Existing Methods**: Transformer-based RGB-T tracking algorithms have made significant progress through their global interaction capability and extensive pre-trained models. However, these methods mainly rely on image-pair appearance matching, and the quadratic complexity of the attention mechanism constrains how much temporal information they can exploit.
2. **Proposed Method**: Inspired by the state space model Mamba, the work proposes a pure Mamba-based framework (MambaVT) that exploits spatio-temporal contextual modeling for robust visible-thermal tracking. The framework combines a long-range cross-frame integration component, which adapts globally to target appearance variations, with short-term historical trajectory prompts, which predict subsequent target states from local temporal location clues. Context is therefore modeled from both a global and a local perspective.
3. **Experimental Results**: Extensive experiments show that MambaVT achieves state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. The authors hope the work can serve as a simple yet strong baseline that stimulates further research in this field.

In summary, the paper targets the weak integration of temporal and spatial information in existing RGB-T tracking algorithms and introduces a Mamba-based framework to improve tracking robustness and overall performance.
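The complexity contrast behind this motivation can be illustrated with a minimal sketch. It is not the MambaVT implementation: it uses a scalar linear state-space recurrence with fixed parameters `a`, `b`, and `c` (Mamba makes its parameters input-dependent), and the function names and token counts are hypothetical, chosen only for illustration. The point is that a recurrence performs one constant-cost update per token, whereas full self-attention scores every pair of tokens.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 0.1, c: float = 1.0) -> List[float]:
    """Run a scalar linear state-space recurrence over a token sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Assumption: a, b, c are fixed scalars for illustration only.
    Each token costs one update, so total work is linear in len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # fold the new token into the running state
        ys.append(c * h)   # read the output from the state
    return ys


def attention_pair_count(num_tokens: int) -> int:
    """Number of pairwise scores computed by full self-attention."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens: int) -> int:
    """Number of state updates performed by the recurrence above."""
    return num_tokens


if __name__ == "__main__":
    # Hypothetical setting: 256 tokens per frame, 2 modalities, 1 to 8 frames.
    for frames in (1, 2, 4, 8):
        n = 256 * 2 * frames
        print(frames, n, attention_pair_count(n), scan_step_count(n))
```

Doubling the number of frames doubles the scan's step count but quadruples the attention pair count, which is why adding more historical frames is much cheaper for a linear-time sequence model.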

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

Mamba-FETrack: Frame-Event Tracking via State Space Model

Exploring Multi-Modal Spatial-Temporal Contexts for High-Performance RGB-T Tracking

MambaEVT: Event Stream based Visual Object Tracking using State Space Model

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

MambaTrack: A Simple Baseline for Multiple Object Tracking with State Space Model

Temporal Adaptive RGBT Tracking with Modality Prompt

Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling

RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba

Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking

RGB-T Tracking with Template-Bridged Search Interaction and Target-Preserved Template Updating

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

TrackingMamba: Visual State Space Model for Object Tracking

MTNet: Learning Modality-aware Representation with Transformer for RGBT Tracking

Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline

VideoMamba: Spatio-Temporal Selective State Space Model

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

RGB-T Tracking Based on Mixed Attention

SiamMGT: robust RGBT tracking via graph attention and reliable modality weight learning

MIRNet: A Robust RGBT Tracking Jointly with Multi-Modal Interaction and Refinement