MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking

Simiao Lai, Chang Liu, Jiawen Zhu, Ben Kang, Yang Liu, Dong Wang, Huchuan Lu
2024-08-15
Abstract: Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and face the challenge of the intrinsically high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long-sequence modeling capabilities and linear computational complexity, this work innovatively proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict the subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses RGB-T (visible-thermal) tracking, and in particular the limited use that existing trackers make of spatio-temporal information. Specifically:

1. **Limitations of existing methods**: Current Transformer-based RGB-T trackers have made significant progress thanks to global interaction and extensive pre-trained models. However, they mainly rely on image-pair appearance matching, and the quadratic complexity of the attention mechanism limits how much temporal information they can exploit.
2. **Proposed method**: Inspired by the state space model Mamba, which offers long-sequence modeling at linear computational complexity, the paper proposes a pure Mamba-based framework (MambaVT) for robust visible-thermal tracking. The framework combines a long-range cross-frame integration component, which adapts globally to target appearance variations, with short-term historical trajectory prompts, which predict subsequent target states from local temporal location clues. Context is thus modeled from both a global and a local perspective.
3. **Experimental results**: Extensive experiments show that MambaVT achieves state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. The authors hope the work can serve as a simple yet strong baseline that stimulates further research in this field.

In summary, the paper tackles the weak integration of temporal and spatial information in existing RGB-T tracking algorithms by building the tracker on Mamba, with the aim of improving tracking robustness and overall performance.
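The complexity argument above can be made concrete with a small sketch. It is not the MambaVT implementation: it uses a scalar, time-invariant state space recurrence (Mamba itself is selective and multi-dimensional), and the names `ssm_scan`, `tokens_for_frames`, `attention_pair_count`, and `scan_step_count`, as well as the figure of 256 tokens per frame, are assumptions chosen purely for illustration. It only shows why adding frames grows attention cost quadratically but scan cost linearly.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state space recurrence (illustrative only).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    One pass over the sequence, so the cost is linear in its length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def tokens_for_frames(num_frames, tokens_per_frame, modalities=2):
    """Sequence length when every frame contributes tokens from each modality.

    modalities=2 stands for the visible and thermal streams; the token
    count per frame is a hypothetical value, not taken from the paper.
    """
    return num_frames * tokens_per_frame * modalities


def attention_pair_count(num_tokens):
    """Full self-attention scores every token against every token."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens):
    """A recurrent scan visits each token once."""
    return num_tokens


if __name__ == "__main__":
    for frames in (2, 4, 8):
        n = tokens_for_frames(frames, tokens_per_frame=256)
        print(frames, n, attention_pair_count(n), scan_step_count(n))
```

Doubling the number of frames doubles the scan steps but quadruples the attention pairs, which is the motivation the paper gives for using a linear-complexity model when incorporating more temporal context.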