MambaEVT: Event Stream based Visual Object Tracking using State Space Model

Xiao Wang, Chao Wang, Shiao Wang, Xixi Wang, Zhicheng Zhao, Lin Zhu, Bo Jiang
2024-08-20
Abstract: Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts the state space model with linear complexity as a backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of search regions will be fed into the tracking head for target localization. More importantly, we consider introducing a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released at https://github.com/Event-AHU/MambaEVT
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to solve two main problems in event-camera-based visual object tracking (VOT):

1. **High computational complexity**:
   - Most current event-stream tracking algorithms adopt vision Transformers (such as ViT). Their self-attention mechanism has a computational complexity of \(O(N^2)\) in the number of tokens \(N\), which makes actual hardware deployment expensive. A more efficient model is therefore needed.
2. **Static target templates**:
   - Existing event-camera tracking algorithms usually adopt the Siamese tracking framework and use a static template to generate activation maps. This approach performs well in simple scenarios, but its performance is unsatisfactory in long-term tracking or when the target object's appearance changes significantly. A dynamic template update strategy is therefore needed to improve tracking robustness.

To address these problems, the paper proposes **MambaEVT**, a new event-camera visual object tracking framework based on the Mamba model. Its main contributions are:

1. **Proposing a state space model (SSM) framework based on the Mamba model**, which performs feature extraction, interaction, and fusion with a computational complexity of \(O(N)\), significantly reducing the computational cost.
2. **Introducing the Memory Mamba network for dynamic template update**. Maintaining a dynamic template library and adjusting the templates as the target's appearance changes improves tracking accuracy and robustness.
3. **Conducting extensive experiments on multiple large-scale event-camera tracking datasets (such as EventVOT, VisEvent, and FE240hz)** to verify the effectiveness and efficiency of the proposed MambaEVT framework.
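To make the \(O(N^2)\) versus \(O(N)\) contrast concrete, here is a minimal sketch of a plain diagonal linear state space recurrence next to the number of pairwise scores full self-attention computes. It is not the selective scan used by Mamba or the MambaEVT implementation. The function names and the fixed parameters `a`, `b`, `c` are assumptions made only for illustration.

```python
from typing import List


def ssm_scan(x: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Run a diagonal linear state space recurrence over a 1-D token sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    Illustrative only: Mamba makes these parameters input-dependent.
    """
    state = [0.0] * len(a)
    outputs = []
    for x_t in x:  # a single pass over the N tokens
        state = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


def attention_pair_count(n_tokens: int) -> int:
    """Query-key scores computed by full self-attention: N * N."""
    return n_tokens * n_tokens


def scan_step_count(n_tokens: int) -> int:
    """Recurrence steps performed by the scan: N."""
    return n_tokens
```

For a fixed state size, the scan touches each token once, so doubling the sequence length doubles the work, whereas the number of attention scores quadruples.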
With these improvements, MambaEVT reduces computational cost while maintaining tracking accuracy, making it suitable for demanding scenarios such as large-scale intelligent surveillance, military applications, and aerospace.
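The dynamic template library can be pictured with a small sketch. Everything in it is a hypothetical illustration rather than the paper's method: the `TemplateLibrary` class, the cosine-similarity diversity test, the oldest-first eviction, and the averaging in `fused_template` (a stand-in for the learned Memory Mamba fusion) are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


class TemplateLibrary:
    """Hypothetical fixed-capacity store of template feature vectors."""

    def __init__(self, static_template: Vector, capacity: int = 4,
                 max_similarity: float = 0.9) -> None:
        self.static_template = static_template  # never evicted
        self.capacity = capacity                # max number of dynamic templates
        self.max_similarity = max_similarity    # assumed diversity threshold
        self.dynamic: List[Vector] = []

    def maybe_add(self, candidate: Vector) -> bool:
        """Store the candidate only if it differs enough from every stored template."""
        stored = [self.static_template] + self.dynamic
        if any(cosine_similarity(candidate, t) > self.max_similarity for t in stored):
            return False  # too similar to an existing template: adds no diversity
        if len(self.dynamic) >= self.capacity:
            self.dynamic.pop(0)  # evict the oldest dynamic template
        self.dynamic.append(candidate)
        return True

    def fused_template(self) -> Vector:
        """Average all templates; a simple stand-in for a learned fusion network."""
        stored = [self.static_template] + self.dynamic
        return [sum(values) / len(stored) for values in zip(*stored)]
```

A tracker built this way would call `maybe_add` with features from a confident recent prediction and use `fused_template()` alongside the static template. The threshold and capacity trade memory against how quickly the library adapts to appearance changes.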