Abstract: Event camera-based visual tracking has drawn increasing attention in recent years due to its unique imaging principle and its advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting performance bottlenecks because they rely on vision Transformers and a static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts a state space model with linear complexity as its backbone network. The search regions and the target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of the search regions are then fed into the tracking head for target localization. More importantly, we introduce a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released at https://github.com/Event-AHU/MambaEVT

What problem does this paper attempt to address?

This paper aims to solve two main problems in event-camera-based visual object tracking (VOT):

1. **High computational complexity**: Most current event-stream tracking algorithms adopt vision Transformers (such as ViT), whose self-attention mechanism has a computational complexity of \(O(N^2)\) in the number of tokens \(N\). This makes them expensive to deploy on real hardware, so a more efficient model is needed.
2. **Static target templates**: Existing event-camera tracking algorithms usually adopt the Siamese tracking framework and use a static template to generate activation maps. This works well in simple scenarios, but performance degrades in long-term tracking or when the appearance of the target object changes significantly. A dynamic template update strategy is therefore needed to improve tracking robustness.

To address these problems, the paper proposes a new event-camera visual object tracking framework based on the Mamba model, **MambaEVT**. Its main contributions are:

1. **Proposing a state-space model (SSM) framework based on the Mamba model**, which handles feature extraction, interaction, and fusion with a computational complexity of \(O(N)\), reducing the computational cost relative to self-attention.
2. **Introducing the Memory Mamba network for dynamic template update**. By maintaining a dynamic template library and adjusting the templates as the target's appearance changes, the accuracy and robustness of tracking are improved.
3. **Conducting extensive experiments on multiple large-scale event-camera tracking datasets (such as EventVOT, VisEvent, and FE240hz)** to verify the effectiveness and efficiency of the proposed MambaEVT framework.
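The dynamic template library idea can be illustrated with a small sketch. This is not the paper's Memory Mamba network: the class name `TemplateLibrary`, the cosine-similarity diversity test, the `max_similarity` threshold, and the oldest-first eviction are all assumptions chosen for illustration. It only shows one plausible way to keep a fixed static template alongside a bounded, diverse set of dynamic templates.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


class TemplateLibrary:
    """Hypothetical template store: one static template plus a few diverse dynamic ones."""

    def __init__(self, static_template: Vector, capacity: int = 4,
                 max_similarity: float = 0.95) -> None:
        self.static_template = static_template  # never evicted
        self.dynamic: List[Vector] = []         # oldest first
        self.capacity = capacity                # assumed bound on dynamic templates
        self.max_similarity = max_similarity    # assumed diversity threshold

    def maybe_add(self, candidate: Vector) -> bool:
        """Admit the candidate only if it differs enough from every stored template."""
        stored = [self.static_template] + self.dynamic
        if any(cosine(candidate, t) > self.max_similarity for t in stored):
            return False  # too similar to an existing template, adds no diversity
        if len(self.dynamic) >= self.capacity:
            self.dynamic.pop(0)  # evict the oldest dynamic template
        self.dynamic.append(candidate)
        return True

    def templates(self) -> List[Vector]:
        """Static template first, followed by the current dynamic templates."""
        return [self.static_template] + self.dynamic
```

In a real tracker the vectors would be template features produced by the backbone, and the stored templates would be fused by a learned module rather than simply listed.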
Through these improvements, MambaEVT reduces computational cost while maintaining tracking accuracy, which makes it suitable for more challenging scenarios such as large-scale intelligent surveillance, military applications, and aerospace.
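The gap between \(O(N^2)\) self-attention and an \(O(N)\) state space model can be made concrete with a toy example. The sketch below is a simplification under stated assumptions: it uses a scalar state and fixed parameters `a`, `b`, `c`, whereas Mamba uses multi-dimensional states with input-dependent (selective) parameters. The function names are hypothetical and only illustrate how the work scales with the number of tokens.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run a scalar linear state-space recurrence over a token sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    Each token is visited exactly once, so the cost grows linearly with len(xs).
    """
    h = 0.0
    ys: List[float] = []
    for x in xs:
        h = a * h + b * x  # update the hidden state with the current token
        ys.append(c * h)   # read out the output for this position
    return ys


def scan_step_count(n: int) -> int:
    """Number of recurrence updates ssm_scan performs for n tokens."""
    return n


def attention_pair_count(n: int) -> int:
    """Number of query-key pairs that full self-attention scores for n tokens."""
    return n * n


if __name__ == "__main__":
    for n in (64, 256, 1024):
        print(n, scan_step_count(n), attention_pair_count(n))
```

Quadrupling the token count quadruples the scan steps but multiplies the attention pairs by sixteen, which is the scaling difference the summary refers to.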

MambaEVT: Event Stream based Visual Object Tracking using State Space Model
