Abstract: Current event-based and frame-event-based trackers are evaluated on short-term tracking datasets; however, real-world scenarios involve long-term tracking, and the performance of existing tracking algorithms in these scenarios remains unclear. In this paper, we first propose a new long-term and large-scale frame-event single object tracking dataset, termed FELT. It contains 742 videos and 1,594,474 RGB frame and event stream pairs, making it the largest frame-event tracking dataset to date. We re-train and evaluate 15 baseline trackers on our dataset for future work to compare against. More importantly, we find that the RGB frames and event streams are naturally incomplete due to the influence of challenging factors and spatially sparse event flow. In response, we propose a novel associative memory Transformer network as a unified backbone by introducing modern Hopfield layers into multi-head self-attention blocks to fuse both RGB and event data. Extensive experiments on RGB-Event (FELT), RGB-Thermal (RGBT234, LasHeR), and RGB-Depth (DepthTrack) datasets validate the effectiveness of our model. The dataset and source code can be found at \url{

What problem does this paper attempt to address?

This paper addresses the fact that the performance of existing algorithms on long-term frame-event visual tracking is poorly understood. Current event-based and frame-event-based trackers are mainly evaluated on short-term tracking datasets, while real-world applications usually require long-term tracking, so how these trackers behave over long sequences is still unclear. In addition, challenging factors (such as low light, fast motion, and severe occlusion) and spatially sparse event streams leave both the RGB frames and the event streams naturally incomplete, which hurts traditional fusion methods in long-term tracking.

To address these challenges, the paper proposes FELT, a large-scale long-term frame-event single-object tracking dataset, and retrains and evaluates 15 baseline trackers on it. The authors also propose a novel Associative Memory Transformer network (AMTTrack), which fuses RGB and event data by introducing a modern Hopfield layer to improve long-term tracking performance.

### Main contributions:

1. **Propose the first long-term frame-event single-object tracking dataset, FELT**: It contains 742 videos and 1,594,474 pairs of RGB frames and event streams, making it the largest frame-event tracking dataset so far.
2. **Propose a new frame-event visual tracking framework, AMTTrack, based on the Associative Memory Transformer**: For the first time, the modern Hopfield layer is introduced into the Transformer to enhance its memory ability, which suits long-term frame-event tracking.
3. **Retrain and evaluate 15 representative trackers**: Extensive baseline results on the FELT SOT dataset provide a comparison benchmark for future research.
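The associative-memory idea behind a modern Hopfield layer can be illustrated with a minimal sketch. It shows only the generic modern Hopfield retrieval rule (a softmax over similarities between a query and stored patterns, followed by a weighted sum of those patterns), not the AMTTrack implementation, whose details are not given here. The function names, the toy patterns, and the inverse temperature `beta` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, patterns, beta=8.0):
    """One modern Hopfield update step (hypothetical helper).

    weights = softmax(beta * <pattern, query>)
    output  = sum_i weights[i] * patterns[i]
    """
    scores = [beta * sum(p * q for p, q in zip(pattern, query))
              for pattern in patterns]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * pattern[d] for w, pattern in zip(weights, patterns))
            for d in range(dim)]


# Toy stored patterns standing in for token features (illustrative values).
stored = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [-1.0, 0.0, 1.0, 0.0],
]

# An incomplete cue: only the first component of the first pattern is known.
partial = [1.0, 0.0, 0.0, 0.0]
retrieved = hopfield_retrieve(partial, stored)
print([round(v, 3) for v in retrieved])
```

With a sufficiently large `beta`, the incomplete cue is pulled toward the closest stored pattern, so the missing last component is filled in from memory. This is the sense in which such a layer can compensate for incomplete inputs.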
### Dataset characteristics:

- **Long-term**: Each video contains at least 1,000 frames and event streams.
- **Large-scale**: Aims to be the largest frame-event tracking dataset.
- **Multiple challenges**: The collected videos reflect the key challenges of frame-event tracking.
- **Dual-modality**: Supports both event-only and frame-event fusion tracking tasks.
- **Incomplete information**: Accounts for the particular nature of multi-modal tracking, in which each modality carries incomplete information.

### Method overview:

- **Input representation**: RGB frames and event point streams are converted into image patches and event voxels for the template and search regions.
- **Associative Memory Transformer**: Modern Hopfield layers combined with multi-head self-attention perform feature extraction, interactive learning, and fusion.
- **Tracking head and loss function**: A fully convolutional network (FCN) localizes the target, and training uses a combined loss made up of focal loss, L1 loss, and GIoU loss.

### Experimental results:

- **Performance on the FELT dataset**: AMTTrack reaches 45.5% success rate (SR) and 57.2% precision rate (PR), outperforming most existing trackers.
- **Generalization to other datasets**: It also performs well on the RGB-Thermal (RGBT234) and RGB-Depth (DepthTrack) datasets, supporting the generality and effectiveness of the method.

In conclusion, the paper improves long-term frame-event visual tracking by proposing a new dataset and tracking framework, and provides resources and baselines for future research.
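The combined training loss mentioned in the method overview (focal loss, L1 loss, and GIoU loss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the focal term is the standard binary focal loss for a single score, boxes use a `(x1, y1, x2, y2)` format with positive area, and the weights `w_focal`, `w_l1`, and `w_giou` are hypothetical values chosen only for the example.

```python
import math


def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Standard binary focal loss for one predicted probability."""
    p_t = prob if target == 1 else 1.0 - prob
    a_t = alpha if target == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def l1_loss(pred_box, gt_box):
    """Mean absolute difference over the four box coordinates."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(pred_box)


def giou_loss(pred_box, gt_box):
    """1 - GIoU for two (x1, y1, x2, y2) boxes with positive area."""
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    # Intersection rectangle (clamped to zero when boxes do not overlap).
    inter_w = max(0.0, min(pred_box[2], gt_box[2]) - max(pred_box[0], gt_box[0]))
    inter_h = max(0.0, min(pred_box[3], gt_box[3]) - max(pred_box[1], gt_box[1]))
    inter = inter_w * inter_h
    union = area_p + area_g - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both inputs.
    hull_w = max(pred_box[2], gt_box[2]) - min(pred_box[0], gt_box[0])
    hull_h = max(pred_box[3], gt_box[3]) - min(pred_box[1], gt_box[1])
    hull = hull_w * hull_h
    giou = iou - (hull - union) / hull
    return 1.0 - giou


def tracking_loss(score, label, pred_box, gt_box,
                  w_focal=1.0, w_l1=5.0, w_giou=2.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_focal * focal_loss(score, label)
            + w_l1 * l1_loss(pred_box, gt_box)
            + w_giou * giou_loss(pred_box, gt_box))


box = (0.0, 0.0, 1.0, 1.0)
shifted = (2.0, 0.0, 3.0, 1.0)
print(giou_loss(box, box), giou_loss(box, shifted))
print(tracking_loss(0.9, 1, box, box))
```

Unlike plain IoU, the GIoU term still gives a useful signal when the predicted and ground-truth boxes do not overlap, because the enclosing-box penalty grows with their separation.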

Long-term Frame-Event Visual Tracking: Benchmark Dataset and Baseline

Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline

RASTMTrack: Robust and Adaptive Space-Time Memory Networks for Visual Tracking

RGB-T Object Tracking: Benchmark and Baseline

VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows

Revisiting Color-Event based Tracking: A Unified Network, Dataset, and Metric

Mamba-FETrack: Frame-Event Tracking via State Space Model

Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking

TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking

Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline

Multi-Stage Fusion for Event-based Multimodal Tracker

Frame-Event Alignment and Fusion Network for High Frame Rate Tracking

DepthTrack: Unveiling the Power of RGBD Tracking

CRSOT: Cross-Resolution Object Tracking using Unaligned Frame and Event Cameras

RGB-T Long-Term Tracking Algorithm Via Local Sampling and Global Proposals

AMATrack: A Unified Network With Asymmetric Multimodal Mixed Attention for RGBD Tracking

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition

BlinkTrack: Feature Tracking over 100 FPS via Events and Images

Distractor-Aware Event-Based Tracking