Long-term Frame-Event Visual Tracking: Benchmark Dataset and Baseline

Xiao Wang, Ju Huang, Shiao Wang, Chuanming Tang, Bo Jiang, Yonghong Tian, Jin Tang, Bin Luo
2024-04-03
Abstract: Current event-based and frame-event trackers are evaluated on short-term tracking datasets; however, real-world tracking scenarios involve long-term tracking, and the performance of existing tracking algorithms in these scenarios remains unclear. In this paper, we first propose a new long-term and large-scale frame-event single object tracking dataset, termed FELT. It contains 742 videos and 1,594,474 pairs of RGB frames and event streams, making it the largest frame-event tracking dataset to date. We re-train and evaluate 15 baseline trackers on our dataset for future work to compare against. More importantly, we find that the RGB frames and event streams are naturally incomplete due to the influence of challenging factors and the spatially sparse event flow. In response, we propose a novel associative memory Transformer network as a unified backbone, introducing modern Hopfield layers into multi-head self-attention blocks to fuse both RGB and event data. Extensive experiments on RGB-Event (FELT), RGB-Thermal (RGBT234, LasHeR), and RGB-Depth (DepthTrack) datasets validate the effectiveness of our model. The dataset and source code can be found at \url{
Computer Vision and Pattern Recognition, Artificial Intelligence, Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper addresses the open question of how frame-event visual trackers perform in long-term tracking. Current event-based and frame-event trackers are evaluated mainly on short-term tracking datasets, whereas real-world applications usually require long-term tracking, so their performance in that setting remains unclear. In addition, challenging factors (such as low light, fast motion, and severe occlusion) and spatially sparse event streams leave both RGB frames and event streams naturally incomplete, which degrades traditional fusion methods in long-term tracking.

To address these challenges, the paper proposes FELT, a large-scale long-term frame-event single object tracking dataset, and re-trains and evaluates 15 baseline trackers on it. The authors also propose a novel associative memory Transformer network (AMTTrack), which introduces modern Hopfield layers to fuse RGB and event data and improve long-term tracking performance.

### Main contributions:

1. **The first long-term frame-event single object tracking dataset, FELT**: It contains 742 videos and 1,594,474 pairs of RGB frames and event streams, making it the largest frame-event tracking dataset so far.
2. **A new frame-event visual tracking framework, AMTTrack, based on the associative memory Transformer**: It introduces the modern Hopfield layer into the Transformer for the first time to strengthen its memory ability, which suits long-term frame-event tracking.
3. **Re-training and evaluation of 15 representative trackers**: Extensive baseline results on the FELT SOT dataset give future research a benchmark for comparison.
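
The modern Hopfield layer named in the contributions can be illustrated by its retrieval rule, which maps a query to a softmax-weighted combination of stored patterns. The snippet below is a minimal NumPy sketch of that general rule only; it is not the AMTTrack implementation, and the function name `hopfield_retrieve`, the `beta` value, and the toy patterns are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hopfield_retrieve(queries, patterns, beta=1.0, steps=1):
    """Modern Hopfield update: xi <- softmax(beta * xi @ X^T) @ X.

    queries:  (n, d) array of state/query vectors.
    patterns: (m, d) array of stored patterns X.
    beta:     inverse temperature; larger values retrieve a single pattern
              more sharply (hypothetical default, not from the paper).
    """
    xi = np.asarray(queries, dtype=float)
    patterns = np.asarray(patterns, dtype=float)
    for _ in range(steps):
        weights = softmax(beta * xi @ patterns.T, axis=-1)  # (n, m)
        xi = weights @ patterns                             # (n, d)
    return xi


# Toy example: three stored patterns and a corrupted copy of the first one.
patterns = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([[0.9, 0.2, 0.1, 0.0]])
retrieved = hopfield_retrieve(query, patterns, beta=10.0)
print(np.round(retrieved, 3))
```

With learned linear projections for queries, keys, and values and `beta` set to one over the square root of the feature dimension, a single update step takes the same form as scaled dot-product attention, which is why such layers can be placed inside self-attention blocks.
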
### Dataset characteristics:

- **Long-term**: Each video contains at least 1,000 frames and event streams.
- **Large-scale**: The dataset aims to be the largest frame-event tracking dataset.
- **Multiple challenges**: The collected videos reflect the key challenges of frame-event tracking.
- **Dual-modality**: The dataset supports both event-only tracking and frame-event fusion tracking.
- **Incomplete information**: The dataset accounts for a particular property of multi-modal tracking, namely that the information in each modality is incomplete.

### Method overview:

- **Input representation**: RGB frames and event point streams are converted into image patches and event voxels for the template and search regions.
- **Associative Memory Transformer**: Modern Hopfield layers are combined with multi-head self-attention to perform feature extraction, interactive learning, and fusion.
- **Tracking head and loss function**: A fully convolutional network (FCN) localizes the target, and training uses a combined loss of focal loss, L1 loss, and GIoU loss.

### Experimental results:

- **Performance on the FELT dataset**: AMTTrack reaches 45.5% success rate (SR) and 57.2% precision rate (PR), outperforming most existing trackers.
- **Generalization to other datasets**: It also performs well on the RGB-Thermal (RGBT234) and RGB-Depth (DepthTrack) datasets, supporting the generality and effectiveness of the method.

In conclusion, the paper improves long-term frame-event visual tracking by proposing a new dataset and tracking framework, and it provides resources and baselines for future research.
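
The combined training loss mentioned in the method overview (focal loss, L1 loss, and GIoU loss) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the summary does not give the loss weights or the exact focal-loss variant, so `w_giou`, `w_l1`, `gamma`, `alpha`, and the use of the standard binary focal loss are placeholders rather than the authors' settings.

```python
import math


def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Area of the smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose


def l1_loss(box_a, box_b):
    """Mean absolute difference over the four box coordinates."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b)) / 4.0


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss for one probability p and label y in {0, 1}.

    Used here as a stand-in; the paper's exact variant is not specified
    in this summary.
    """
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def tracking_loss(pred_box, gt_box, cls_probs, cls_labels, w_giou=2.0, w_l1=5.0):
    """Weighted sum of classification and box terms (weights are assumptions)."""
    cls = sum(focal_loss(p, y) for p, y in zip(cls_probs, cls_labels)) / len(cls_probs)
    return cls + w_giou * (1.0 - giou(pred_box, gt_box)) + w_l1 * l1_loss(pred_box, gt_box)


gt = (0.2, 0.2, 0.6, 0.6)
good = tracking_loss((0.2, 0.2, 0.6, 0.6), gt, [0.9, 0.1], [1, 0])
bad = tracking_loss((0.5, 0.5, 0.9, 0.9), gt, [0.9, 0.1], [1, 0])
print(good < bad)
```

The GIoU term still provides a training signal when the predicted and ground-truth boxes do not overlap, because the enclosing-box penalty grows with their separation, while the L1 term penalizes coordinate error directly.
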