Abstract: Memory mechanisms have gained popularity in tracking tasks because they can learn long-term-dependent information. However, existing memory modules struggle to provide the tracker with the intrinsic attribute information of the target in complex scenes. In this article, drawing on biological visual memory mechanisms, we propose a novel online tracking method based on an attention-driven memory network, which mines discriminative memory information and enhances the robustness and reliability of the tracker. First, to reinforce the effectiveness of the memory content, we design a novel attention-driven memory network. In this network, the long-term memory module gains property-level memory information by focusing on the state of the target at both the channel and spatial levels. As a complement, we add a short-term memory module to maintain good adaptability when the target undergoes drastic deformation. The attention-driven memory network adaptively adjusts the contribution of short-term and long-term memories to the tracking results under the weighted gradient harmonized loss. On this basis, to avoid degradation of model performance, an online memory updater (MU) is further proposed. It is designed to mine target information from the tracking results through the Mixer layer and the online head network together. By evaluating the confidence of the tracking results, the memory updater can accurately judge when to update the model, which guarantees the effectiveness of online memory updates. Finally, the proposed method has been extensively validated on several benchmark datasets, including object tracking benchmark-50/100 (OTB-50/100), temple color-128 (TC-128), unmanned aerial vehicles-123 (UAV-123), generic object tracking-10k (GOT-10k), visual object tracking-2016 (VOT-2016), and VOT-2018, and performs favorably against several advanced methods.
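
The two ideas in the abstract that lend themselves to a small illustration are the confidence-gated memory update and the weighted combination of long-term and short-term memory responses. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class `MemoryUpdater`, the function `fuse_scores`, the threshold, the capacity, and the fixed fusion weight are all hypothetical. In the article the contributions are adjusted adaptively under the weighted gradient harmonized loss, and confidence is evaluated by the Mixer layer and the online head network, neither of which is modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryUpdater:
    """Hypothetical confidence-gated updater (illustrative names and defaults)."""
    threshold: float = 0.6   # assumed confidence cutoff, not taken from the article
    capacity: int = 5        # assumed maximum number of stored target features
    memory: List[List[float]] = field(default_factory=list)

    def maybe_update(self, feature: List[float], confidence: float) -> bool:
        """Store the feature only when the tracking result is confident enough."""
        if confidence < self.threshold:
            return False                 # unreliable result: leave memory untouched
        self.memory.append(list(feature))
        if len(self.memory) > self.capacity:
            self.memory.pop(0)           # discard the oldest stored feature
        return True


def fuse_scores(long_score: float, short_score: float, weight: float) -> float:
    """Convex combination of long-term and short-term responses.

    `weight` is a fixed value in [0, 1] here; this is an assumption made
    for illustration, standing in for an adaptively adjusted contribution.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * long_score + (1.0 - weight) * short_score
```

In this sketch, a low-confidence frame (for example during occlusion) is skipped so that it cannot contaminate the stored target information, while a confident frame replaces the oldest entry once the memory is full.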

Modeling Human Memory in Multi-Object Tracking with Transformers

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

TransLink: Transformer-Based Embedding for Tracklets’ Global Link

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

RASTMTrack: Robust and Adaptive Space-Time Memory Networks for Visual Tracking

STMT: Spatio-temporal memory transformer for multi-object tracking

Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment

MOTR: End-to-End Multiple-Object Tracking with Transformer

Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking

Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

TrackFormer: Multi-Object Tracking with Transformers

MAVOT: Memory-Augmented Video Object Tracking

TransCenter: Transformers With Dense Representations for Multiple-Object Tracking

Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

Multi-Object Tracking and Segmentation with a Space-Time Memory Network

Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation

Attention-Driven Memory Network for Online Visual Tracking

TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking