MambaLCT: Boosting Tracking via Long-term Context State Space Model

Xiaohai Li, Bineng Zhong, Qihua Liang, Guorong Li, Zhiyi Mo, Shuxiang Song
2024-12-18
Abstract: Effectively constructing context information with long-term dependencies from video sequences is crucial for object tracking. However, the context length constructed by existing work is limited, only considering object information from adjacent frames or video clips, leading to insufficient utilization of contextual information. To address this issue, we propose MambaLCT, which constructs and utilizes target variation cues from the first frame to the current frame for robust tracking. First, a novel unidirectional Context Mamba module is designed to scan frame features along the temporal dimension, gathering target change cues throughout the entire sequence. Specifically, target-related information in frame features is compressed into a hidden state space through a selective scanning mechanism. The target information across the entire video is continuously aggregated into target variation cues. Next, we inject the target change cues into the attention mechanism, providing temporal information for modeling the relationship between the template and search frames. The advantage of MambaLCT is its ability to continuously extend the length of the context, capturing complete target change cues, which enhances the stability and robustness of the tracker. Extensive experiments show that long-term context information enhances the model's ability to perceive targets in complex scenarios. MambaLCT achieves new SOTA performance on six benchmarks while maintaining real-time running speeds.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to effectively construct and use long-term contextual information in video sequences to improve the stability and robustness of object tracking**.

### Problem Background

In visual tracking, the appearance and motion state of the target may change over long, complex sequences. Traditional tracking algorithms usually rely only on the initial frame to locate the target in subsequent frames, so they lack an understanding of how the target changes over the entire sequence, which reduces the tracker's stability and robustness. Specifically:

- **Short-term contextual information**: Existing methods mainly construct contextual information through inter-frame propagation or from frames within a fixed-length window. They capture target changes only in adjacent frames or within a short time window and cannot fully use historical information.
- **Computational complexity**: Although the Transformer performs well at learning appearance features, its quadratic computational complexity limits the context length it can handle.

### Solution

To solve these problems, the authors propose the **MambaLCT** model, which constructs and uses long-term contextual information in the following ways:

1. **Context Mamba module**: A unidirectional Context Mamba module scans frame features along the time dimension and collects target change cues across the entire sequence. Specifically, target-related information is compressed into the hidden state space and continuously aggregated through a selective scanning mechanism.
2. **Attention mechanism injection**: The target change cues are injected into the attention mechanism to provide temporal information for modeling the relationship between the template frame and the search frame.
3. **Long-term contextual information expansion**: MambaLCT can continuously extend the context length and capture complete target change cues, which enhances the stability and robustness of the tracker.

### Main Contributions

- **Effectively capture long-term behavior**: MambaLCT captures the long-term behavior and overall motion of the target, providing more comprehensive contextual information.
- **Construct long-term context with low resource consumption**: The Context Mamba module builds long-term contextual information at low resource cost.
- **SOTA performance**: The method achieves new state-of-the-art performance on six visual tracking benchmarks (LaSOT, LaSOT_ext, GOT-10k, TrackingNet, TNL2K, and UAV123) while maintaining real-time running speed.

### Formula Representation

The key formulas involved in the paper are:

1. **State Space Model (SSM) mapping process**:

   \[ h'(t) = A h(t) + B x(t), \]
   \[ y(t) = C h(t), \]

   where \( x(t) \) is the input, \( y(t) \) is the output, \( h(t) \) is the hidden state, \( A \) is the evolution parameter, and \( B \) and \( C \) are projection parameters.

2. **Mapping process after discretization**:

   \[ h_t = \bar{A} h_{t-1} + \bar{B} x_t, \]
   \[ y_t = C h_t, \]

   where \( \bar{A} \) and \( \bar{B} \) are the discretized counterparts of \( A \) and \( B \).

3. **Loss function**:

   \[ L = L_{\text{cls}} + \lambda_1 L_1 + \lambda_2 L_{\text{GIoU}}, \]

   where \( \lambda_1 \) and \( \lambda_2 \) are manually set loss weights.

Through these improvements, MambaLCT can better perceive the target in complex scenarios and improve the accuracy and stability of tracking.
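The discretized state-space recurrence listed under Formula Representation can be illustrated with a small scan. This is a minimal sketch and not the paper's Context Mamba module: it assumes a scalar input per step, a diagonal state matrix, and fixed parameters, whereas a selective scan makes the parameters depend on the input. The function name `ssm_scan` and its arguments are hypothetical, chosen only for this example.

```python
from typing import List, Sequence, Tuple


def ssm_scan(
    a_bar: Sequence[float],
    b_bar: Sequence[float],
    c: Sequence[float],
    xs: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t and y_t = c . h_t over xs.

    a_bar is treated as a diagonal matrix (one decay factor per state entry).
    Parameters are fixed here, i.e. this is NOT the input-dependent
    (selective) variant. Returns the outputs and the final hidden state.
    """
    if not (len(a_bar) == len(b_bar) == len(c)):
        raise ValueError("a_bar, b_bar and c must have the same length")
    h = [0.0] * len(a_bar)  # initial hidden state h_0 = 0
    ys: List[float] = []
    for x in xs:  # single left-to-right (unidirectional) pass
        h = [a * h_i + b * x for a, h_i, b in zip(a_bar, h, b_bar)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


if __name__ == "__main__":
    # Feed one impulse followed by zeros and watch the state decay.
    outputs, final_state = ssm_scan(
        a_bar=[0.5, 0.9], b_bar=[1.0, 1.0], c=[1.0, 1.0], xs=[1.0, 0.0, 0.0]
    )
    print(outputs, final_state)
```

The hidden state has a fixed size regardless of how many steps are scanned, which is why a recurrence of this form can keep extending the context without the quadratic cost of attention over all past frames.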
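The weighted loss listed under Formula Representation can also be worked through for a single predicted box. The sketch below makes several assumptions: boxes are `(x1, y1, x2, y2)` with positive area, the L1 term is the mean absolute coordinate difference, the classification loss is passed in as an already computed number, and the weights used in the demo are placeholders rather than the paper's settings. All function names are hypothetical.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), assumed x1 < x2 and y1 < y2


def giou_loss(pred: Box, target: Box) -> float:
    """Return 1 - GIoU for two axis-aligned boxes with positive area."""
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    # Intersection width/height are clamped at zero for disjoint boxes.
    iw = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    ih = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = iw * ih
    union = area_p + area_t - inter
    # Area of the smallest box enclosing both inputs.
    enclose = (max(pred[2], target[2]) - min(pred[0], target[0])) * (
        max(pred[3], target[3]) - min(pred[1], target[1])
    )
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def l1_loss(pred: Box, target: Box) -> float:
    """Mean absolute difference over the four box coordinates (an assumption)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(
    cls_loss: float, pred: Box, target: Box, lambda_1: float, lambda_2: float
) -> float:
    """L = L_cls + lambda_1 * L_1 + lambda_2 * L_GIoU."""
    return cls_loss + lambda_1 * l1_loss(pred, target) + lambda_2 * giou_loss(pred, target)


if __name__ == "__main__":
    # Placeholder weights and classification loss, for illustration only.
    print(total_loss(0.5, (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), 5.0, 2.0))
```

The GIoU term stays informative even when the boxes do not overlap, since the enclosing-box penalty still grows with their separation, while the L1 term penalizes coordinate error directly.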