AMATrack: A Unified Network With Asymmetric Multimodal Mixed Attention for RGBD Tracking
Ping Ye, Gang Xiao, Jun Liu
DOI: https://doi.org/10.1109/tim.2024.3421435
IF: 5.6
2024-08-09
IEEE Transactions on Instrumentation and Measurement
Abstract: Multimodal visual object tracking (VOT) is beginning to replace single-modality tracking because it provides more complementary details about the target. In RGB plus depth (RGBD) tracking tasks, prevailing trackers extract RGB and depth features with separate backbones in parallel structures and then fuse the features for target prediction, treating RGB as the dominant modality and depth as the auxiliary one. These trackers overlook the distinct information-carrying capacity of each modality and the fact that RGB and depth features contribute dynamically and variably to the tracking process. To address these shortcomings, we introduce a unified RGBD tracking network called AMATrack, which simultaneously extracts features and models interactions between the individual RGB and depth images. First, the depth image is integrated with the RGB image to form the fused template and the fused search region separately, preserving as much information as possible from both modalities. Next, we introduce an asymmetric mixed attention (AMA) module within the encoder layers. This module uses self-attention to maintain intramodality features and cross-attention to minimize discrepancies between intermodality features. At the same time, it removes correlations with nontarget patches to prevent adverse effects from background areas. To offset the additional computational load during training and inference, we design a customized target token pruning (TTP) technique. This strategy filters out the patch tokens least relevant to the target before generating the final search region, thereby improving computational efficiency. AMATrack attains F-scores of 61.8% and 75.8% on the DepthTrack and CDTB datasets, respectively, while running at 73.01 frames/s (FPS). This performance surpasses that of leading RGBD trackers while meeting real-time demands. These outcomes demonstrate the efficacy and superiority of the proposed network.
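The token pruning idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how TTP scores or selects tokens, so everything below is an assumption made for illustration: relevance is taken to be the mean attention weight that template tokens assign to each search token, and a fixed `keep_ratio` decides how many tokens survive. The names `token_relevance`, `prune_search_tokens`, and `keep_ratio` are hypothetical and do not come from the paper.

```python
import math
from typing import List


def token_relevance(attn: List[List[float]]) -> List[float]:
    """Average attention each search token receives from the template tokens.

    attn[t][s] is the (assumed) attention weight from template token t
    to search token s.
    """
    if not attn:
        return []
    n_template = len(attn)
    n_search = len(attn[0])
    return [
        sum(attn[t][s] for t in range(n_template)) / n_template
        for s in range(n_search)
    ]


def prune_search_tokens(relevance: List[float], keep_ratio: float = 0.7) -> List[int]:
    """Return indices of the search tokens to keep, in their original order.

    Tokens with the lowest relevance are dropped. keep_ratio is an
    illustrative hyperparameter, not a value reported in the paper.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(relevance) * keep_ratio))
    # Rank token indices from most to least relevant.
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    # Restore spatial order so later layers see tokens in a consistent layout.
    return sorted(ranked[:n_keep])


# Two template tokens attending to four search tokens (toy numbers).
attn = [
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.5, 0.2, 0.1],
]
kept = prune_search_tokens(token_relevance(attn), keep_ratio=0.5)
print(kept)  # [1, 2]
```

In a real tracker the same selection would typically be applied to batched tensors inside the encoder, and pruned positions would need to be restored (for example by zero-padding) before the prediction head; those details are outside what the abstract states.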
engineering, electrical & electronic; instruments & instrumentation