Abstract: Multimodal visual object tracking (VOT) is beginning to replace single-modality tracking because it provides more complementary details of the target. In RGB plus depth (RGBD) tracking tasks, prevailing trackers extract RGB and depth features with separate backbones arranged in parallel and then fuse the features for target prediction, treating RGB as the dominant modality and depth as the auxiliary one. These trackers overlook two points: each modality carries a different amount of information, and the contributions of RGB and depth features vary dynamically during tracking. To address these issues, we introduce a unified RGBD tracking network called AMATrack, which extracts features and models the interactions between the RGB and depth images simultaneously. First, the depth image is integrated with the RGB image to form the fused template and the fused search region separately, preserving as much information as possible from both modalities. Next, we introduce an asymmetric mixed attention (AMA) module within the encoder layers. This module uses self-attention to maintain intramodality features and cross-attention to reduce discrepancies between intermodality features. At the same time, it removes correlations with nontarget patches to prevent adverse effects from background areas. To offset the additional computational load during training and inference, we design a target token pruning (TTP) strategy, which filters out the patch tokens least relevant to the target before the final search region is generated, thereby improving computational efficiency. AMATrack attains F-scores of 61.8% and 75.8% on the DepthTrack and CDTB datasets, respectively, while running at 73.01 frames/s (FPS). This performance surpasses that of leading RGBD trackers while meeting real-time requirements. These results demonstrate the effectiveness of the proposed network.
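
The token-pruning idea described in the abstract (dropping the search-region patch tokens least relevant to the target) can be illustrated with a minimal sketch. This is not the authors' TTP implementation: the relevance score (softmax attention from template tokens to search tokens, averaged over template tokens), the function name `prune_search_tokens`, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prune_search_tokens(template_tokens, search_tokens, keep_ratio=0.5):
    """Keep the search tokens most relevant to the template (hypothetical helper).

    Assumption: relevance of a search token is the attention weight it
    receives from each template token, averaged over all template tokens.
    Returns the kept indices (in original order) and the kept tokens.
    """
    dim = len(template_tokens[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor

    relevance = [0.0] * len(search_tokens)
    for t in template_tokens:
        weights = softmax([dot(t, s) * scale for s in search_tokens])
        for i, w in enumerate(weights):
            relevance[i] += w / len(template_tokens)

    # Always keep at least one token; round the kept count up.
    n_keep = max(1, math.ceil(keep_ratio * len(search_tokens)))
    ranked = sorted(range(len(search_tokens)),
                    key=lambda i: relevance[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore spatial order of survivors
    return kept, [search_tokens[i] for i in kept]


if __name__ == "__main__":
    template = [[1.0, 0.0]]
    search = [[5.0, 0.0], [0.0, 5.0], [4.0, 0.0], [-5.0, 0.0]]
    kept, tokens = prune_search_tokens(template, search, keep_ratio=0.5)
    print(kept, tokens)
```

In this sketch, fewer search tokens reach later layers, which is where the efficiency gain would come from; how and where the actual tracker scores and drops tokens is not specified in the abstract.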

Efficient Transformer-based Single-stage RGB-T Tracking with Larger Search Region and Acceleration Mechanism

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

An epidemiological study of nosocomial infections in the patients admitted in the intensive care unit of Urmia Imam Reza Hospital: An etiological investigation

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

RGBT Image Fusion Tracking via Sparse Trifurcate Transformer Aggregation Network

Real-Time RGBT Target Tracking Based on Attention Mechanism

Cross-modulated Attention Transformer for RGBT Tracking

RGB-T Tracking Based on Mixed Attention

RGB-T Tracking with Template-Bridged Search Interaction and Target-Preserved Template Updating

[Changes in axial length after scleral buckling surgery].

Learning a Multimodal Feature Transformer for RGBT Tracking

Siamese transformer RGBT tracking

Tracking With Saliency Region Transformer

Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking

High-Performance Transformer Tracking

Efficient transformer tracking with adaptive attention

Transformer-Based Band Regrouping With Feature Refinement for Hyperspectral Object Tracking

AMATrack: A Unified Network With Asymmetric Multimodal Mixed Attention for RGBD Tracking

Lightweight Transformer Tracker: Compact and Effect Neural Network for Object Tracking with Long-Short Range Attention

QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking