Abstract: Convolutional Neural Networks (CNNs) and Transformers are two powerful representation learning techniques for visual tracking. Although CNNs effectively reduce local redundancy through small-neighborhood convolution operations, their limited receptive fields make it difficult to capture global dependencies. Self-attention in Transformers takes patches as the input representation and can effectively capture long-range dependencies. However, blind similarity comparisons between all patches can lead to high redundancy. Is there a technique that combines the advantages of both paradigms for visual tracking? In this work, we design a novel backbone network for feature extraction. First, we use Depthwise Convolution and Pointwise Convolution to build a Convolution Mixer, which effectively separates spatial mixing from channel-wise mixing of information. The Convolution Mixer reduces redundancy in spatial and channel features while enlarging the receptive field. Then, to exploit the global modeling ability of self-attention, we construct a module that aggregates the Convolution Mixer and self-attention. In the first stage of this module, the dominant computation, whose complexity is quadratic in the channel size, is shared between the Convolution Mixer and self-attention. In the second stage, the shift and summation operations are lightweight. Finally, to alleviate overfitting of the backbone network during training, a dropout layer is added at the end of the module to improve the generalization ability of the model. The backbone thus provides stronger image features for subsequent feature fusion and prediction. The proposed tracker (named CMAT) achieves satisfactory tracking performance on ten challenging datasets. In particular, CMAT achieves a 64.1% AUC on LaSOT and a 68.9% AUC on UAV123 while running at 23 frames per second (FPS).
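The separation of spatial mixing from channel-wise mixing described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the CMAT implementation: the function names (`depthwise_conv`, `pointwise_conv`, `param_counts`), the zero padding, the odd kernel size, and the absence of activation, normalization, and dropout are all assumptions made for illustration.

```python
# Illustrative sketch only; names and design details are assumptions,
# not taken from the paper.

def depthwise_conv(x, kernels):
    """Spatial mixing: each channel is filtered by its own k x k kernel.

    x: list of C channels, each an H x W list of lists.
    kernels: list of C kernels, each k x k with k odd.
    Zero padding keeps the H x W size. As is usual in deep learning,
    the kernel is applied without flipping (cross-correlation).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w, k = len(channel), len(channel[0]), len(kernel)
        r = k // 2
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - r, j + dj - r
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[di][dj] * channel[ii][jj]
                result[i][j] = acc
        out.append(result)
    return out


def pointwise_conv(x, weights):
    """Channel mixing: a 1 x 1 convolution, i.e. a per-pixel linear map.

    weights: C_out x C_in matrix (list of rows).
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


def param_counts(c_in, c_out, k):
    """Weight counts (no bias): standard k x k conv vs. the separable pair."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable
```

Under these assumptions, `param_counts(64, 64, 3)` returns `(36864, 4672)`. The remaining `c_in * c_out` pointwise term grows with the square of the channel width, which is at least consistent with the abstract's remark that the dominant cost is quadratic in the channel size.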

Leveraging Local and Global Cues for Visual Tracking Via Parallel Interaction Network

Transformer Union Convolution Network for Visual Object Tracking

Bidirectional Interaction of CNN and Transformer Feature for Visual Tracking

CVTrack: Combined Convolutional Neural Network and Vision Transformer Fusion Model for Visual Tracking

Online Object Tracking Based on CNN with Spatial-Temporal Saliency Guided Sampling

ACSiamRPN: Adaptive Context Sampling for Visual Object Tracking

A Location-Aware Siamese Network for High-Speed Visual Tracking

Global-local feature-mixed network with template update for visual tracking

A Robust Attention-Enhanced Network with Transformer for Visual Tracking

LGTrack: Exploiting Local and Global Properties for Robust Visual Tracking

Local to Global Tracker: A Siamese Network for Long-term Tracking

SiamDAG: Siamese Dynamic Receptive Field and Global Context Modeling Network for Visual Tracking

Exploiting spatial relationships for visual tracking

Adaptive Decision-Level Fusion and Complementary Mining for Visual Object Tracking with Deeper Networks

Visual tracking using transformer with a combination of convolution and attention

CTT: CNN Meets Transformer for Tracking

CMAT: Integrating Convolution Mixer and Self-Attention for Visual Tracking

Nocal-Siam: Refining Visual Features and Response With Advanced Non-Local Blocks for Real-Time Siamese Tracking

Region-based High-resolution Siamese Network for Robust Visual Tracking

CRTrack: Learning Correlation-Refine network for visual object tracking

ACTrack: Visual Tracking with K-est Attention and LG Convolution