Abstract: Convolutional Neural Networks (CNNs) and Transformers are two powerful representation learning techniques for visual tracking. Although CNNs can effectively reduce local redundancy via small-neighborhood convolution operations, their limited receptive fields make it difficult to capture global dependencies. Self-attention in Transformers takes patches as the input representation and can effectively capture long-range dependencies. However, blind similarity comparisons between all patches can lead to high redundancy. Is there a technique that combines the advantages of both paradigms for visual tracking? In this work, we design a novel backbone network for feature extraction. First, we use Depthwise Convolution and Pointwise Convolution to build a Convolution Mixer, which effectively separates the spatial mixing of information from the channel-wise mixing. The Convolution Mixer reduces redundancy in spatial and channel features while enlarging the receptive field. Then, to exploit the global modeling ability of self-attention, we construct a module that aggregates the Convolution Mixer and self-attention. In the first stage, the Convolution Mixer and self-attention share the dominant computation, whose complexity is quadratic in the channel size. In the second stage, only lightweight shift and summation operations are needed. Finally, to alleviate overfitting of the backbone network during training, a dropout layer is added at the end of the module to improve the generalization ability of the model. The resulting backbone provides stronger image features for the subsequent feature fusion and prediction. The proposed tracker (named CMAT) achieves satisfying tracking performance on ten challenging datasets. In particular, CMAT achieves a 64.1% AUC on LaSOT and a 68.9% AUC on UAV123 while running at 23 frames per second (FPS).
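
To make the separation of spatial mixing from channel-wise mixing concrete, the following is a minimal pure-Python sketch of a depthwise convolution followed by a pointwise (1x1) convolution. It is an illustration under stated assumptions, not the CMAT implementation: the function names (`depthwise_conv`, `pointwise_conv`, `conv_mixer`, `param_counts`), the zero padding, and the omission of normalization, activation, and residual connections are all choices made here for brevity, since the abstract does not specify them.

```python
# Illustrative sketch only (hypothetical names, not the authors' code).
# Feature maps are nested lists shaped [channels][height][width].

def depthwise_conv(x, kernels):
    """Spatial mixing: one k x k kernel per channel, zero padding, same output size."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the map contribute zero (zero padding).
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += x[c][ii][jj] * kernels[c][di][dj]
                out[c][i][j] = acc
    return out


def pointwise_conv(x, weights):
    """Channel mixing: weights[o][c] combines input channel c into output channel o."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [sum(weights[o][c] * x[c][i][j] for c in range(channels))
             for j in range(width)]
            for i in range(height)
        ]
        for o in range(len(weights))
    ]


def conv_mixer(x, kernels, weights):
    """Depthwise (spatial) step followed by pointwise (channel) step."""
    return pointwise_conv(depthwise_conv(x, kernels), weights)


def param_counts(c_in, c_out, k):
    """Weight counts (no biases): standard k x k conv vs. depthwise + pointwise."""
    standard = c_in * c_out * k * k
    separable = c_in * k * k + c_in * c_out
    return standard, separable
```

For example, with 64 input channels, 128 output channels, and a 3x3 kernel, `param_counts(64, 128, 3)` gives 73,728 weights for a standard convolution against 8,768 for the depthwise plus pointwise pair, which is one way to read the abstract's claim about reduced redundancy in spatial and channel features.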

Exploiting Weak Mask Representation with Convolutional Neural Networks for Accurate Object Tracking

Large Margin Object Tracking with Circulant Feature Maps

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

Continuity-Discrimination Convolutional Neural Network for Visual Object Tracking

A Robust Tracking with Low-Dimensional Target-Specific Feature Extraction

Tracking Randomly Moving Objects on Edge Box Proposals

High Performance Visual Object Tracking with Unified Convolutional Networks

DeepTrack: Learning Discriminative Feature Representations Online for Robust Visual Tracking

RASTMTrack: Robust and Adaptive Space-Time Memory Networks for Visual Tracking

UCT: Learning Unified Convolutional Networks for Real-time Visual Tracking

Toward Accurate Pixelwise Object Tracking via Attention Retrieval

Robust and Accurate Object Tracking under Various Types of Occlusions

Online Video Tracking Using Collaborative Convolutional Networks

Robust Visual Tracking via Convolutional Networks

CMAT: Integrating Convolution Mixer and Self-Attention for Visual Tracking

Fast and Accurate Online Video Object Segmentation Via Tracking Parts

Exploiting multi-scale hierarchical feature representation for visual tracking

Target-aware transformer tracking with hard occlusion instance generation

Occlusion-Aware Real-Time Object Tracking

Robust Visual Tracking Method via Deep Learning

Dynamically Modulated Mask Sparse Tracking