Abstract: Visual tracking is a challenging task due to unconstrained appearance variations and dynamic surrounding backgrounds, which arise largely from the complex motion of the target object. Therefore, the target's motion, its resulting appearance, and the correlation between the two should be considered jointly to achieve robust tracking performance. In this paper, we propose a deep neural network for visual tracking, the Motion-Appearance Dual (MADual) network, which employs a dual-branch architecture that uses deep two-dimensional (2D) and deep three-dimensional (3D) convolutions to integrate the local and global information of the target object's motion and appearance synchronously. For each frame of a tracking video, the 2D convolutional kernels of the deep 2D branch slide over the frame to extract its global spatial-appearance features. Meanwhile, the 3D convolutional kernels of the deep 3D branch collaboratively extract the appearance and the associated motion features of the visual target from successive frames. By sliding the 3D convolutional kernels along a video sequence, the model learns temporal features from previous frames and thereby generates the local patch-based motion patterns of the target. Sliding the 2D kernels over a frame and the 3D kernels over a frame cube synchronously enables better hierarchical motion-appearance integration and improves tracking performance. To further improve tracking precision, an extra ridge-regression model is trained for the tracking process, based not only on the bounding box given in the first frame but also on its synchro-frame-cube, using our proposed Inverse Temporal Training (ITT) method. Extensive experiments on popular benchmark datasets (OTB2013, OTB50, OTB2015, UAV123, TC128, VOT2015 and VOT2016) demonstrate that the proposed MADual tracker performs favorably against many state-of-the-art methods.
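
To make the distinction between the two kinds of kernels concrete, the following is a minimal sketch, not the MADual network itself. It assumes single-channel inputs, stride 1, and no padding, and the function names `conv2d_valid` and `conv3d_valid` are hypothetical names chosen for illustration. The 2D kernel slides over one frame and sees only spatial appearance, while the 3D kernel slides over a stack of frames (a frame cube) and so also responds to change over time.

```python
def conv2d_valid(frame, kernel):
    """Slide a 2D kernel over one frame (cross-correlation, stride 1, no padding)."""
    h, w = len(frame), len(frame[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(frame[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def conv3d_valid(cube, kernel):
    """Slide a 3D kernel over a frame cube indexed as cube[time][row][col]."""
    t, h, w = len(cube), len(cube[0]), len(cube[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for s in range(t - kt + 1):
        plane = []
        for i in range(h - kh + 1):
            row = []
            for j in range(w - kw + 1):
                row.append(sum(cube[s + c][i + a][j + b] * kernel[c][a][b]
                               for c in range(kt)
                               for a in range(kh)
                               for b in range(kw)))
            plane.append(row)
        out.append(plane)
    return out


# Appearance: a 2x2 spatial kernel applied to a single 3x3 frame.
frame = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
appearance = conv2d_valid(frame, [[1, 0], [0, 1]])

# Motion: a 2x1x1 temporal-difference kernel applied to two 2x2 frames.
# A pixel that lights up in the second frame yields a nonzero response.
cube = [[[0, 0], [0, 0]], [[1, 0], [0, 0]]]
motion = conv3d_valid(cube, [[[-1]], [[1]]])
```

In this toy setting the 2D output depends only on the pixels of one frame, whereas the 3D output is nonzero exactly where the two frames differ, which is the intuition behind using a 3D branch for motion cues alongside a 2D branch for appearance.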

Joint Learning Appearance and Motion Models for Visual Tracking

Track Without Appearance: Learn Box and Tracklet Embedding with Local and Global Motion Patterns for Vehicle Tracking

Real-time visual tracking based on an appearance model and a motion mode

Deep Motion-Appearance Convolutions for Robust Visual Tracking

Robust Visual Tracking Via Collaborative Motion and Appearance Model

Online Learning and Joint Optimization of Combined Spatial-Temporal Models for Robust Visual Tracking

Multi-object Model-Free Tracking with Joint Appearance and Motion Inference

Motion-Driven Tracking via End-to-End Coarse-to-Fine Verifying

End-to-end Visual Object Tracking with Motion Saliency Guidance

Deep Learning of Appearance Models for Online Object Tracking

Robust Visual Tracking Using Multi-Frame Multi-Feature Joint Modeling

Model-Free Tracker for Multiple Objects Using Joint Appearance and Motion Inference

Deep Flow Collaborative Network for Online Visual Tracking

Robust Joint Discriminative Feature Learning for Visual Tracking

Robust Visual Tracking Via Spatio-Temporal Cue Integration

Long-Term Visual Object Tracking Via Continual Learning

Visual Tracking with Long-Short Term Based Correlation Filter

Visual Tracking by Appearance Modeling and Sparse Representation

Robust Visual Tracking Using Information Theoretical Learning

A Flow-Guided Self-Calibration Siamese Network for Visual Tracking

Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking