Abstract: Intelligently tracking objects with varied shapes, colors, lighting conditions, and backgrounds is extremely useful in many human-computer interaction (HCI) applications, such as human body motion capture, hand gesture recognition, and virtual reality (VR) games. However, accurately tracking different objects in uncontrolled environments is challenging because of potentially dynamic object parts, varied lighting conditions, and complex backgrounds. In this work, we propose a novel semantically-aware object tracking framework, whose key component is a weakly-supervised learning paradigm that optimally transfers video-level semantic tags into individual video regions. More specifically, given a set of training video clips, each associated with multiple video-level semantic tags, we first propose a weakly-supervised learning algorithm to transfer the semantic tags into the video regions. Its core is a manifold embedding algorithm based on multiple instance learning (MIL) (Zhong et al., 2020) [1] that maps all video regions into a semantic space in which the video-level semantic tags are well encoded. Afterward, we represent each video region by its semantic feature combined with its appearance feature, and we design a multi-view learning algorithm to optimally fuse these two types of features. Based on the fused feature, we learn a probabilistic Gaussian mixture model to predict the target probability of each candidate window; the window with the maximal probability is output as the tracking result. Comprehensive comparative results on a challenging pedestrian tracking task and on human hand gesture recognition demonstrate the effectiveness of our method. Moreover, visualized tracking results show that non-rigid objects with moderate occlusions can be well localized by our method.
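
The final scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted-concatenation `fuse` function stands in for the multi-view fusion (whose details the abstract does not give), the mixture uses diagonal covariances, and all names, parameters, and numbers below are hypothetical. The sketch only shows how a Gaussian mixture model can score fused features of candidate windows and how the highest-scoring window is selected.

```python
import math


def fuse(semantic, appearance, w_sem=0.5):
    """Hypothetical fusion: weighted concatenation of the two feature views.

    This is an assumption for illustration, not the multi-view learning
    algorithm proposed in the paper.
    """
    return [w_sem * s for s in semantic] + [(1.0 - w_sem) * a for a in appearance]


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under the mixture, via a stable log-sum-exp."""
    logs = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def select_window(candidates, weights, means, variances):
    """Return the index of the best-scoring candidate window and all scores."""
    scores = [gmm_log_likelihood(f, weights, means, variances) for f in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy mixture with two components over 4-dimensional fused features.
weights = [0.6, 0.4]
means = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
variances = [[0.1] * 4, [0.1] * 4]

# Three candidate windows, each with a 2-D semantic and a 2-D appearance feature.
candidates = [
    fuse([2.0, 2.0], [0.0, 0.0]),  # lands on the mean of the first component
    fuse([6.0, 6.0], [6.0, 6.0]),  # far from both components
    fuse([1.0, 1.0], [1.0, 1.0]),  # halfway between the two components
]

best, scores = select_window(candidates, weights, means, variances)
print("tracking result: window", best)
```

In this toy setup the first candidate coincides with a mixture mean, so it receives the highest score. In practice the mixture parameters would be learned from fused training features rather than set by hand.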

Crop-Transform-Paste: Self-Supervised Learning for Visual Tracking

Self-Supervised Tracking via Target-Aware Data Synthesis

Exploiting Temporal Coherence for Self-Supervised Visual Tracking by Using Vision Transformer

Consistency-based Self-Supervised Visual Tracking by Using Query-Communication Transformer

Empirical Study of Unsupervised Pre-Training in CNN and Transformer Based Visual Tracking

Online Object Tracking Based on CNN with Spatial-Temporal Saliency Guided Sampling

Multi-features Guided Robust Visual Tracking

Learning a Visual Tracker from a Single Movie Without Annotation

Unsupervised Deep Tracking

SslTransT: Self-supervised pre-training visual object tracking with Transformers

Unsupervised Deep Representation Learning for Real-Time Tracking

CMAT: Integrating Convolution Mixer and Self-Attention for Visual Tracking

Long-Term Visual Object Tracking Via Continual Learning

Multiple Instance Deep Learning for Weakly-Supervised Visual Object Tracking

Learning to Track Objects from Unlabeled Videos

Self-paced Model Learning for Robust Visual Tracking

Unsupervised Learning of Accurate Siamese Tracking

Self-supervised Discriminative Model Prediction for Visual Tracking

Online Unsupervised Feature Learning for Visual Tracking

Object Tracking by Transitive Learning Using Perspective Transformation

Robust Visual Tracking Method via Deep Learning