Abstract: Joint detection and tracking, which solves two fundamental vision problems in a unified manner, is a challenging topic in computer vision. In this area, proper use of the spatial-temporal information in videos can help reduce local defects and improve the quality of feature representations. Although modeling low-level (usually pixel-wise) spatial-temporal information has been studied, instance-level spatial-temporal correlations (i.e., relations between the semantic regions in which instances occur) have not been fully exploited. Modeling instance-level correlations is a more flexible and reasonable way to enhance feature representations. However, we found that conventional instance-level relation learning, which works for detection or tracking as separate tasks, is not effective in the joint task, where a variety of scenarios may arise. To address this problem, we exploited instance-level spatial-temporal semantic information for joint detection and tracking through a joint relation learning pipeline with a novel relation learning mechanism called Similarity- and Quality-Guided Attention (SQGA). Specifically, we added task-specific SQGA relation modules before the corresponding task prediction heads to refine each instance's feature representation using the features of reference instances in neighboring frames; these features are aggregated on the basis of relational affinities. In SQGA, the relational affinities were factorized into similarity and quality terms so that fine-grained supervision rules could be applied. We then added a task-specific attention loss for each SQGA relation module, resulting in better feature aggregation for the corresponding task. Quantitative experiments on several challenging multi-object tracking benchmarks showed that our approach was more effective than the baselines and provided competitive results compared with recent state-of-the-art methods.
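
The aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract states only that relational affinities are factorized into a similarity term and a quality term and that reference features are aggregated by those affinities; it does not give the formulas. Everything concrete below is therefore an assumption made for illustration: the function name `sqga_refine`, the scaled dot-product similarity, the sigmoid quality score, combining the two terms by a renormalized product, and the residual update. It is not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sqga_refine(query, references, quality_logits):
    """Refine one instance feature with reference instance features.

    query          -- feature vector of an instance in the current frame
    references     -- feature vectors of instances in neighboring frames
    quality_logits -- one scalar per reference (hypothetical input; the
                      abstract does not say how quality is predicted)
    Returns (refined_feature, affinities).
    """
    scale = math.sqrt(len(query))

    # Similarity term (assumed form): scaled dot product, softmax over references.
    similarity = softmax([dot(query, ref) / scale for ref in references])

    # Quality term (assumed form): sigmoid of a per-reference logit, in (0, 1).
    quality = [1.0 / (1.0 + math.exp(-q)) for q in quality_logits]

    # Affinity (assumed combination): product of both terms, renormalized to sum to 1.
    raw = [s * q for s, q in zip(similarity, quality)]
    total = sum(raw)
    affinities = [r / total for r in raw]

    # Aggregate reference features weighted by affinity.
    aggregated = [
        sum(a * ref[d] for a, ref in zip(affinities, references))
        for d in range(len(query))
    ]

    # Residual update (assumed): add the aggregated context to the query feature.
    refined = [x + y for x, y in zip(query, aggregated)]
    return refined, affinities


if __name__ == "__main__":
    query = [1.0, 0.0]
    references = [[1.0, 0.0], [0.0, 1.0]]
    refined, affinities = sqga_refine(query, references, [0.0, 0.0])
    print(refined, affinities)
```

Keeping the two factors separate is what would allow each to receive its own supervision signal, as the abstract suggests; in this sketch a low quality logit suppresses a reference even when its similarity to the query is high.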

Similarity- and Quality-Guided Relation Learning for Joint Detection and Tracking

Supplementary Material: Quasi-Dense Similarity Learning for Multiple Object Tracking

Visual Tracking Based on Semantic and Similarity Learning

Joint Identification-Verification Model For Visual Tracking

Joint Spatio-Temporal Similarity and Discrimination Learning for Visual Tracking

TLPG-Tracker: Joint Learning of Target Localization and Proposal Generation for Visual Tracking

SiamSGA: Siamese Symmetric Graph Attention Tracking

Relation Learning Reasoning Meets Tiny Object Tracking in Satellite Videos

Quasi-Dense Similarity Learning for Multiple Object Tracking

Learning Spatially Regularized Similarity for Robust Visual Tracking

Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking

Improving Video Concept Detection Using Spatio-Temporal Correlation

Visual Object Tracking Via Non-Local Correlation Attention Learning

Joint Feature Correspondences and Appearance Similarity for Robust Visual Object Tracking

Learning Fine-Grained Similarity Matching Networks for Visual Tracking

Learning Bi-Grained Cross-Correlation Siamese Networks for Visual Tracking

Video Visual Relation Detection Via Multi-modal Feature Fusion

3-D Relation Network for Visual Relation Recognition in Videos

Attention Guided Relation Detection Approach for Video Visual Relation Detection

Relation Understanding in Videos

Online Learning and Joint Optimization of Combined Spatial-Temporal Models for Robust Visual Tracking