Abstract: Joint detection and tracking, which addresses two fundamental vision tasks in a unified manner, is a challenging topic in computer vision. In this area, the proper use of spatial-temporal information in videos can help reduce local defects and improve the quality of feature representations. Although modeling low-level (usually pixel-wise) spatial-temporal information has been studied, instance-level spatial-temporal correlations (i.e., relations between the semantic regions in which instances occur) have not been fully exploited. Modeling instance-level correlations is a more flexible and reasonable way to enhance feature representations. However, we found that conventional instance-level relation learning, which works for detection or tracking as separate tasks, is not effective in the joint task, where a variety of scenarios may arise. To address this problem, we exploited instance-level spatial-temporal semantic information for joint detection and tracking through a joint relation learning pipeline with a novel relation learning mechanism called Similarity- and Quality-Guided Attention (SQGA). Specifically, we added task-specific SQGA relation modules before the corresponding task prediction heads; each module refines an instance's feature representation using the features of reference instances in neighboring frames, aggregated according to relational affinities. In SQGA, relational affinities were factorized into similarity and quality terms so that fine-grained supervision rules could be applied. We then added a task-specific attention loss for each SQGA relation module, which resulted in better feature aggregation for the corresponding task. Quantitative experiments on several challenging multi-object tracking benchmarks showed that our approach was more effective than the baselines and provided competitive results compared with recent state-of-the-art methods.
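
The aggregation step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `cosine` and `sqga_aggregate`, the use of a softmax over cosine similarities as the similarity term, a per-reference score in [0, 1] as the quality term, their product as the relational affinity, and a residual update of the target feature. The task-specific attention losses are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def sqga_aggregate(target, refs, ref_quality, temperature=1.0):
    """Refine one target instance feature with reference instance features.

    target: feature vector of an instance in the current frame
    refs: feature vectors of reference instances in neighboring frames
    ref_quality: assumed per-reference quality scores in [0, 1]
    Returns (refined_feature, affinities).
    """
    # Similarity term (assumed): softmax over cosine similarities.
    logits = [cosine(target, r) / temperature for r in refs]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    similarity = [e / total for e in exps]

    # Affinity (assumed): similarity term times quality term, renormalized.
    raw = [s * q for s, q in zip(similarity, ref_quality)]
    norm = sum(raw) + 1e-8
    affinity = [a / norm for a in raw]

    # Aggregate reference features by affinity and add them to the target.
    refined = list(target)
    for weight, ref in zip(affinity, refs):
        for i, value in enumerate(ref):
            refined[i] += weight * value
    return refined, affinity


if __name__ == "__main__":
    target = [1.0, 0.0]
    refs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    # The third reference is similar to the target but has zero quality.
    refined, affinity = sqga_aggregate(target, refs, [1.0, 1.0, 0.0])
    print(refined, affinity)
```

Factorizing the affinity this way means a reference that looks similar to the target but has low quality (for example, a heavily occluded instance) contributes little to the refined feature, which is the intuition the abstract suggests for separating the two terms.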

Improving Video Concept Detection Using Spatio-Temporal Correlation

Transductive multi-distance learning for video search

A Novel Semantic Model for Video Concept Detection

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Two-stream Collaborative Learning with Spatial-Temporal Attention for Video Classification

Video Semantic Concept Detection Based on Conceptual Correlation and Boosting

A Two-View Concept Correlation Based Video Annotation Refinement

Correlative Multilabel Video Annotation with Temporal Kernels

Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

Temporal-Spatial refinements for video concept fusion

Structure-sensitive manifold ranking for video concept detection

Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection

Improving bag-of-visual-words model with spatial-temporal correlation for video retrieval

Mining Spatial-Temporal Similarity for Visual Tracking

Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context

Correlative Linear Neighborhood Propagation for Video Annotation

Similarity- and Quality-Guided Relation Learning for Joint Detection and Tracking

Video-based Salient Object Detection Via Spatio-Temporal Difference and Coherence

Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation