Beyond MOT: Semantic Multi-Object Tracking

Yunhao Li, Qin Li, Hao Wang, Xue Ma, Jiali Yao, Shaohua Dong, Heng Fan, Libo Zhang

2024-07-29

Abstract: Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos. Yet, knowing merely "where" is insufficient in many crucial applications. In comparison, semantic understanding such as fine-grained behaviors, interactions, and overall summarized captions (i.e., "what") from videos, associated with "where", is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), that aims to estimate object trajectories and meanwhile understand semantic details of associated trajectories including instance captions, instance interactions, and overall video captions, integrating "where" and "what" for tracking. In order to foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for semantic tracking of humans. BenSMOT provides annotations for the trajectories of targets, along with associated instance captions in natural language, instance interactions, and overall caption for each video sequence. To our best knowledge, BenSMOT is the first publicly available benchmark for SMOT. Besides, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting "where" and "what" for SMOT, opening up a new direction in tracking for video understanding. We will release BenSMOT and SMOTer at https://github.com/Nathan-Li123/SMOTer.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a new problem in Multi-Object Tracking (MOT): Semantic Multi-Object Tracking (SMOT). Traditional MOT only predicts the positional trajectories of targets (i.e., "where"). SMOT extends this task by also requiring an understanding of the semantic details associated with those trajectories (i.e., "what"). Specifically, the goals of SMOT include the following aspects:

1. **Instance Description**: Using natural language to describe the behavior and actions of each tracked target.
2. **Instance Interaction Recognition**: Identifying the relationships and interactions between targets.
3. **Overall Video Description**: Providing a description of the entire video sequence that summarizes what happens in it.

To advance research in SMOT, the authors propose a large-scale benchmark dataset named BenSMOT, which contains 3,292 video clips with 151K annotated frames. These videos cover various everyday scenes, such as outdoor basketball courts, to support human-centric SMOT research.

The paper also proposes a new tracker named SMOTer, a model designed specifically for SMOT and trained end-to-end. SMOTer predicts the trajectories of targets and also the semantic information attached to those trajectories. By releasing the BenSMOT dataset and the SMOTer model, the authors hope to advance video understanding research beyond conventional multi-object tracking.
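The three semantic outputs listed above, together with the trajectories, can be pictured as one record per video sequence. The following is a minimal sketch of such a record, not BenSMOT's actual annotation format: the class names, field names, box layout, and the example captions and interaction label are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One tracked target: its boxes ("where") and its instance caption ("what")."""
    track_id: int
    # frame index -> (x, y, width, height); this box layout is an assumption
    boxes: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)
    caption: str = ""


@dataclass
class Interaction:
    """A directed relation between two tracked targets (hypothetical schema)."""
    subject_id: int
    object_id: int
    label: str


@dataclass
class SemanticSequence:
    """All SMOT outputs for one video sequence."""
    video_caption: str
    trajectories: dict[int, Trajectory] = field(default_factory=dict)
    interactions: list[Interaction] = field(default_factory=list)

    def add_box(self, track_id, frame, box):
        # Create the trajectory on first sight of this track id, then store the box.
        traj = self.trajectories.setdefault(track_id, Trajectory(track_id))
        traj.boxes[frame] = box

    def interactions_of(self, track_id):
        # Every interaction in which the target appears as subject or object.
        return [i for i in self.interactions
                if track_id in (i.subject_id, i.object_id)]


# Illustrative values only; none of these come from the dataset.
seq = SemanticSequence(video_caption="Two people play basketball on an outdoor court.")
seq.add_box(1, 0, (10.0, 20.0, 50.0, 120.0))
seq.add_box(1, 1, (12.0, 20.0, 50.0, 120.0))
seq.add_box(2, 0, (200.0, 25.0, 48.0, 118.0))
seq.trajectories[1].caption = "A person dribbles the ball."
seq.interactions.append(Interaction(1, 2, "pass ball to"))
```

Keying both boxes and captions by track identity reflects the point that in SMOT the "what" is attached to a specific "where", while the video caption sits at the sequence level.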

SMOT: Single-Shot Multi Object Tracking

LaMOT: Language-Guided Multi-Object Tracking

MAT: Motion-Aware Multi-Object Tracking

Tracking Small and Fast Moving Objects: A Benchmark

Awesome Multi-modal Object Tracking

Tracking Every Thing in the Wild.

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking

MOTS: Multi-Object Tracking and Segmentation

SOT for MOT

[Significance of cardiovascular research within the scope of the total development of medical sciences in East Germany].

Exploring the State-of-the-Art in Multi-Object Tracking: A Comprehensive Survey, Evaluation, Challenges, and Future Directions

SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes

Towards Real-Time Multi-Object Tracking

STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking

PointTrack++ for Effective Online Multi-Object Tracking and Segmentation

Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking

TOPIC: A Parallel Association Paradigm for Multi-Object Tracking under Complex Motions and Diverse Scenes

Towards Generalizable Multi-Object Tracking

Z-GMOT: Zero-shot Generic Multiple Object Tracking

Comprehensive molecular analysis demonstrates type V collagen mutations in over 90% of patients with classic EDS and allows to refine diagnostic criteria