Beyond MOT: Semantic Multi-Object Tracking

Yunhao Li, Qin Li, Hao Wang, Xue Ma, Jiali Yao, Shaohua Dong, Heng Fan, Libo Zhang
2024-07-29
Abstract: Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos. Yet, knowing merely "where" is insufficient in many crucial applications. In comparison, semantic understanding such as fine-grained behaviors, interactions, and overall summarized captions (i.e., "what") from videos, associated with "where", is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), which aims to estimate object trajectories while also understanding the semantic details of the associated trajectories, including instance captions, instance interactions, and overall video captions, integrating "where" and "what" for tracking. To foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for semantic tracking of humans. BenSMOT provides annotations for the trajectories of targets, along with associated instance captions in natural language, instance interactions, and an overall caption for each video sequence. To the best of our knowledge, BenSMOT is the first publicly available benchmark for SMOT. In addition, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting "where" and "what" for SMOT, opening up a new direction in tracking for video understanding. We will release BenSMOT and SMOTer at https://github.com/Nathan-Li123/SMOTer.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a new problem in Multi-Object Tracking (MOT): Semantic Multi-Object Tracking (SMOT). Traditional MOT predicts only the positional trajectories of targets (i.e., "where"). SMOT extends this task by also recovering the semantic details associated with those trajectories (i.e., "what"). Specifically, SMOT has three goals:

1. **Instance Description**: Using natural language to describe the behavior and actions of each tracked target.
2. **Instance Interaction Recognition**: Identifying the relationships and interactions between targets.
3. **Overall Video Description**: Providing a description of the entire video sequence that summarizes what happens in it.

To advance research in SMOT, the authors propose a large-scale benchmark dataset named BenSMOT, which contains 3,292 video clips and over 150,000 annotated frames. The videos cover various everyday scenes, such as outdoor basketball courts, to support human-centric SMOT research. The paper also proposes a new tracker named SMOTer, a model designed specifically for SMOT and trained end to end, which predicts the trajectories of targets together with the semantic information attached to them. By releasing the BenSMOT dataset and the SMOTer model, the authors hope to push video understanding beyond conventional multi-object tracking.
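To make the three SMOT outputs concrete, the following minimal Python sketch shows one way a per-video result could be organized: trajectories ("where") alongside instance captions, interactions, and a video caption ("what"). Every class, field, and function name here is a hypothetical choice for illustration, as is the assumption that interactions are directed subject-to-object relations. It is not the BenSMOT annotation format or the SMOTer output format, neither of which is specified in the text above, and the captions in the example are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Box:
    """One target's bounding box in one frame (hypothetical x, y, w, h layout)."""
    frame: int
    x: float
    y: float
    w: float
    h: float


@dataclass
class Instance:
    """One tracked target: its trajectory ("where") plus its caption ("what")."""
    track_id: int
    boxes: List[Box] = field(default_factory=list)
    caption: str = ""


@dataclass
class Interaction:
    """A labeled relation between two tracked targets (assumed directed)."""
    subject_id: int
    object_id: int
    label: str


@dataclass
class SemanticTrackingRecord:
    """All SMOT outputs for one video."""
    video_caption: str
    instances: Dict[int, Instance] = field(default_factory=dict)
    interactions: List[Interaction] = field(default_factory=list)

    def add_box(self, track_id: int, box: Box) -> None:
        # Create the instance the first time its id appears, then extend its trajectory.
        inst = self.instances.setdefault(track_id, Instance(track_id))
        inst.boxes.append(box)

    def add_interaction(self, subject_id: int, object_id: int, label: str) -> None:
        # An interaction must refer to targets that are actually being tracked.
        for tid in (subject_id, object_id):
            if tid not in self.instances:
                raise KeyError(f"unknown track id {tid}")
        self.interactions.append(Interaction(subject_id, object_id, label))


def build_example() -> SemanticTrackingRecord:
    # Invented toy content, used only to exercise the structure.
    record = SemanticTrackingRecord(video_caption="Two people play basketball outdoors.")
    record.add_box(1, Box(frame=0, x=10, y=20, w=50, h=120))
    record.add_box(1, Box(frame=1, x=12, y=20, w=50, h=120))
    record.add_box(2, Box(frame=0, x=200, y=25, w=48, h=118))
    record.instances[1].caption = "A person dribbles the ball."
    record.instances[2].caption = "A person defends."
    record.add_interaction(1, 2, "plays with")
    return record
```

Keying instances by track id keeps each caption attached to its trajectory, which mirrors the pairing of "where" and "what" that the task description emphasizes.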