Siamese-DETR for Generic Multi-Object Tracking

Qiankun Liu, Yichen Li, Yuqi Jiang, Ying Fu

2024-06-15

Abstract: The ability to detect and track dynamic objects in different scenes is fundamental to real-world applications, e.g., autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to tracking objects that belong to pre-defined closed-set categories. Recently, Open-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) have been proposed to track objects of interest beyond pre-defined categories, given a text prompt and a template image, respectively. However, training OVMOT models requires an expensive, well pre-trained (vision-)language model and fine-grained category annotations. In this paper, we focus on GMOT and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO) are required for training. Existing GMOT methods train a Single Object Tracking (SOT) based detector to detect objects of interest and then apply a data-association-based MOT tracker to obtain the trajectories; in contrast, we leverage the inherent object queries in DETR variants. Specifically: 1) Multi-scale object queries are designed based on the given template image, which are effective for detecting objects of different scales that share the template image's category; 2) A dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, which takes full advantage of the provided annotations; 3) The online tracking pipeline is simplified in a tracking-by-query manner by incorporating the tracked boxes from the previous frame as additional query boxes. The complex data association is replaced with the much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on the GMOT-40 dataset by a large margin.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses limitations of Multi-Object Tracking (MOT) in real-world applications such as autonomous driving and robot navigation. Traditional MOT methods are usually limited to a predefined, closed set of categories (e.g., pedestrians, cars), which restricts their ability to generalize in open environments. To get past this limitation, the paper works in the Generic Multi-Object Tracking (GMOT) setting. GMOT is divided into two categories:

1. **Open-Vocabulary Multi-Object Tracking (OVMOT)**: Utilizes text prompts (e.g., category names) to describe the objects to track.
2. **Template-Image-based Multi-Object Tracking (TIMOT)**: Utilizes template images to describe the objects to track.

Since OVMOT requires an expensive pre-trained (vision-)language model and fine-grained category annotations for training, this paper focuses on TIMOT and proposes a simple yet effective solution, Siamese-DETR.

#### Features of Siamese-DETR:

1. **Multi-Scale Object Queries (MSOQ)**: Generates multi-scale object queries based on the given template image to detect objects of the same category at different scales.
2. **Dynamic Matching Training Strategy (DMTS)**: Trains on commonly used object detection datasets (e.g., COCO) through a dynamic matching training strategy that makes full use of the provided annotations.
3. **Tracking-by-Query (TbQ)**: Uses the tracked boxes from the previous frame as additional query boxes, which simplifies the tracking process and replaces complex data association with Non-Maximum Suppression (NMS).

With these designs, Siamese-DETR outperforms existing MOT methods on the GMOT-40 dataset by a large margin.
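The tracking-by-query idea from the abstract (tracked boxes from the previous frame act as extra query boxes, and NMS stands in for data association) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Prediction` class, the `track_step` function, the thresholds, and the choice to let predictions from tracked queries win ties over predictions from template queries are all assumptions made for illustration. The detector itself is not modelled; the sketch starts from its per-query outputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Prediction:
    """One decoder output (hypothetical structure, not from the paper)."""
    box: Box
    score: float
    # Set when the prediction came from a query built on a previously tracked box.
    track_id: Optional[int] = None


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(preds: List[Prediction], iou_thresh: float) -> List[Prediction]:
    """Greedy NMS. Assumption: tracked-query predictions are visited first,
    then higher scores, so an existing identity survives a duplicate."""
    ordered = sorted(preds, key=lambda p: (p.track_id is None, -p.score))
    kept: List[Prediction] = []
    for p in ordered:
        if all(iou(p.box, k.box) <= iou_thresh for k in kept):
            kept.append(p)
    return kept


def track_step(
    preds: List[Prediction],
    next_id: int,
    score_thresh: float = 0.3,  # assumed value
    iou_thresh: float = 0.5,    # assumed value
) -> Tuple[Dict[int, Box], int]:
    """Turn one frame's predictions into {track_id: box} without explicit association."""
    confident = [p for p in preds if p.score >= score_thresh]
    tracks: Dict[int, Box] = {}
    for p in nms(confident, iou_thresh):
        if p.track_id is None:
            # Surviving template-query detection: start a new track.
            tracks[next_id] = p.box
            next_id += 1
        else:
            # Surviving tracked-query detection: the identity carries over.
            tracks[p.track_id] = p.box
    return tracks, next_id


# Example frame: track 1 is re-detected by its own query, a template query
# fires on the same object, one new object appears, one prediction is weak.
preds = [
    Prediction((10, 10, 50, 50), 0.90, track_id=1),
    Prediction((12, 11, 51, 49), 0.95),
    Prediction((100, 100, 140, 150), 0.80),
    Prediction((200, 200, 220, 220), 0.10),
]
tracks, next_id = track_step(preds, next_id=2)
print(tracks)  # {1: (10, 10, 50, 50), 2: (100, 100, 140, 150)}
```

In this sketch the identity of an object is carried by the query that produced the box, so no matching step between frames is needed; NMS only removes the duplicate that arises when a template query and a tracked query fire on the same object.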
