Abstract: Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to localize an arbitrary number of targets based on a language expression and continuously track them in a video. This intricate task involves reasoning on multi-modal data and precise target localization with temporal association. However, prior studies overlook the imbalanced data distribution between newborn targets and existing targets due to the nature of the task. In addition, they only indirectly fuse multi-modal features, struggling to deliver clear guidance on newborn target detection. To solve the above issues, we conduct a collaborative matching strategy to alleviate the impact of the imbalance, boosting the ability to detect newborn targets while maintaining tracking performance. In the encoder, we integrate and enhance the cross-modal and multi-scale fusion, overcoming the bottlenecks in previous work, where limited multi-modal information is shared and interacted between feature maps. In the decoder, we also develop a referring-infused adaptation that provides explicit referring guidance through the query tokens. The experiments showcase the superior performance of our model (+3.42%) compared to prior works, demonstrating the effectiveness of our designs.

What problem does this paper attempt to address?

This paper addresses **two major challenges in the Referring Multi-Object Tracking (RMOT) task**:

1. **Imbalanced data distribution**: In RMOT, the training signal is heavily skewed between newly appearing targets (newborn targets) and existing targets. A new target activates a detection query only once, in the frame where it first appears, whereas an existing target activates its track query in every frame of its lifetime. Detection of new targets is therefore undertrained, which hurts overall model performance.
2. **Inadequate multimodal feature fusion**: Existing methods usually fuse text embeddings with image features only indirectly and lack direct language guidance. Such fusion does not effectively pass semantic information to the decoder, so new target detection receives little support and complex user intentions are hard to interpret.

To solve these problems, the authors propose a new framework with three key components:

1. **Collaborative Query Matching (CQM)**: The matching criteria are relaxed so that detection queries are activated more often. During training, existing targets can be matched not only with track queries but also with detection queries, which increases how often new target detection is trained.
2. **Referring-Infused Query Adaptation (RIQA)**: The language description is fused directly with the queries in the decoder, providing explicit semantic guidance and strengthening the model's reasoning ability. RIQA is implemented in two forms: Pre-Decoder Adaptation and In-Decoder Adaptation.
3. **Cross-Modal Encoder (CME)**: A new encoder promotes information exchange between multimodal and multi-scale features, improving feature fusion and the overall performance of the model.

The authors validate the model on multiple datasets and report a notable improvement in the HOTA metric in particular, which supports the method's advantage on the RMOT task.
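The imbalance that collaborative matching targets can be illustrated with a small counting sketch. This is a toy model under stated assumptions, not the authors' implementation: the `Target` class, the `count_query_supervision` function, and the three example lifetimes are hypothetical, and the relaxed matching is simplified to "each existing target also supervises one detection query in every later frame".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical ground-truth target with an inclusive frame range."""
    target_id: int
    first_frame: int
    last_frame: int


def count_query_supervision(targets, collaborative=False):
    """Count positive assignments received by detection and track queries.

    Baseline (assumed): a target is matched to a detection query only in
    the frame where it first appears, and to its track query afterwards.
    Collaborative (assumed simplification): in later frames the target is
    additionally matched to a detection query.
    """
    det_positives = 0
    track_positives = 0
    for target in targets:
        lifetime = target.last_frame - target.first_frame + 1
        # First appearance: only a detection query can claim the target.
        det_positives += 1
        # Every later frame: the target is carried by its track query.
        track_positives += lifetime - 1
        if collaborative:
            # Relaxed matching: existing targets also supervise detection.
            det_positives += lifetime - 1
    return det_positives, track_positives


# Three made-up targets living for 10, 7, and 3 frames.
targets = [Target(0, 0, 9), Target(1, 3, 9), Target(2, 5, 7)]
baseline = count_query_supervision(targets)
relaxed = count_query_supervision(targets, collaborative=True)
print("baseline (det, track):", baseline)
print("collaborative (det, track):", relaxed)
```

In this toy setting the detection queries receive far fewer positive assignments than the track queries under the baseline rule, and the relaxed rule closes that gap without reducing the supervision of track queries.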

Tell Me What to Track: Infusing Robust Language Guidance for Enhanced Referring Multi-Object Tracking

Split and Connect: A Universal Tracklet Booster for Multi-Object Tracking

Bootstrapping Referring Multi-Object Tracking

Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation

ROMOT: Referring-expression-comprehension open-set multi-object tracking

Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking

TR-MOT: Multi-Object Tracking by Reference

Multi-features Guided Robust Visual Tracking.

Object-Level Pseudo-3D Lifting for Distance-Aware Tracking

MAT: Motion-Aware Multi-Object Tracking

Multi-Granularity Language-Guided Multi-Object Tracking

TLPG-Tracker: Joint Learning of Target Localization and Proposal Generation for Visual Tracking.

MLS-Track: Multilevel Semantic Interaction in RMOT

Rethinking the Competition Between Detection and ReID in Multiobject Tracking

iKUN: Speak to Trackers without Retraining

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking

RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation

Real-time Multi-Object Tracking Based on Bi-directional Matching

LaMOT: Language-Guided Multi-Object Tracking

Addressing Challenges of Incorporating Appearance Cues Into Heuristic Multi-Object Tracker via a Novel Feature Paradigm

Visual Object Tracking With Mutual Affinity Aligned to Human Intuition