Tell Me What to Track: Infusing Robust Language Guidance for Enhanced Referring Multi-Object Tracking

Wenjun Huang, Yang Ni, Hanning Chen, Yirui He, Ian Bryant, Yezi Liu, Mohsen Imani
2024-12-17
Abstract: Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to localize an arbitrary number of targets based on a language expression and continuously track them in a video. This intricate task involves reasoning on multi-modal data and precise target localization with temporal association. However, prior studies overlook the imbalanced data distribution between newborn targets and existing targets due to the nature of the task. In addition, they only indirectly fuse multi-modal features, struggling to deliver clear guidance on newborn target detection. To solve the above issues, we conduct a collaborative matching strategy to alleviate the impact of the imbalance, boosting the ability to detect newborn targets while maintaining tracking performance. In the encoder, we integrate and enhance the cross-modal and multi-scale fusion, overcoming the bottlenecks in previous work, where limited multi-modal information is shared and interacted between feature maps. In the decoder, we also develop a referring-infused adaptation that provides explicit referring guidance through the query tokens. The experiments showcase the superior performance of our model (+3.42%) compared to prior works, demonstrating the effectiveness of our designs.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses **two major challenges in the Referring Multi-Object Tracking (RMOT) task**:

1. **Data distribution imbalance**: In RMOT, the training data is heavily skewed between newly emerging targets (newborn targets) and existing targets. A newborn target activates a detection query only once, in the frame where it first appears, whereas an existing target keeps activating a track query throughout its life cycle. Detection of new targets is therefore undertrained, which hurts model performance.
2. **Inadequate multimodal feature fusion**: Existing methods usually fuse text embeddings with image features only indirectly and lack direct language guidance. This fusion cannot effectively pass semantic information to the decoder, so new target detection receives too little support and complex user intentions are hard to interpret.

To address these problems, the authors propose a new framework with three key components:

1. **Collaborative Query Matching (CQM)**: Relaxing the matching criteria increases how often detection queries are activated. During training, existing targets can be matched not only with track queries but also with detection queries, so new target detection is trained more frequently.
2. **Referring-Infused Query Adaptation (RIQA)**: The language description is fused directly with the queries in the decoder to provide explicit semantic guidance and strengthen the model's reasoning ability. RIQA is implemented in two ways: Pre-Decoder Adaptation and In-Decoder Adaptation.
3. **Cross-Modal Encoder (CME)**: A new encoder promotes information exchange between multimodal and multi-scale features, improving feature fusion and the overall performance of the model.
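The imbalance, and the effect of relaxing the matching rule, can be illustrated with a small counting sketch. This is not the paper's implementation: the function name `count_query_supervision`, the target ids, and the lifespans are hypothetical, and it assumes that under the relaxed rule every existing target contributes exactly one extra detection-query match per frame.

```python
from collections import Counter

def count_query_supervision(lifespans, collaborative=False):
    """Count positive training matches per query type over a clip.

    lifespans maps a (hypothetical) target id to its inclusive
    (first_frame, last_frame) range of referred visibility.
    """
    counts = Counter(detect=0, track=0)
    last_frame = max(end for _, end in lifespans.values())
    for frame in range(last_frame + 1):
        for start, end in lifespans.values():
            if not (start <= frame <= end):
                continue  # target is not present in this frame
            if frame == start:
                # Newborn target: supervised through a detection query.
                counts["detect"] += 1
            else:
                # Existing target: supervised through its track query.
                counts["track"] += 1
                if collaborative:
                    # Assumed relaxed rule: the existing target may also
                    # be matched with a detection query.
                    counts["detect"] += 1
    return counts

# Made-up lifespans for three targets in a 10-frame clip.
lifespans = {"car_1": (0, 9), "car_2": (3, 9), "person_1": (5, 7)}
standard = count_query_supervision(lifespans)
relaxed = count_query_supervision(lifespans, collaborative=True)
print(standard["detect"], standard["track"])  # 3 17
print(relaxed["detect"], relaxed["track"])    # 20 17
```

In this toy clip, detection queries receive 3 positive matches against 17 for track queries under one-time matching. Under the assumed relaxed rule, detection queries receive 20, while track-query supervision is unchanged.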
With these improvements, the authors validate the model on multiple datasets and report a notable gain in the HOTA metric, supporting the method's effectiveness on the RMOT task.
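As a rough illustration of what fusing a language description directly into the queries could look like, the sketch below applies a residual cross-attention step in which each query attends over text token embeddings. The attention form, the absence of learned projections, the function name `infuse_queries`, and the toy vectors are all assumptions for illustration; the summary above does not specify how RIQA is actually computed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def infuse_queries(queries, text_tokens):
    """Add a text-derived context vector to every query (residual form).

    Assumption: plain scaled dot-product attention without learned
    projection matrices, used here only to show the data flow.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    infused = []
    for query in queries:
        # Attention weights of this query over all text tokens.
        weights = softmax([dot(query, token) * scale for token in text_tokens])
        # Weighted sum of text tokens is the language context for the query.
        context = [
            sum(w * token[i] for w, token in zip(weights, text_tokens))
            for i in range(dim)
        ]
        infused.append([q + c for q, c in zip(query, context)])
    return infused

# Toy 2-dimensional embeddings: two queries, two text tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
infused = infuse_queries(queries, text_tokens)
```

Each query is pulled toward the text token it is most similar to, so the language signal reaches the decoder through the query tokens themselves rather than only through fused image features.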