Abstract: Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects referred to by a language expression in a video and to maintain their identities. This intricate task involves reasoning over linguistic and visual modalities, along with the temporal association of target objects. However, the seminal work employs only loose feature fusion and overlooks long-term information on tracked objects. In this study, we introduce a compact Transformer-based method, termed TenRMOT. We conduct feature fusion at both the encoding and decoding stages to fully exploit the advantages of the Transformer architecture. Specifically, we incrementally perform cross-modal fusion layer by layer during the encoding phase. In the decoding phase, we utilize language-guided queries to probe memory features for accurate prediction of the desired objects. Moreover, we introduce a query update module that explicitly leverages temporal prior information of the tracked objects to enhance the consistency of their trajectories. In addition, we introduce a novel task called Referring Multi-Object Tracking and Segmentation (RMOTS) and construct a new dataset named Ref-KITTI Segmentation. Our dataset consists of 18 videos with 818 expressions, and each expression averages 10.7 masks, which poses a greater challenge than the typical single mask in most existing referring video segmentation datasets. TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.
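
The two mechanisms named in the abstract, layer-by-layer cross-modal fusion and a temporal query update, can be illustrated with a minimal sketch. This is not the TenRMOT implementation: the function names (`cross_attend`, `fuse_layer_by_layer`, `update_query`), the absence of learned projections, and the momentum-style update rule are all assumptions chosen to keep the example small and dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(visual, text):
    """Each visual token attends over all text tokens (scaled dot-product),
    and the attended text context is added back as a residual.
    Assumption: no learned query/key/value projections, for brevity."""
    dim = len(text[0])
    fused = []
    for v in visual:
        scores = [sum(a * b for a, b in zip(v, t)) / math.sqrt(dim) for t in text]
        weights = softmax(scores)
        context = [sum(w * t[i] for w, t in zip(weights, text)) for i in range(dim)]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused

def fuse_layer_by_layer(visual, text, num_layers=2):
    """Apply cross-modal fusion once per (hypothetical) encoder layer,
    so language information accumulates in the visual tokens."""
    for _ in range(num_layers):
        visual = cross_attend(visual, text)
    return visual

def update_query(prev_query, new_embedding, momentum=0.8):
    """Hypothetical temporal query update: blend the track query carried
    over from earlier frames with the current frame's embedding."""
    return [momentum * p + (1 - momentum) * n
            for p, n in zip(prev_query, new_embedding)]

if __name__ == "__main__":
    visual_tokens = [[1.0, 0.0], [0.0, 1.0]]   # toy visual features
    text_tokens = [[1.0, 0.0]]                 # toy language feature
    print(fuse_layer_by_layer(visual_tokens, text_tokens))
    print(update_query([1.0, 0.0], [0.0, 1.0], momentum=0.5))
```

In a real model each layer would use learned projections and multi-head attention, and the query update would likely be a learned module rather than a fixed momentum blend; the sketch only shows where language features enter the visual stream and how a track query can carry information across frames.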

MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models

LIMT: Language-Informed Multi-Task Visual World Models

12-in-1: Multi-Task Vision and Language Representation Learning

Proactive Human-Robot Interaction using Visuo-Lingual Transformers

MResTNet: A Multi-Resolution Transformer Framework with CNN Extensions for Semantic Segmentation

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Multi-Modal Fusion in Contact-Rich Precise Tasks via Hierarchical Policy Learning

A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasks

Spatial-Language Attention Policies for Efficient Robot Learning

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation

See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Human-oriented Representation Learning for Robotic Manipulation

Visual-Tactile Multimodality for Following Deformable Linear Objects Using Reinforcement Learning

VIRT: Vision Instructed Transformer for Robotic Manipulation

M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

Complementary Multi-Modal Sensor Fusion for Resilient Robot Pose Estimation in Subterranean Environments

ResFormer: Scaling ViTs with Multi-Resolution Training