Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for tracking objects of specific categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This raises the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, a segmenter, and a dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation. We detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box regression module of the visual detector, and prompt an open-world segmenter with the refined box to segment the objects. We decide when to terminate an object track based on the objectness score of the propagated boxes and on forward-backward optical flow consistency. We re-identify objects across occlusions using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms the previous state of the art on UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.
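
The flow-based box propagation and the track-termination test described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the median flow inside the box, the nearest-neighbour lookup, and all thresholds (`tol`, `min_objectness`, `max_bad`) are assumptions chosen for illustration. The objectness score is taken as a given number here, standing in for the output of a detector.

```python
from statistics import median


def propagate_box(box, flow):
    """Shift an integer box (x0, y0, x1, y1) by the median flow it covers.

    flow[y][x] is a (dx, dy) displacement to the next frame. Using the
    median is an assumption of this sketch, not a detail from the abstract.
    """
    x0, y0, x1, y1 = box
    vectors = [flow[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not vectors:  # empty box: nothing to measure, leave it unchanged
        return box
    dx = median(v[0] for v in vectors)
    dy = median(v[1] for v in vectors)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def fb_inconsistent_fraction(box, fwd, bwd, tol=1.0):
    """Fraction of box pixels that fail a forward-backward flow check.

    A pixel passes if following the forward flow and then the backward
    flow returns it to within `tol` pixels of where it started.
    """
    x0, y0, x1, y1 = box
    height, width = len(bwd), len(bwd[0])
    bad = total = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            fx, fy = fwd[y][x]
            # Nearest-neighbour lookup of the landing pixel in the next frame.
            tx, ty = round(x + fx), round(y + fy)
            total += 1
            if not (0 <= tx < width and 0 <= ty < height):
                bad += 1  # left the image: treat as inconsistent
                continue
            bx, by = bwd[ty][tx]
            if ((fx + bx) ** 2 + (fy + by) ** 2) ** 0.5 > tol:
                bad += 1
    return bad / total if total else 1.0


def keep_track(objectness, bad_fraction, min_objectness=0.5, max_bad=0.5):
    """Keep a track only if both (hypothetical) thresholds are satisfied."""
    return objectness >= min_objectness and bad_fraction <= max_bad
```

In this sketch a track is dropped as soon as either signal fails, which mirrors the abstract's statement that termination depends on both the objectness of the propagated box and forward-backward flow consistency. How the two signals are actually combined in the paper is not specified in the abstract.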

Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

OVTrack: Open-Vocabulary Multiple Object Tracking

MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving

Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

Opening up Open-World Tracking

VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking

DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes

Monocular Quasi-Dense 3D Object Tracking

3D Multi-object Tracking in Autonomous Driving: A Survey

Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

Tracking Objects with 3D Representation from Videos

VoxelTrack: Exploring Multi-level Voxel Representation for 3D Point Cloud Object Tracking

VoxelTrack: Exploring Voxel Representation for 3D Point Cloud Object Tracking

Know Your Surroundings: Panoramic Multi-Object Tracking by Multimodality Collaboration

RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework

OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

A Tracking-By-Detection Based 3D Multiple Object Tracking for Autonomous Driving