Video OWL-ViT: Temporally-consistent open-world localization in video

Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf

2023-08-22

Abstract: We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper primarily aims to address the following issues:

1. **How to transfer open-vocabulary capabilities from image models to video for object detection and tracking**: The authors leverage models pre-trained on large-scale image-text datasets to detect and track objects in videos, especially in open-world scenarios where the model must recognize object categories that were not present in the training set.
2. **How to adapt image models to video when task-specific data is scarce**: Because task-specific video data is usually limited, a method is needed that reuses existing image-level pre-trained knowledge and transfers it to video tasks.
3. **Improving temporal consistency**: Objects in a video move over time, so the model must maintain a consistent representation of each object across frames. This requires not only detecting objects but also tracking how their positions change.

To address these challenges, the paper proposes the **Video OWL-ViT** model, an extension of the **OWL-ViT** architecture. It introduces a Transformer decoder that decouples object representations from the image grid, which allows objects to be tracked across frames. The main contributions of Video OWL-ViT include:

- Extending the **OWL-ViT** architecture to video by adding a Transformer decoder that propagates object representations from frame to frame, enabling continuous tracking of objects in video sequences.
- Fine-tuning on limited video data while retaining the model's open-vocabulary detection capabilities.
- Demonstrating strong performance on the challenging TAO-OW benchmark, particularly in detecting and tracking unseen object categories.

The paper also details the model's design, training strategies, and experimental results, which support the claim that the proposed method addresses the problems above.
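
The recurrent propagation described above, where the decoder's output tokens for one frame become the object queries for the next, can be illustrated with a minimal sketch. This is not the authors' implementation: `cross_attend` is a hypothetical, weight-free stand-in for the learned transformer decoder, the frame tokens are random numbers rather than OWL-ViT encoder outputs, and the box and class heads are omitted. It only shows the data flow that ties one query slot to one object track.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Toy single-head cross-attention with a residual connection.

    Assumption: stands in for the learned decoder; it has no weights.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        # Scaled dot-product score between this query and every frame token.
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        # Attention-weighted average of the frame tokens.
        attended = [
            sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)
        ]
        # Residual update keeps information carried over from earlier frames.
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def track_video(frame_tokens, initial_queries):
    """Decode frame by frame, feeding each frame's outputs back in as queries."""
    queries = initial_queries
    per_frame_outputs = []
    for tokens in frame_tokens:
        queries = cross_attend(queries, tokens)  # outputs become next queries
        per_frame_outputs.append(queries)
    return per_frame_outputs


random.seed(0)
num_frames, num_tokens, num_queries, dim = 3, 6, 2, 4
# Random placeholders for per-frame image tokens and initial object queries.
frames = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_frames)
]
init = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

outputs = track_video(frames, init)
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # prints "3 2 4"
```

Because slot `i` of the query list is carried from frame to frame, `outputs[t][i]` for successive `t` can be read as the representation of a single track, which is the intuition behind the temporal consistency described in the abstract.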

Unsupervised Open-Vocabulary Object Localization in Videos

Video Instance Segmentation in an Open-World

Towards Open-Vocabulary Video Instance Segmentation

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

Scaling Open-Vocabulary Object Detection

Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

Object-aware Video-language Pre-training for Retrieval

Towards Open-Vocabulary Video Semantic Segmentation

Open-Vocabulary Temporal Action Localization using Multimodal Guidance

OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

OV-VG: A benchmark for open-vocabulary visual grounding

OpenVIS: Open-vocabulary Video Instance Segmentation

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

DeTAL: Open-Vocabulary Temporal Action Localization with Decoupled Networks

End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

Hyperbolic Learning with Synthetic Captions for Open-World Detection

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Learning Object State Changes in Videos: An Open-World Perspective

OVExp: Open Vocabulary Exploration for Object-Oriented Navigation