Video OWL-ViT: Temporally-consistent open-world localization in video

Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf
2023-08-22
Abstract: We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper primarily aims to address the following issues:

1. **How to transfer open-vocabulary capabilities from image models to video for object detection and tracking**: The authors aim to use models pre-trained on large-scale image-text datasets for object detection and tracking in videos, especially in open-world scenarios where the model must recognize object categories that were not present in the training set.
2. **How to adapt image models to video when video data is scarce**: Because task-specific video data is usually limited, a method is needed that reuses existing image-level pre-trained knowledge and transfers it efficiently to video tasks.
3. **Improving temporal consistency**: In videos, objects move over time, so the model must maintain a consistent representation of each object across frames. This requires the model to not only detect objects but also track their positional changes.

To address these challenges, the paper proposes the **Video OWL-ViT** model, an extension of the **OWL-ViT** architecture. It adds a Transformer decoder that decouples object representations from the image grid, which allows objects to be tracked across frames. The main contributions of Video OWL-ViT include:

- Extending the **OWL-ViT** architecture to video tasks by adding a Transformer decoder that propagates object representations from frame to frame, enabling continuous tracking of objects in video sequences.
- Fine-tuning on limited video data while retaining the model's open-vocabulary detection capabilities.
- Demonstrating the model's strong performance on the challenging TAO-OW benchmark, particularly in detecting and tracking unseen object categories.

The paper also details the model's design, training strategies, and experimental results, showing that the proposed method can effectively address the problems listed above.
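
The recurrent propagation described above, where the output tokens for one frame become the object queries for the next, can be illustrated with a small toy sketch. This is not the authors' implementation: the real model uses the OWL-ViT image encoder and a trained Transformer decoder, whereas this sketch uses a single weight-free cross-attention step on random vectors. All names here (`cross_attend`, `track_video`, `frames`, `initial_queries`) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frame_tokens):
    """Toy stand-in for a decoder layer (assumption: no learned weights).

    Each query attends over the tokens of one frame with scaled
    dot-product attention and is updated through a residual connection.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, t)) for t in frame_tokens]
        weights = softmax(scores)
        attended = [
            sum(w * t[d] for w, t in zip(weights, frame_tokens))
            for d in range(dim)
        ]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def track_video(frames, initial_queries):
    """Propagate object queries through time.

    Query slot i refers to the same object in every frame, so identity
    is carried by the slot index rather than by matching detections
    after the fact.
    """
    queries = initial_queries
    per_frame_outputs = []
    for frame_tokens in frames:
        outputs = cross_attend(queries, frame_tokens)
        per_frame_outputs.append(outputs)
        # Output tokens of this frame are the object queries for the next.
        queries = outputs
    return per_frame_outputs


# Random stand-ins for encoder tokens and initial object queries.
random.seed(0)
NUM_FRAMES, NUM_TOKENS, NUM_QUERIES, DIM = 3, 6, 2, 4
frames = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
    for _ in range(NUM_FRAMES)
]
initial_queries = [
    [random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)
]
tracks = track_video(frames, initial_queries)
print(len(tracks), len(tracks[0]), len(tracks[0][0]))
```

The point of the sketch is the loop in `track_video`: because the same query slots are carried forward, each slot yields one track across the video. A tracking-by-detection baseline, by contrast, detects objects independently in every frame and has to associate the detections in a separate step.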