OmniVid: A Generative Framework for Universal Video Understanding

Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang
2024-03-27
Abstract: The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The core problem this paper addresses is the unification and generalization of video understanding tasks. Specifically, the authors aim to handle different types of video understanding tasks (such as action recognition, caption generation, video question answering, and object tracking) within one framework by introducing a shared output space. The method draws on the success of large language models in natural language processing: it uses language labels together with special time and box tokens to recast multiple video tasks as token generation conditioned on video content.

### Main Problem Analysis

1. **Requirements of Diverse Tasks**:
   - Video understanding encompasses multiple subtasks, such as action recognition, video caption generation, video question answering, and visual object tracking. Each task usually depends on its own model architecture and annotation format, which makes a general-purpose, cross-task solution difficult to achieve.
2. **Limitations of Existing Methods**:
   - Most existing video understanding methods are designed for specific tasks. Although they perform well on those tasks, they are not flexible or efficient enough for multi-task settings. They also usually require a customized prediction head for each task, which increases model complexity and training difficulty.
3. **Challenges of the Unified Output Space**:
   - To simplify model design and improve generalization, researchers want different video tasks to share the same output space. A shared output space simplifies model training and supports a broader range of video understanding applications.

### Solution

The authors propose a generative framework named OmniViD, which addresses the above problems in the following ways:

- **Introducing a Shared Vocabulary**: By adding time tokens and box tokens to the language vocabulary, OmniViD can represent the outputs of different video tasks. This extended vocabulary unifies the output formats of the various tasks.
- **Unified Encoder-Decoder Architecture**: OmniViD adopts an encoder-decoder architecture that includes a specialized video encoder and a language encoder for extracting multi-modal features from diverse inputs. In addition, an MQ-former module is introduced to improve the efficiency of video representation.
- **Autoregressive Modeling**: By treating video understanding tasks as language modeling conditioned on video content, OmniViD predicts the output token sequence one token at a time during generation, which lets a single model handle multiple video tasks.

### Experimental Results

In experiments on multiple video benchmark datasets, OmniViD achieves state-of-the-art or competitive results on action recognition, video caption generation, video question answering, dense video caption generation, and visual object tracking. These results indicate that the framework is effective at unifying video understanding tasks.

In conclusion, OmniViD unifies multiple video understanding tasks in one framework by introducing a shared output space and autoregressive modeling, providing a new perspective for more general video understanding.
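
To make the shared-vocabulary idea from the Solution section more concrete, the following is a minimal sketch of how continuous timestamps and box coordinates might be discretized into special tokens and mixed with ordinary words in one target sequence. It is not the authors' implementation: the bin counts (`NUM_TIME_BINS`, `NUM_BOX_BINS`), the token spellings (`<time_k>`, `<box_k>`), and all function names are hypothetical choices made for illustration only.

```python
from typing import List, Tuple

# Assumed, illustrative bin counts; the summary does not state the real values.
NUM_TIME_BINS = 100
NUM_BOX_BINS = 1000


def quantize(value: float, max_value: float, num_bins: int) -> int:
    """Map a continuous value in [0, max_value] to an integer bin index."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    ratio = min(max(value / max_value, 0.0), 1.0)  # clamp to [0, 1]
    return min(int(ratio * num_bins), num_bins - 1)


def time_token(t: float, duration: float) -> str:
    """Hypothetical time token for timestamp t (seconds) in a video."""
    return f"<time_{quantize(t, duration, NUM_TIME_BINS)}>"


def box_tokens(box: Tuple[float, float, float, float],
               width: float, height: float) -> List[str]:
    """Hypothetical box tokens for an (x1, y1, x2, y2) box in pixels."""
    x1, y1, x2, y2 = box
    return [
        f"<box_{quantize(x1, width, NUM_BOX_BINS)}>",
        f"<box_{quantize(y1, height, NUM_BOX_BINS)}>",
        f"<box_{quantize(x2, width, NUM_BOX_BINS)}>",
        f"<box_{quantize(y2, height, NUM_BOX_BINS)}>",
    ]


def dense_caption_target(events: List[Tuple[float, float, str]],
                         duration: float) -> List[str]:
    """Build one token sequence: start/end time tokens, then caption words."""
    tokens: List[str] = []
    for start, end, caption in events:
        tokens += [time_token(start, duration), time_token(end, duration)]
        tokens += caption.split()  # whitespace split stands in for a real tokenizer
    return tokens


if __name__ == "__main__":
    # Dense captioning target for a 10-second clip with one event.
    print(dense_caption_target([(0.0, 5.0, "a dog runs")], duration=10.0))
    # Tracking target for one frame of a 640x480 video.
    print(box_tokens((0, 0, 320, 240), width=640, height=480))
```

Under this scheme an action label, a caption, an event boundary, and a tracked box all become sequences over one vocabulary, which is what allows a single decoder to serve every task without task-specific prediction heads.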