ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

Ali Athar, Xueqing Deng, Liang-Chieh Chen
2024-12-13
Abstract: Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: https://ali2500.github.io/vicas-project/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to combine high-level video understanding tasks (such as video captioning and question answering) with pixel-level video understanding tasks (such as text-prompt-based video instance segmentation). These two research directions have largely developed independently, with different benchmarks and architectures, and there is no unified framework for evaluating a model's ability in both high-level understanding and fine-grained localization. Specifically, the paper identifies the following issues:

1. **Separation between high-level video understanding and pixel-level understanding**: Existing research usually focuses either on high-level tasks (such as video captioning and question answering) or on pixel-level tasks (such as object segmentation), and few works address both. This separation leaves models unable to fully understand video content, especially when visual and language information must be combined.
2. **Lack of comprehensive evaluation benchmarks**: Current benchmark datasets focus either on high-level understanding or on pixel-level localization, and no unified benchmark evaluates performance on both.
3. **Limitations of existing datasets**: Existing datasets provide either detailed text descriptions or pixel-level segmentation masks, but few provide both high-quality text descriptions and pixel-accurate segmentation masks, which limits model training and evaluation.

To address these problems, the paper introduces a new dataset named ViCaS (Video Captioning and Segmentation), which contains thousands of videos with detailed human-written captions and temporally consistent, pixel-accurate segmentation masks. Words and phrases in the captions are grounded to key objects, and each video is annotated with segmentation masks for multiple objects. The paper also proposes a benchmark that evaluates models on both high-level understanding and pixel-level understanding, and introduces an effective end-to-end architecture named Video-LLaVA-Seg to handle these tasks. In this way, the paper aims to bridge the gap between high-level video understanding and pixel-level localization and to promote the development of more comprehensive video understanding models.
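To make the idea of a caption with phrase-grounded, temporally consistent masks more concrete, the following is a minimal sketch of how one such annotation could be represented in memory. The class names, field names, and example caption are hypothetical and chosen only for illustration; they are not the actual ViCaS file format, which is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical schema for illustration only; NOT the actual ViCaS format.
Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


@dataclass
class GroundedPhrase:
    char_span: Tuple[int, int]  # [start, end) character offsets into the caption
    track_ids: List[int]        # object tracks that the phrase refers to


@dataclass
class VideoAnnotation:
    caption: str
    phrases: List[GroundedPhrase] = field(default_factory=list)
    # masks[track_id][frame_index] -> mask; the same track_id across frames
    # is what keeps an object's masks temporally consistent.
    masks: Dict[int, Dict[int, Mask]] = field(default_factory=dict)

    def ground(self, text: str, track_ids: List[int]) -> GroundedPhrase:
        """Link the first occurrence of `text` in the caption to object tracks."""
        start = self.caption.index(text)  # raises ValueError if text is absent
        phrase = GroundedPhrase((start, start + len(text)), list(track_ids))
        self.phrases.append(phrase)
        return phrase

    def phrase_text(self, phrase: GroundedPhrase) -> str:
        start, end = phrase.char_span
        return self.caption[start:end]

    def masks_for_phrase(self, phrase: GroundedPhrase, frame_index: int) -> List[Mask]:
        """Return the masks of every grounded track visible in the given frame."""
        return [
            self.masks[track_id][frame_index]
            for track_id in phrase.track_ids
            if frame_index in self.masks.get(track_id, {})
        ]


# Toy example: one caption, two grounded phrases, 2x2 masks.
ann = VideoAnnotation(caption="A dog chases a red ball across the lawn.")
ann.masks = {
    1: {0: [[1, 0], [0, 0]], 1: [[0, 1], [0, 0]]},  # dog, visible in frames 0 and 1
    2: {0: [[0, 0], [0, 1]]},                       # ball, visible in frame 0 only
}
dog = ann.ground("A dog", [1])
ball = ann.ground("a red ball", [2])
```

A record like this can serve both task families described above: the `caption` string supports holistic captioning evaluation, while each grounded phrase can act as a language prompt whose target is the set of masks returned by `masks_for_phrase`.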