Abstract: Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: https://ali2500.github.io/vicas-project/

What problem does this paper attempt to address?

This paper aims to combine high-level video understanding tasks (such as video captioning and question-answering) with pixel-level video understanding tasks (such as text-prompted video instance segmentation). These two research directions have mostly developed independently, with different benchmarks and architectures, and there is no unified framework for evaluating a model's ability in both high-level understanding and fine-grained localization. Specifically, the paper identifies the following issues:

1. **Separation between high-level video understanding and pixel-level understanding**: Existing research usually focuses on either high-level tasks (such as video captioning and question-answering) or pixel-level tasks (such as object segmentation), and few works address both. As a result, models cannot fully understand video content, especially where visual and language information must be combined.
2. **Lack of comprehensive evaluation benchmarks**: Current benchmark datasets focus either on high-level understanding or on pixel-level localization, and no unified benchmark evaluates performance on both.
3. **Limitations of existing datasets**: Existing datasets provide either detailed text descriptions or pixel-level segmentation masks, but few provide both high-quality text descriptions and pixel-accurate segmentation masks, which limits model training and evaluation.

To address these issues, the paper introduces a new dataset named ViCaS (Video Captioning and Segmentation), which contains thousands of videos with detailed human-written captions and temporally consistent, pixel-accurate segmentation masks. The words and phrases in these captions are aligned with key objects, and each video is annotated with segmentation masks for multiple objects. The paper also proposes a new benchmark task that evaluates a model's ability in both high-level and pixel-level understanding, and introduces an effective end-to-end architecture named Video-LLaVA-Seg to handle these tasks. In this way, the paper aims to bridge the gap between high-level video understanding and pixel-level localization and to promote the development of more comprehensive video understanding models.
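To make the idea of captions with phrase grounding more concrete, the sketch below shows one way such an annotation could be represented in Python. It is only an illustration: the class names, field names, and mask layout (`GroundedPhrase`, `VideoAnnotation`, `ground`, nested-list masks) are assumptions made for this example and are not the actual ViCaS file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GroundedPhrase:
    """A caption span tied to one or more object tracks (hypothetical schema)."""
    start: int            # character offset where the phrase begins
    end: int              # character offset one past the phrase's last character
    track_ids: List[int]  # object tracks this phrase refers to


@dataclass
class VideoAnnotation:
    """One video's caption, grounded phrases, and per-object masks (hypothetical)."""
    caption: str
    phrases: List[GroundedPhrase] = field(default_factory=list)
    # masks[track_id][frame_index] -> binary mask as nested lists of 0/1
    masks: Dict[int, Dict[int, List[List[int]]]] = field(default_factory=dict)

    def phrase_text(self, phrase: GroundedPhrase) -> str:
        """Return the caption substring that the phrase covers."""
        return self.caption[phrase.start:phrase.end]

    def frames_for_phrase(self, phrase: GroundedPhrase) -> List[int]:
        """Sorted frame indices in which any grounded object has a mask."""
        frames = set()
        for track_id in phrase.track_ids:
            frames.update(self.masks.get(track_id, {}))
        return sorted(frames)


def ground(caption: str, text: str, track_ids: List[int]) -> GroundedPhrase:
    """Locate `text` in `caption` and link it to the given tracks."""
    start = caption.find(text)
    if start < 0:
        raise ValueError(f"{text!r} not found in caption")
    return GroundedPhrase(start, start + len(text), list(track_ids))


# Toy example: two objects, each with tiny 2x2 masks on a few frames.
caption = "A brown dog chases a red ball across the lawn."
ann = VideoAnnotation(
    caption=caption,
    phrases=[ground(caption, "A brown dog", [1]),
             ground(caption, "a red ball", [2])],
    masks={
        1: {0: [[1, 0], [0, 0]], 1: [[1, 1], [0, 0]]},  # dog on frames 0 and 1
        2: {1: [[0, 0], [0, 1]], 2: [[0, 0], [1, 1]]},  # ball on frames 1 and 2
    },
)

for phrase in ann.phrases:
    print(ann.phrase_text(phrase), "->", ann.frames_for_phrase(phrase))
```

Storing character offsets rather than copies of the phrase text keeps each grounded phrase tied to its exact position in the caption, which matters when the same noun appears more than once.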

ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
