ViLLa: Video Reasoning Segmentation with Large Language Model

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao

2024-07-29

Abstract: Although video perception models have made remarkable advancements in recent years, they still heavily rely on explicit text descriptions or pre-defined categories to identify target instances before executing video perception tasks. These models, however, fail to proactively comprehend and reason about the user's intentions via textual input. Even though previous works attempt to incorporate reasoning into image segmentation, they fail to reason with videos due to the complexity of object motion in video. To bridge the gap between image and video, in this work, we propose a new video segmentation task - video reasoning segmentation. The task is designed to output tracklets of segmentation masks given a complex input text query. Moreover, to promote research in this unexplored area, we construct a reasoning video segmentation benchmark. Finally, we present ViLLa: Video reasoning segmentation with a Large Language Model, which incorporates the language generation capabilities of multimodal Large Language Models (LLMs) while retaining the capabilities of detecting, segmenting, and tracking multiple instances. We use a temporal-aware context aggregation module to incorporate contextual visual cues into text embeddings and propose a video-frame decoder to build temporal correlations across segmentation tokens. Remarkably, our ViLLa demonstrates capability in handling complex reasoning and referring video segmentation. Our model also shows impressive ability on different temporal understanding benchmarks. Both quantitative and qualitative experiments show our method effectively unlocks new video reasoning segmentation capabilities for multimodal LLMs. The code and dataset will be available at https://github.com/rkzheng99/ViLLa.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of video reasoning segmentation. Current video perception models still rely heavily on explicit text descriptions or predefined categories to identify target instances, and they cannot actively infer user intent from text input. Previous works have combined reasoning with image segmentation, but they do not extend to video because of the complexity of object motion.

To close this gap, the authors propose a new video segmentation task: video reasoning segmentation. Given a complex text query, the task is to output tracklets of segmentation masks. To promote research in this area, the authors also construct a video reasoning segmentation benchmark and propose the ViLLa model, which combines the language generation capabilities of multimodal Large Language Models (LLMs) with the ability to detect, segment, and track multiple instances. ViLLa uses a temporal-aware context aggregation module to incorporate contextual visual cues into text embeddings, and a video-frame decoder to build temporal correlations across segmentation tokens.

In summary, the main contributions of the paper are:

1. Introducing the video reasoning segmentation task, enabling models to perform pixel-level video reasoning based on implicit user instructions.
2. Constructing a comprehensive benchmark dataset containing 1934 video-instruction-mask samples to evaluate video reasoning segmentation performance.
3. Proposing the ViLLa model as a novel large multimodal model for video reasoning segmentation, achieving state-of-the-art results in various video understanding benchmarks.
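To make the task's input and output concrete, the following is a minimal sketch of what one video-instruction-mask sample might look like: a text instruction paired with one tracklet (a binary mask per frame) for each target instance. The class names, field names, and the example instruction are all hypothetical and are not taken from the paper or its released code; masks are plain nested lists purely to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import List

# One binary mask for one frame: `height` rows of `width` values in {0, 1}.
Mask = List[List[int]]


@dataclass
class Tracklet:
    """One target instance: a mask for every frame of the video (hypothetical)."""
    instance_id: int
    masks: List[Mask]


@dataclass
class ReasoningSample:
    """A video-instruction-mask sample (hypothetical field names)."""
    instruction: str
    num_frames: int
    height: int
    width: int
    tracklets: List[Tracklet] = field(default_factory=list)

    def is_valid(self) -> bool:
        """Check that every tracklet has one height x width binary mask per frame."""
        for tracklet in self.tracklets:
            if len(tracklet.masks) != self.num_frames:
                return False
            for mask in tracklet.masks:
                if len(mask) != self.height:
                    return False
                for row in mask:
                    if len(row) != self.width:
                        return False
                    if any(value not in (0, 1) for value in row):
                        return False
        return True


# A toy 2-frame, 2x3-pixel video with a single target instance.
sample = ReasoningSample(
    instruction="Which object is most likely to move first?",  # made-up query
    num_frames=2,
    height=2,
    width=3,
    tracklets=[
        Tracklet(
            instance_id=0,
            masks=[
                [[0, 1, 1], [0, 0, 0]],  # frame 0
                [[0, 0, 1], [0, 0, 1]],  # frame 1
            ],
        )
    ],
)
print(sample.is_valid())
```

The point of the sketch is only that the output is a per-instance sequence of masks aligned with the video's frames, rather than a single mask as in image reasoning segmentation.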

VISA: Reasoning Video Object Segmentation via Large Language Models

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

VideoLLM: Modeling Video Sequence with Large Language Models

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation

Streaming Long Video Understanding with Large Language Models

Video Understanding with Large Language Models: A Survey

LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

Enhancing Advanced Visual Reasoning Ability of Large Language Models

LongVLM: Efficient Long Video Understanding via Large Language Models

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Valley: Video Assistant with Large Language model Enhanced abilitY

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs