Abstract: We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed <TRK> token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language-instructed object segmentation. Code and model will be available at: https://github.com/showlab/VideoLISA.

### What problem does this paper attempt to solve?

This paper addresses **language-instructed reasoning segmentation in videos**. The authors propose **VideoLISA**, a multimodal large language model (MLLM) built for this task.

#### Main Challenges

1. **Understanding of the Temporal Dimension**: Unlike static images, videos add a temporal dimension, so the model must understand the spatial content of each frame as well as the changes and consistency between frames.
2. **Application of Complex Reasoning and World Knowledge**: User instructions may require complex reasoning and background knowledge, which places higher demands on the model's understanding.
3. **Maintenance of Cross-Frame Consistency**: Target objects may move or change appearance, so the model needs to generate temporally consistent segmentation masks.

#### Solutions

To address these challenges, VideoLISA introduces two key techniques:

1. **Sparse Dense Sampling Strategy**:
   - A portion of the frames is uniformly sampled and kept at high resolution (dense frames), while the other frames are down-sampled (sparse frames), balancing temporal context against spatial detail.
   - This retains crucial visual detail while reducing the computational burden, so the model can process video within its compute budget.
2. **One-Token-Seg-All Approach**:
   - A special `<TRK>` token is introduced to segment and track target objects across multiple frames.
   - The `<TRK>` token is trained to associate the representations of the same object in different frames, which yields temporal consistency.

#### Model Architecture

- **Visual Tokenizer**: Converts video frames into visual tokens.
- **LLM (Large Language Model)**: Combines language instructions and video content to generate segmentation prompts.
- **Vision Encoder**: Extracts the visual features of each frame.
- **Promptable Mask Decoder**: Generates pixel-level segmentation masks based on the prompts.

#### Evaluation and Contributions

- **ReasonVOS Benchmark**: To evaluate complex reasoning, temporal understanding, and object tracking together, the authors introduce a new benchmark, ReasonVOS.
- **Experimental Results**: Extensive experiments show that VideoLISA performs strongly on multiple public benchmarks, especially on tasks involving complex reasoning and temporal understanding.

Through these contributions, VideoLISA demonstrates superior performance in video object segmentation tasks and offers a new direction for further research.
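
The Sparse Dense Sampling idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sparse_dense_sample`, the `(T, H, W, C)` feature layout, the use of average pooling for down-sampling, and the parameters `num_dense` and `pool` are all assumptions made for illustration, following only the description in this summary (uniformly chosen dense frames keep every token, the remaining frames are spatially down-sampled).

```python
import numpy as np


def sparse_dense_sample(frame_tokens, num_dense, pool):
    """Illustrative sketch only (hypothetical names, not the VideoLISA code).

    frame_tokens: array of shape (T, H, W, C), one feature grid per frame.
    num_dense:    how many uniformly spaced frames keep full resolution.
    pool:         spatial average-pooling factor applied to the other frames.
    Returns (dense_idx, dense_tokens, sparse_idx, sparse_tokens).
    """
    T, H, W, C = frame_tokens.shape
    if H % pool or W % pool:
        raise ValueError("pool must divide both H and W")

    # Uniformly spaced dense frames over the clip (duplicates removed).
    dense_idx = np.unique(np.linspace(0, T - 1, num=num_dense).round().astype(int))
    # Dense frames keep all H*W tokens.
    dense_tokens = frame_tokens[dense_idx].reshape(len(dense_idx), H * W, C)

    # Every remaining frame is treated as sparse (assumption from the summary).
    sparse_idx = np.setdiff1d(np.arange(T), dense_idx)
    grid = frame_tokens[sparse_idx]
    # Average-pool each pool x pool window to cut the token count per frame.
    pooled = grid.reshape(
        len(sparse_idx), H // pool, pool, W // pool, pool, C
    ).mean(axis=(2, 4))
    sparse_tokens = pooled.reshape(
        len(sparse_idx), (H // pool) * (W // pool), C
    )
    return dense_idx, dense_tokens, sparse_idx, sparse_tokens


# Toy clip: 8 frames, each a 4x4 grid of 3-dimensional features.
clip = np.arange(8 * 4 * 4 * 3, dtype=float).reshape(8, 4, 4, 3)
dense_idx, dense_tokens, sparse_idx, sparse_tokens = sparse_dense_sample(
    clip, num_dense=4, pool=2
)
total_tokens = dense_tokens.shape[0] * dense_tokens.shape[1] \
    + sparse_tokens.shape[0] * sparse_tokens.shape[1]
```

In this toy setting, four frames keep all 16 tokens and four frames are pooled to 4 tokens each, so the model would see 80 visual tokens instead of 128 while still covering every frame in time.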

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

LISA: Reasoning Segmentation via Large Language Model

VISA: Reasoning Video Object Segmentation via Large Language Models

ViLLa: Video Reasoning Segmentation with Large Language Model

LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

Empowering Segmentation Ability to Multi-modal Large Language Models

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Reasoning to Attend: Try to Understand How <SEG> Token Works

Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

VideoLLM: Modeling Video Sequence with Large Language Models

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

LITA: Language Instructed Temporal-Localization Assistant

PixelLM: Pixel Reasoning with Large Multimodal Model

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection