One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
2024-09-29
Abstract: We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed <TRK> token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language-instructed object segmentation. Code and model will be available at: https://github.com/showlab/VideoLISA.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses **Language-Instructed Reasoning Segmentation in Videos**. The authors propose **VideoLISA**, a multimodal large language model (MLLM) designed for this task.

#### Main Challenges

1. **Understanding of the Temporal Dimension**: Unlike static images, videos have an additional temporal dimension. The model must understand the spatial content of each frame as well as how that content changes, and stays consistent, across frames.
2. **Application of Complex Reasoning and World Knowledge**: User instructions may require complex reasoning and background knowledge, which places higher demands on the model's understanding.
3. **Maintenance of Cross-Frame Consistency**: Target objects may move or change appearance within a video, so the model needs to generate temporally consistent segmentation masks.

#### Solutions

To address these challenges, VideoLISA introduces two key techniques:

1. **Sparse Dense Sampling Strategy**:
   - A subset of frames is sampled uniformly and kept at high feature resolution (dense frames), while the remaining frames are downsampled (sparse frames). This balances temporal context against spatial detail.
   - The strategy retains crucial visual details while reducing the computational burden, so the model can process video data within its compute budget.
2. **One-Token-Seg-All Approach**:
   - A special `<TRK>` token is introduced to segment and track target objects across multiple frames.
   - The `<TRK>` token is trained to associate the representations of the same object in different frames, which yields temporal consistency.

#### Model Architecture

- **Visual Tokenizer**: Converts video frames into visual tokens.
- **LLM (Large Language Model)**: Combines language instructions and video content to generate segmentation prompts.
- **Vision Encoder**: Extracts the visual features of each frame.
- **Promptable Mask Decoder**: Generates pixel-level segmentation masks based on the prompts.

#### Evaluation and Contributions

- **ReasonVOS Benchmark**: To evaluate the model's ability in complex reasoning, temporal understanding, and object tracking, the authors introduce a new benchmark, ReasonVOS.
- **Experimental Results**: Extensive experiments show that VideoLISA performs strongly on multiple public benchmarks, especially in tasks involving complex reasoning and temporal understanding.

With these contributions, VideoLISA demonstrates superior performance in video object segmentation tasks and suggests a new direction for further research.
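
The Sparse Dense Sampling idea described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the use of average pooling for the sparse frames, and the representation of each frame as a 2D grid of scalar "tokens" are all assumptions made for illustration (real visual tokens are feature vectors). The sketch only shows how keeping a few uniformly spaced frames at full resolution and pooling the rest reduces the total token count.

```python
def pick_dense_indices(num_frames, num_dense):
    """Pick `num_dense` uniformly spaced frame indices (hypothetical helper).

    The video is split into `num_dense` equal segments and the frame at the
    centre of each segment is chosen, using integer arithmetic only.
    """
    if not 0 < num_dense <= num_frames:
        raise ValueError("num_dense must be in 1..num_frames")
    return [(2 * i + 1) * num_frames // (2 * num_dense) for i in range(num_dense)]


def avg_pool_grid(grid, factor):
    """Average-pool a 2D grid of floats by `factor` along each side.

    Rows and columns that do not fill a complete block are dropped.
    Average pooling is an assumption; the summary only says 'down-sampling'.
    """
    height, width = len(grid), len(grid[0])
    pooled = []
    for r in range(0, height - height % factor, factor):
        row = []
        for c in range(0, width - width % factor, factor):
            block = [grid[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def sparse_dense_sample(frames, num_dense, pool_factor):
    """Tag each frame as 'dense' (kept as-is) or 'sparse' (pooled)."""
    dense = set(pick_dense_indices(len(frames), num_dense))
    return [
        ("dense", grid) if i in dense else ("sparse", avg_pool_grid(grid, pool_factor))
        for i, grid in enumerate(frames)
    ]


def count_tokens(sampled):
    """Total number of grid cells (tokens) across all sampled frames."""
    return sum(len(grid) * len(grid[0]) for _, grid in sampled)
```

As a usage example under these assumptions: for 8 frames of 4x4 tokens with 2 dense frames and a pooling factor of 2, the dense frames are those at indices 2 and 6, and the token count drops from 128 (all frames at full resolution) to 56 (2 × 16 dense tokens plus 6 × 4 sparse tokens).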