Abstract: This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, their limited adaptation to both image and video scenarios, and to complex reasoning segmentation, makes it difficult for them to handle challenging instructions and to accurately understand fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks that require powerful reasoning abilities and world knowledge. In addition, to fully leverage the recognition capabilities of VLLMs and fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.

What problem does this paper attempt to address?

This paper targets universal visual segmentation, in particular image and video perception tasks that require strong reasoning ability. Current unified segmentation methods have made considerable progress, but they adapt poorly across both image and video scenarios, struggle with complex instructions, and do not accurately capture fine-grained vision-language correlations. The paper therefore proposes HyperSeg, which it describes as the first universal segmentation model based on a Visual Large Language Model (VLLM). HyperSeg covers generic pixel-level image and video segmentation as well as complex vision-language reasoning perception tasks that depend on reasoning ability and world knowledge. It rests on three designs:

1. **Hybrid entity recognition strategy**: Combines the generation ability of the VLLM with the final class-score decoding process, so that mask tokens gain a better understanding of category semantics.
2. **Fine-grained visual perceiver module**: Merges multi-scale visual features into fixed-length fine-grained tokens, efficiently capturing rich visual details from different scales.
3. **Temporal adapter**: Uses global prompt aggregation and local spatio-temporal information injection to capture both long-term and short-term vision-language information, which matters most for video perception tasks.

According to the paper, these designs let HyperSeg perform well on multiple segmentation tasks, including complex reasoning tasks, and the experimental results support their effectiveness on both generic segmentation and reasoning segmentation.
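The second design, merging multi-scale features into a fixed number of tokens, can be illustrated with a small dependency-free sketch. It is not the paper's implementation: it assumes a generic query-based cross-attention scheme, in which a fixed set of query vectors attends to each scale in turn. The names `cross_attend` and `perceive`, the residual update, and the random inputs are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Each query returns an attention-weighted average of the feature vectors."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        attended.append(
            [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
        )
    return attended


def perceive(multi_scale_features, queries):
    """Fold several feature maps into len(queries) tokens (assumed scheme)."""
    tokens = queries
    for feats in multi_scale_features:
        attended = cross_attend(tokens, feats)
        # Residual update: the token count stays fixed whatever the map size.
        tokens = [
            [t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)
        ]
    return tokens


def random_vectors(rng, count, dim):
    """Stand-in for real features: `count` random vectors of length `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim, num_tokens = 8, 4
# Three flattened feature maps of different spatial sizes (hypothetical shapes).
scales = [random_vectors(rng, n, dim) for n in (16, 64, 256)]
queries = random_vectors(rng, num_tokens, dim)

tokens = perceive(scales, queries)
print(len(tokens), len(tokens[0]))  # 4 8
```

The point of the sketch is that the output length depends only on the number of queries, not on the resolution of any input scale, so a language model downstream would see a constant-size visual prefix. A real module would normally use learned projections and a tensor library rather than plain lists.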

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Universal Segmentation at Arbitrary Granularity with Language Instruction

Empowering Segmentation Ability to Multi-modal Large Language Models

SegLLM: Multi-round Reasoning Segmentation

VISA: Reasoning Video Object Segmentation via Large Language Models

ViLLa: Video Reasoning Segmentation with Large Language Model

LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation

Text4Seg: Reimagining Image Segmentation as Text Generation

LISA: Reasoning Segmentation via Large Language Model

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

GSVA: Generalized Segmentation via Multimodal Large Language Models

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Exploring Simple Open-Vocabulary Semantic Segmentation

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

IFSeg: Image-free Semantic Segmentation via Vision-Language Model

Few-Shot Classification & Segmentation Using Large Language Models Agent

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Learning Spatial-Semantic Features for Robust Video Object Segmentation