Text4Seg: Reimagining Image Segmentation as Text Generation

Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

2024-10-13

Abstract: Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16\times16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.
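The semantic descriptors mentioned in the abstract can be illustrated with a minimal sketch. It assumes the mask is a 2D grid of integer class ids and that each patch takes the most frequent label inside it; the majority-vote rule, the function name `mask_to_descriptors`, and the label names are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def mask_to_descriptors(mask, id_to_label, grid=16):
    """Map a pixel-level mask to a grid x grid table of text labels.

    Assumption: each patch gets the majority class id of its pixels.
    The mask must be at least grid x grid pixels so no patch is empty.
    """
    h, w = len(mask), len(mask[0])
    descriptors = []
    for i in range(grid):
        r0, r1 = i * h // grid, (i + 1) * h // grid  # pixel rows of this patch row
        row = []
        for j in range(grid):
            c0, c1 = j * w // grid, (j + 1) * w // grid  # pixel columns of this patch
            votes = Counter(mask[r][c] for r in range(r0, r1) for c in range(c0, c1))
            row.append(id_to_label[votes.most_common(1)[0][0]])
        descriptors.append(row)
    return descriptors

# Toy 8x8 mask: left half is class 0, right half is class 1 (hypothetical labels).
toy_mask = [[0] * 4 + [1] * 4 for _ in range(8)]
toy_labels = {0: "others", 1: "dog"}
toy_descriptors = mask_to_descriptors(toy_mask, toy_labels, grid=4)
```

With `grid=16`, the same function would produce the 256 patch labels that the abstract refers to as $16\times16$ semantic descriptors.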

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to integrate image segmentation effectively into multimodal large language models (MLLMs). Although MLLMs perform well on vision-language tasks, dense prediction tasks such as semantic segmentation remain hard to integrate seamlessly. Existing methods usually require additional visual decoders, which complicate the training pipeline and limit the extensibility of the model. The paper proposes a new text-as-mask paradigm that recasts image segmentation as a text generation problem, which simplifies the segmentation process, reduces the need for additional architectural changes, and makes full use of the text generation capabilities of MLLMs. The main contributions of the paper are:

1. **Proposing Text4Seg**: a new text-as-mask paradigm that turns image segmentation into a text generation problem and makes full use of the text generation capabilities of MLLMs.
2. **Introducing semantic descriptors**: a new textual representation that maps each image patch to its corresponding text label, producing a purely textual representation of the image that integrates seamlessly into the autoregressive training pipeline of MLLMs and simplifies optimization.
3. **Developing Row-wise Run-Length Encoding (R-RLE)**: a compression scheme for semantic descriptors that significantly reduces their length and the inference cost without sacrificing performance.
4. **Verifying the effectiveness and robustness of Text4Seg**: achieving state-of-the-art performance on multiple vision-centric tasks with several MLLM backbones.

Together, these contributions provide an efficient and extensible way for MLLMs to handle image segmentation.
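The row-wise run-length encoding idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ` *N` count suffix, the ` | ` item separator, the newline row separator, and the function names are all assumptions chosen for readability. The only property taken from the summary is that runs are compressed within each row and never merged across rows.

```python
from itertools import groupby

def rrle_encode(grid, item_sep=" | ", row_sep="\n"):
    """Compress each row of labels into 'label *count' runs (separators are assumed)."""
    encoded_rows = []
    for row in grid:
        # groupby yields one group per run of identical consecutive labels.
        runs = [f"{label} *{len(list(group))}" for label, group in groupby(row)]
        encoded_rows.append(item_sep.join(runs))
    return row_sep.join(encoded_rows)

def rrle_decode(text, item_sep=" | ", row_sep="\n"):
    """Invert rrle_encode, restoring the full grid of labels."""
    grid = []
    for line in text.split(row_sep):
        row = []
        for item in line.split(item_sep):
            label, count = item.rsplit(" *", 1)  # split on the last ' *' only
            row.extend([label] * int(count))
        grid.append(row)
    return grid

# Toy 2x4 grid of patch labels (hypothetical labels).
toy_grid = [["sky", "sky", "sky", "sky"],
            ["sky", "sky", "dog", "dog"]]
toy_text = rrle_encode(toy_grid)
toy_roundtrip = rrle_decode(toy_text)
```

Restarting the runs at every row keeps each line of text aligned with one row of patches, so the decoded grid retains its two-dimensional layout without any extra bookkeeping.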


Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

MMF-CLIP: An Image-Text Multimodal Semantic Segmentation Method for Remote Sensing Images

Exploring Simple Open-Vocabulary Semantic Segmentation

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

MetaSegNet: Metadata-Collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

Empowering Segmentation Ability to Multi-modal Large Language Models

Integrated Image-Text Based on Semi-supervised Learning for Small Sample Instance Segmentation

From Text to Pixel: Advancing Long-Context Understanding in MLLMs

Training-Free Semantic Segmentation via LLM-Supervision

Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images

A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models