Abstract: We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.

What problem does this paper attempt to address?

The paper addresses a gap in LLM-based image segmentation: existing models perform well in single-round interactions, but they lack an effective mechanism for retaining and using previous segmentation results and conversation history across multiple rounds. As a result, performance declines on complex queries, especially multi-round queries involving positional, interactional, and hierarchical relationships. Specifically, the paper identifies and addresses the following problems:

1. **Information retention in multi-round interactions**: Existing LLM segmentation models cannot effectively remember and use previously generated segmentation results and conversation content in multi-round dialogues. For example, a user may first ask to segment a "person in a black hoodie" and then issue a follow-up query that builds on this result, such as "the snowboard he is holding" or "the person standing on his right". Existing models perform poorly in such cases because they cannot associate previous segmentation results with new queries.
2. **Complex user-intent understanding**: To handle complex user queries, the model must understand and process instructions involving multiple objects and their relationships. For example, a user may ask to segment a specific part of an object (such as "a person's hair"), or to operate on a previously segmented object (such as "the child sitting on the previously segmented object"). Such queries go beyond simple classification and require reasoning ability.
3. **Improving the performance of multi-round reasoning segmentation**: By introducing SegLLM, a novel multi-round reasoning segmentation model, the paper aims to improve performance in multi-round interactions. SegLLM re-integrates previous segmentation results into the input stream and combines them with the conversation history, enabling the model to better understand and execute complex user instructions across rounds.

### Solutions

To solve the above problems, the paper proposes the SegLLM model, whose main innovations include:

- **Mask-Encoding scheme**: Previous segmentation masks are re-encoded into embedding vectors and fed back into the input stream of the LLM, so the LLM can "see" previous segmentation results.
- **Mask-Aware Decoding scheme**: A reference mask decoder is introduced, which generates new segmentation masks conditioned on previous segmentation results and the current text query.
- **Multi-round interaction dataset MRSeg**: A high-quality multi-round interactive segmentation dataset, MRSeg, is constructed. It contains several types of multi-round queries, covering positional, interactional, and hierarchical relationships, and is used to evaluate model performance in multi-round interactions.

With these improvements, SegLLM significantly outperforms existing methods on multi-round reasoning segmentation tasks and also achieves better performance on standard single-round referring segmentation tasks.

### Summary

The main contribution of the paper is SegLLM, a new model that can effectively handle multi-round interactive segmentation tasks, together with experiments that verify its superior performance on both multi-round reasoning segmentation and standard referring segmentation. This points to a new direction for future research, especially in natural language interaction and reasoning in complex visual scenes.
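
To make the mask-encoding idea from the Solutions section concrete, the following is a minimal sketch, not the paper's implementation. It assumes the mask embedding is obtained by masked average pooling over an image feature map and that the feature channels already match the LLM's token dimension; the summary does not specify either detail. The names `encode_mask` and `append_mask_token` are hypothetical.

```python
from typing import List

Vector = List[float]


def encode_mask(features: List[List[Vector]], mask: List[List[bool]]) -> Vector:
    """Masked average pooling (an assumed encoder, not the paper's):
    average the feature vectors of all pixels inside the mask."""
    channels = len(features[0][0])
    total = [0.0] * channels
    area = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                area += 1
                for c in range(channels):
                    total[c] += vector[c]
    if area == 0:
        return total  # empty mask: all-zero embedding
    return [value / area for value in total]


def append_mask_token(tokens: List[Vector], mask_embedding: Vector) -> List[Vector]:
    """Return a new input sequence with the mask embedding appended as one
    extra token, so the next round can condition on the earlier result."""
    return tokens + [mask_embedding]


# Toy 2x2 feature map with 2 channels per pixel.
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[5.0, 4.0], [7.0, 6.0]]]
# Mask produced in a previous round (top-left and bottom-right pixels).
previous_mask = [[True, False],
                 [False, True]]
# Stand-in for the embedded conversation history (2 tokens, 2 dimensions).
conversation_tokens = [[0.1, 0.2], [0.3, 0.4]]

mask_embedding = encode_mask(features, previous_mask)
next_round_input = append_mask_token(conversation_tokens, mask_embedding)
print(mask_embedding)         # [4.0, 3.0]
print(len(next_round_input))  # 3
```

In a real system the pooled vector would likely pass through a learned projection before entering the LLM; that step is omitted here to keep the sketch self-contained.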

SegLLM: Multi-round Reasoning Segmentation

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

Empowering Segmentation Ability to Multi-modal Large Language Models

CoReS: Orchestrating the Dance of Reasoning and Segmentation

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

LISA: Reasoning Segmentation via Large Language Model

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Multimodal 3D Reasoning Segmentation with Complex Scenes

PixelLM: Pixel Reasoning with Large Multimodal Model

SegPoint: Segment Any Point Cloud via Large Language Model

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model

Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation

ViLLa: Video Reasoning Segmentation with Large Language Model

MindMerger: Efficient Boosting LLM Reasoning in non-English Languages

Text4Seg: Reimagining Image Segmentation as Text Generation