SegLLM: Multi-round Reasoning Segmentation

XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, Trevor Darrell
2024-11-01
Abstract: We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.
Computer Vision and Pattern Recognition, Artificial Intelligence
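The abstract reports gains in cIoU and Acc@0.5 without defining them. The sketch below shows how these two metrics are commonly computed in referring segmentation and localization work: cIoU as total intersection over total union across all samples, and Acc@0.5 as the share of predicted boxes whose IoU with the ground truth exceeds 0.5. The function names, the list-based mask format, and the strict `>` comparison are assumptions made for illustration; the paper's own evaluation code may differ.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]                     # binary mask as rows of 0/1
Box = Tuple[float, float, float, float]    # (x1, y1, x2, y2)


def mask_intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united pixels of two same-sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def cumulative_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """cIoU: summed intersection divided by summed union over the dataset."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = mask_intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_threshold(preds: Sequence[Box], gts: Sequence[Box],
                     threshold: float = 0.5) -> float:
    """Fraction of predictions whose box IoU exceeds the threshold.

    Assumption: strict '>' comparison; some protocols use '>=' instead.
    """
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) > threshold)
    return hits / len(preds)
```

Because cIoU pools pixels over the whole dataset, large objects weigh more than small ones, which is one reason it is often reported alongside per-sample metrics.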
What problem does this paper attempt to address?
This paper addresses a gap in LLM-based image segmentation: existing models perform well in single-round interactions but lack an effective mechanism for retaining and using previous segmentation results and conversation history across multiple rounds. As a result, performance declines on complex queries, especially multi-round queries that involve positional, interactional, and hierarchical relationships. Specifically, the paper targets the following problems:

1. **Information retention in multi-round interactions**: Existing LLM segmentation models cannot effectively remember and use previously generated segmentation results and conversation content in multi-round dialogues. For example, a user may first ask to segment a "person in a black hoodie" and then issue a follow-up query that depends on that result, such as "the snowboard he is holding" or "the person standing on his right". Existing models perform poorly in such cases because they cannot associate previous segmentation results with new queries.
2. **Complex user-intent understanding**: To handle complex user queries, the model must understand and process instructions that involve multiple objects and their relationships. For example, a user may ask to segment a specific part of an object (such as "a person's hair") or refer back to a previously segmented object (such as "the child sitting on the previously segmented object"). Such queries go beyond simple classification and require reasoning ability.
3. **Improving the performance of multi-round reasoning segmentation**: The paper introduces SegLLM, a multi-round reasoning segmentation model that re-integrates previous segmentation results into the input stream and combines them with the conversation history, so that the model can better understand and execute complex user instructions across rounds.

### Solutions

To solve these problems, the paper proposes the SegLLM model, whose main innovations are:

- **Mask-encoding scheme**: Previous segmentation masks are re-encoded into embedding vectors and fed back into the LLM's input stream, so that the LLM can "see" previous segmentation results.
- **Mask-aware decoding scheme**: A reference mask decoder generates new segmentation masks conditioned on both previous segmentation results and the current text query.
- **Multi-round interaction dataset MRSeg**: A high-quality multi-round interactive segmentation dataset that covers several types of multi-round queries, including positional, interactional, and hierarchical relationships, and is used to evaluate model performance in multi-round interactions.

With these improvements, SegLLM outperforms existing methods on multi-round reasoning segmentation by over 20% on MRSeg. It also improves on standard single-round tasks, with a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.

### Summary

The paper's main contribution is SegLLM, a model that handles multi-round interactive segmentation effectively, with experiments showing gains on both multi-round reasoning segmentation and standard referring segmentation. The work points to a direction for future research on natural language interaction and reasoning in complex visual scenes.
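The mask-encoding idea described under Solutions, where each predicted mask is turned into an embedding and placed back into the input stream next to the conversation text, can be illustrated with a toy sketch. Everything here is hypothetical: `encode_mask`, `MaskToken`, and `Conversation` are illustrative names, and the grid-pooling encoder is a simple stand-in for the learned mask encoder the paper would use, not its actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Union

Mask = List[List[int]]  # binary mask as rows of 0/1


@dataclass
class MaskToken:
    """Placeholder for a previous mask re-encoded as an embedding vector."""
    round_index: int
    embedding: List[float]


def encode_mask(mask: Mask, grid: int = 2) -> List[float]:
    """Toy encoder: foreground fraction in each cell of a grid x grid layout.

    Assumption: stands in for a learned mask encoder.
    """
    height, width = len(mask), len(mask[0])
    embedding = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            embedding.append(sum(cell) / len(cell) if cell else 0.0)
    return embedding


@dataclass
class Conversation:
    """Input stream that interleaves text queries with encoded past masks."""
    stream: List[Union[str, MaskToken]] = field(default_factory=list)
    rounds: int = 0

    def add_round(self, query: str, predicted_mask: Mask) -> None:
        # Append the query, then the re-encoded mask, so that later rounds
        # can refer back to the entity segmented in this round.
        self.stream.append(query)
        self.stream.append(MaskToken(self.rounds, encode_mask(predicted_mask)))
        self.rounds += 1


# Two rounds mirroring the example queries in the text; masks are made up.
person_mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
board_mask = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

conv = Conversation()
conv.add_round("the person in a black hoodie", person_mask)
conv.add_round("the snowboard he is holding", board_mask)
```

After the second round the stream holds two queries and two mask tokens, so a follow-up such as "the person standing on his right" could, in principle, be resolved against the stored embedding of the first mask rather than against text alone.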