LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia

2024-05-01

Abstract: Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving complex reasoning and world knowledge. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code, models, and data are available at https://github.com/dvlab-research/LISA.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the following issues:

1. **Introducing a new segmentation task, reasoning segmentation**: The task requires generating a binary segmentation mask from a complex, implicit text query. Unlike traditional visual recognition systems, which rely on explicit instructions or predefined categories, this task requires the model to understand and infer the user's implicit intention.
2. **Establishing a benchmark dataset**: To evaluate performance on reasoning segmentation, the authors constructed a benchmark (ReasonSeg) containing over 1,000 image-instruction-mask samples.
3. **Proposing the LISA model**: LISA (large Language Instructed Segmentation Assistant) inherits the language generation capabilities of multimodal large language models (LLMs) while also producing segmentation masks. The segmentation capability comes from expanding the vocabulary with a special `<SEG>` token and applying the embedding-as-mask paradigm.

### Main Contributions

1. Proposed the reasoning segmentation task, which requires reasoning over implicit human instructions, a capability crucial for building truly intelligent perception systems.
2. Introduced the LISA model, which adds segmentation capability to multimodal LLMs. It shows strong zero-shot ability when trained only on reasoning-free data, and its performance improves further after fine-tuning on just 239 reasoning segmentation samples.
3. Established a reasoning segmentation benchmark (ReasonSeg) containing over 1,000 image-instruction-mask samples, which supports evaluation and encourages the community to further explore reasoning capabilities in visual tasks.
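
The `<SEG>` token and embedding-as-mask idea summarized above can be illustrated with a small, dependency-free sketch. This is only a toy under stated assumptions, not the authors' implementation: the abstract confirms just that the vocabulary gains a `<SEG>` token and that an embedding is turned into a mask. The function names, the linear projection, and the dot-product "decoder" below are all hypothetical stand-ins for learned components.

```python
import math
from typing import List, Optional

SEG_TOKEN = "<SEG>"  # special token name taken from the abstract

Vector = List[float]


def find_seg_index(tokens: List[str]) -> Optional[int]:
    """Return the position of the first <SEG> token in the output, or None."""
    try:
        return tokens.index(SEG_TOKEN)
    except ValueError:
        return None


def project(hidden: Vector, weight: List[Vector]) -> Vector:
    """Linear projection; a stand-in (assumption) for a learned projection layer."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def decode_mask(seg_embedding: Vector,
                pixel_features: List[List[Vector]],
                threshold: float = 0.5) -> List[List[int]]:
    """Toy decoder (assumption): sigmoid of the dot product between the
    <SEG> embedding and each per-pixel feature, thresholded to a binary mask."""
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            logit = sum(e * f for e, f in zip(seg_embedding, feat))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


def embedding_as_mask(tokens: List[str],
                      hidden_states: List[Vector],
                      weight: List[Vector],
                      pixel_features: List[List[Vector]]) -> Optional[List[List[int]]]:
    """Pick the hidden state at the <SEG> position and decode it into a mask.
    Returns None when no <SEG> token was generated (text-only answer)."""
    idx = find_seg_index(tokens)
    if idx is None:
        return None
    seg_embedding = project(hidden_states[idx], weight)
    return decode_mask(seg_embedding, pixel_features)


# Made-up toy inputs: six output tokens with 2-D hidden states and a 2x2 feature map.
tokens = ["Sure", ",", "it", "is", SEG_TOKEN, "."]
hidden_states = [[0.0, 0.0]] * 4 + [[1.0, -1.0]] + [[0.0, 0.0]]
weight = [[2.0, 0.0], [0.0, 2.0]]
pixel_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, -1.0], [-1.0, 1.0]],
]
mask = embedding_as_mask(tokens, hidden_states, weight, pixel_features)
```

In this sketch the language model's ordinary text output doubles as the segmentation trigger: if the response contains `<SEG>`, the hidden state at that position is decoded into a mask; otherwise the response stays text-only.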

LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

Empowering Segmentation Ability to Multi-modal Large Language Models

VISA: Reasoning Video Object Segmentation via Large Language Models

PixelLM: Pixel Reasoning with Large Multimodal Model

SegLLM: Multi-round Reasoning Segmentation

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

ViLLa: Video Reasoning Segmentation with Large Language Model

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Reasoning Grasping via Multimodal Large Language Model

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Multimodal 3D Reasoning Segmentation with Complex Scenes

ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

CoReS: Orchestrating the Dance of Reasoning and Segmentation

SegPoint: Segment Any Point Cloud via Large Language Model

Large Language Model with Curriculum Reasoning for Visual Concept Recognition