Abstract: Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest LMMs can generate dense pixel-wise segmentation by integrating segmentation models. Despite these innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even when provided with detailed textual cues. To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name this new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation on the MGSC task, we establish a benchmark with aligned masks and captions at multiple granularities using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. Besides, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation tasks. The strong performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.

PixelLM: Pixel Reasoning with Large Multimodal Model

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

LISA: Reasoning Segmentation via Large Language Model

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

Pixel Aligned Language Models

Multimodal 3D Reasoning Segmentation with Complex Scenes

See, Say, and Segment: Teaching LMMs to Overcome False Premises

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Segment Anything with Multiple Modalities

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Language-Image Models with 3D Understanding

SegLLM: Multi-round Reasoning Segmentation

LIME: Less Is More for MLLM Evaluation

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models