Abstract: Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest LMMs can generate dense pixel-wise segmentation by integrating segmentation models. Despite these innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even when provided with detailed textual cues. To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name this new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation on the MGSC task, we establish a benchmark with aligned masks and captions at multiple granularities using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. In addition, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that MGLMM excels at more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation tasks. The strong performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.

Generalizable Entity Grounding via Assistance of Large Language Model

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

Comprehensive Visual Grounding for Video Description

Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models

Learning Comprehensive Visual Grounding for Video Captioning

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Granular Entity Mapper: Advancing Fine-grained Multimodal Named Entity Recognition and Grounding

Entity recognition based on heterogeneous graph reasoning of visual region and text candidate

Grounding Language Models for Visual Entity Recognition

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Efficient Multi-modal Large Language Models via Visual Token Grouping

LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation

Learning Visual Grounding from Generative Vision and Language Model