Abstract: Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest LMMs can generate dense pixel-wise segmentation by integrating segmentation models. Despite these innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even when provided with detailed textual cues. To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which can seamlessly adjust the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name this new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation on the MGSC task, we establish a benchmark with aligned masks and captions at multiple granularities using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. In addition, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that MGLMM excels at more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation tasks. The strong performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

MultiScan: Scalable RGBD scanning for 3D environments with articulated objects

VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation

Holistic Understanding of 3D Scenes as Universal Scene Description

DIDLM: A Comprehensive Multi-Sensor Dataset with Infrared Cameras, Depth Cameras, LiDAR, and 4D Millimeter-Wave Radar in Challenging Scenarios for 3D Mapping

OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding

Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception

OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

S3E: A Large-scale Multimodal Dataset for Collaborative SLAM

Grounded 3D-LLM with Referent Tokens

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

EnvoDat: A Large-Scale Multisensory Dataset for Robotic Spatial Awareness and Semantic Reasoning in Heterogeneous Environments

M$^3$SC: A Generic Dataset for Mixed Multi-Modal (MMM) Sensing and Communication Integration

Human-centric Scene Understanding for 3D Large-scale Scenarios