GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan

2024-06-02

Abstract: Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG), we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning, and vision-language conversations.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses pixel-level visual grounding in large multimodal models (LMMs): generating natural language responses in which the phrases that mention objects are tied to segmentation masks of those objects in the image. Early LMMs take a whole image and a text prompt and produce text that is not grounded in the image. More recent region-level LMMs do ground their responses, but they can refer to only a single object category at a time, require users to specify the regions, or cannot provide dense pixel-wise grounding. To overcome these limitations, the paper proposes Grounding LMM (GLaMM), which generates natural language responses interleaved with the corresponding object segmentation masks. GLaMM grounds the objects that appear in the conversation and accepts both text prompts and optional visual prompts (regions of interest) as input, so users can interact with the model at different levels of granularity in both the textual and visual domains. Because no standard benchmark exists for the new Grounded Conversation Generation (GCG) task, the authors introduce a comprehensive evaluation protocol built on curated grounded conversations. They also construct the Grounding-anything Dataset (GranD), a large-scale, densely annotated dataset produced by an automated annotation pipeline, covering 7.5M unique concepts grounded in 810M regions with segmentation masks. Besides GCG, GLaMM is reported to perform effectively on downstream tasks such as referring expression segmentation, image and region-level captioning, and vision-language conversations.
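
To make the idea of a response "interleaved with segmentation masks" concrete, the following is a minimal Python sketch of how such an output might be represented and how a predicted mask could be compared with a reference mask. Everything here is an illustrative assumption: the names GroundedPhrase, GroundedResponse, mask_iou, and rect are hypothetical and do not come from the paper or its code, masks are modeled as plain sets of pixel coordinates, and intersection-over-union is used only because it is a common mask-overlap measure; this page does not state which metrics the GCG evaluation protocol uses.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col); an assumed, simplified mask encoding


@dataclass
class GroundedPhrase:
    """Hypothetical container: a span of the response text plus its mask."""
    start: int        # character offset where the phrase begins
    end: int          # character offset one past the end of the phrase
    mask: Set[Pixel]  # pixels assigned to the object the phrase refers to


@dataclass
class GroundedResponse:
    """Hypothetical container: response text with its grounded phrases."""
    text: str
    phrases: List[GroundedPhrase]

    def phrase_text(self, index: int) -> str:
        """Return the substring of the response covered by one phrase."""
        phrase = self.phrases[index]
        return self.text[phrase.start:phrase.end]


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection-over-union of two pixel sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rect(top: int, left: int, bottom: int, right: int) -> Set[Pixel]:
    """Toy helper: a filled rectangle (bottom/right exclusive) as a mask."""
    return {(r, c) for r in range(top, bottom) for c in range(left, right)}


# Toy example: two phrases in one sentence, each tied to its own mask.
response = GroundedResponse(
    text="A dog sits on the grass.",
    phrases=[
        GroundedPhrase(start=2, end=5, mask=rect(0, 0, 4, 4)),    # "dog"
        GroundedPhrase(start=18, end=23, mask=rect(4, 0, 8, 8)),  # "grass"
    ],
)

# Compare the "dog" mask with a shifted reference mask.
reference_dog = rect(0, 2, 4, 6)
overlap = mask_iou(response.phrases[0].mask, reference_dog)
```

In this toy case the two 4x4 rectangles share 8 of 24 distinct pixels, so overlap is 1/3. A real system would more likely store masks as binary arrays, but the pairing of text spans with per-object masks is the part the sketch is meant to illustrate.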

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

F-LMM: Grounding Frozen Large Multimodal Models

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

LLMGA: Multimodal Large Language Model based Generation Assistant

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

GroundingGPT: Language Enhanced Multi-modal Grounding Model

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Learning Visual Grounding from Generative Vision and Language Model

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data

Grounded 3D-LLM with Referent Tokens