GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan
2024-06-02
Abstract: Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG), we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning, and vision-language conversations.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses pixel-level visual grounding in large multimodal models (LMMs): generating natural language responses in which the mentioned objects are tied to specific regions of the image. Existing LMMs generate text well, but their responses are either ungrounded or only coarsely grounded. Region-level models can refer to only a single object category at a time, require users to specify the regions, or cannot provide dense pixel-wise object grounding. To overcome these limitations, the paper proposes Grounding LMM (GLaMM), a model that generates natural language responses interleaved with the corresponding object segmentation masks. GLaMM grounds the objects that appear in a conversation and accepts both text and optional visual prompts (regions of interest) as input, so users can interact with the model at different levels of granularity in both the textual and visual domains. Because no standard benchmark exists for visually Grounded Conversation Generation (GCG), the authors introduce a comprehensive evaluation protocol based on curated grounded conversations. They also build the Grounding-anything Dataset (GranD), a large-scale, densely annotated dataset produced by an automated annotation pipeline, to support model training and evaluation; it contains 7.5M unique concepts grounded in 810M regions with segmentation masks. Beyond GCG, GLaMM is also reported to perform effectively on downstream tasks such as referring expression segmentation, image and region-level captioning, and vision-language conversations.