Abstract: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io

What problem does this paper attempt to address?

The problem this paper addresses is the difficulty that current large multimodal models (LMMs) have in associating language components with visual entities, known as the "grounding" problem. Specifically:

1. **Limitations of existing methods**:
   - **Scalability**: High-quality image datasets with object-level annotations are limited in scale (at most several million), far smaller than datasets containing only coarse image-text pairs (which can reach billions). Visual instruction data built from these object-level annotations is therefore limited in scale as well.
   - **Supervision bias**: Shifting the data focus toward the grounding task may lead to catastrophic forgetting and damage the general conversational ability of LMMs. In addition, grounding data, whether manually annotated or pseudo-annotated by other models, carries biases and may not align with general human preferences.
   - **Generalization ability**: Existing grounding supervision is limited to the visual concepts covered by specific datasets or models, which restricts the model's ability to solve open-world problems.
2. **Grounding ability without explicit supervision**:
   - The paper proposes a new method that reveals and enhances the grounding ability that LMMs acquire through weakly supervised visual instruction tuning, without explicit grounding supervision. This avoids the problems of strong supervision listed above, making the model more scalable and more general-purpose while reducing the bias introduced by the supervision data.
3. **Specific contributions**:
   - **attend-and-segment method**: By examining the attention maps produced during generation and converting them into segmentation masks, pixel-level grounding is achieved without additional grounding supervision or architectural changes.
   - **DIFFLMM**: A diffusion-model-based visual encoder is introduced to enhance the grounding ability of LMMs while maintaining performance on general vision-language tasks.

In summary, this paper explores a method that does not rely on explicit grounding supervision to improve the grounding and generalization abilities of LMMs, so that they can better handle real-world vision-language tasks.
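The attend-and-segment idea described above (reading an attention map and turning it into a mask) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes the attention weights from one generated token to the visual tokens have already been extracted and flattened in row-major order (a hypothetical input), and it uses min-max normalization plus a fixed threshold as a simple stand-in for the paper's mask-conversion step. The function names `attention_to_mask` and `peak_location` are hypothetical.

```python
from typing import List, Tuple


def attention_to_mask(
    attn: List[float], grid_h: int, grid_w: int, threshold: float = 0.5
) -> List[List[int]]:
    """Convert one token's attention over visual tokens into a coarse binary mask.

    `attn` is assumed to hold one weight per visual token, in row-major order
    over a grid_h x grid_w patch grid (an assumption of this sketch).
    """
    if len(attn) != grid_h * grid_w:
        raise ValueError("attention length must equal grid_h * grid_w")
    lo, hi = min(attn), max(attn)
    span = hi - lo
    # Min-max normalize to [0, 1]; a perfectly flat map yields an empty mask.
    norm = [(a - lo) / span if span > 0 else 0.0 for a in attn]
    # Threshold each patch and reshape the flat list into grid rows.
    return [
        [1 if norm[r * grid_w + c] >= threshold else 0 for c in range(grid_w)]
        for r in range(grid_h)
    ]


def peak_location(attn: List[float], grid_w: int) -> Tuple[int, int]:
    """Return (row, col) of the most-attended patch, e.g. as a point prompt."""
    idx = max(range(len(attn)), key=attn.__getitem__)
    return divmod(idx, grid_w)


# Toy 2x2 example: the bottom row receives most of the attention.
toy_attn = [0.0, 0.1, 0.9, 0.8]
toy_mask = attention_to_mask(toy_attn, grid_h=2, grid_w=2)
toy_peak = peak_location(toy_attn, grid_w=2)
```

The mask here lives at patch resolution; a real pipeline would need to upsample it or hand the peak location to a separate segmentation model to reach pixel resolution, and that choice is left open in this sketch.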

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

F-LMM: Grounding Frozen Large Multimodal Models

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

GLaMM: Pixel Grounding Large Multimodal Model

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Learning Visual Grounding from Generative Vision and Language Model

Grounded 3D-LLM with Referent Tokens

Generalizable Entity Grounding via Assistance of Large Language Model

Learning to Ground VLMs without Forgetting

GroundingGPT: Language Enhanced Multi-modal Grounding Model

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

Visual Grounding With Joint Multimodal Representation and Interaction

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model