Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang
2024-10-11
Abstract: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the difficulty that current large multimodal models (LMMs) have in relating language components to visual entities, known as the "grounding" problem. Specifically:

1. **Limitations of existing methods**:
   - **Scalability**: High-quality image datasets with object-level annotations are limited in scale (at most several million), far smaller than datasets of coarse image-text pairs (which can reach billions). Visual instruction data built from these object-level annotations is therefore limited in scale as well.
   - **Supervision bias**: Shifting the data focus toward grounding tasks can cause catastrophic forgetting and damage the general conversational ability of LMMs. In addition, grounding data, whether annotated manually or pseudo-labeled by other models, carries biases and may not align with general human preferences.
   - **Generalization ability**: Existing grounding supervision is restricted to the visual concepts covered by specific datasets or models, which limits the model's ability to handle open-world problems.
2. **Grounding ability without explicit supervision**:
   - The paper proposes a method that reveals and enhances the grounding ability that LMMs acquire through weakly supervised visual instruction tuning, without explicit grounding supervision. This avoids the problems of strong supervision listed above, making the model more scalable and generalizable and reducing bias from the supervision data.
3. **Specific contributions**:
   - **Attend-and-segment method**: Attention maps produced during the model's generation process are examined and converted into segmentation masks, achieving pixel-level grounding without additional grounding supervision or architectural changes.
   - **DIFFLMM**: An LMM with a diffusion-model-based visual encoder, which enhances grounding ability while maintaining performance on general vision-language tasks.

In summary, the paper explores how to improve the grounding and generalization abilities of LMMs without relying on explicit grounding supervision, so that they can better handle real-world vision-language tasks.
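
To make the attend-and-segment idea more concrete, the sketch below shows one simple way an attention map over visual tokens could be turned into a pixel mask. It is only an illustration under stated assumptions: this summary does not say how the paper performs the conversion, so the min-max normalization, nearest-neighbor upsampling, and fixed threshold used here are stand-ins, and the function names `attention_to_mask` and `peak_location` are hypothetical.

```python
from typing import List, Tuple


def attention_to_mask(
    attn: List[float],
    grid_h: int,
    grid_w: int,
    img_h: int,
    img_w: int,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Turn per-visual-token attention weights into a binary pixel mask.

    Assumption: `attn` holds one weight per visual token, laid out row by row
    on a grid_h x grid_w grid. Thresholding is an illustrative stand-in for
    whatever mask-generation step the paper actually uses.
    """
    if len(attn) != grid_h * grid_w:
        raise ValueError("attention length must equal grid_h * grid_w")

    lo, hi = min(attn), max(attn)
    span = hi - lo
    # Min-max normalize so the threshold is relative to this map's own range.
    norm = [(a - lo) / span if span > 0 else 0.0 for a in attn]

    mask = []
    for y in range(img_h):
        gy = y * grid_h // img_h  # nearest-neighbor: pixel row -> token row
        row = []
        for x in range(img_w):
            gx = x * grid_w // img_w  # pixel column -> token column
            row.append(1 if norm[gy * grid_w + gx] >= threshold else 0)
        mask.append(row)
    return mask


def peak_location(attn: List[float], grid_w: int) -> Tuple[int, int]:
    """Return the (row, col) of the visual token with the highest attention."""
    idx = max(range(len(attn)), key=attn.__getitem__)
    return divmod(idx, grid_w)


if __name__ == "__main__":
    # Toy 2x2 token grid where the top-right token receives most attention.
    toy_attn = [0.1, 0.7, 0.1, 0.1]
    for row in attention_to_mask(toy_attn, 2, 2, 4, 4):
        print(row)
    print(peak_location(toy_attn, 2))
```

In a real pipeline the attention weights would come from the LMM itself while it generates the token being grounded; the toy list here only shows the data flow from token-level attention to a pixel-level mask.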