Generalizable Entity Grounding via Assistance of Large Language Model

Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang
2024-02-05
Abstract: In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper proposes GELLA, a method for densely grounding visual entities from long captions. It uses a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation masks, and a multi-modal feature fusion module to associate each noun with its corresponding mask. Existing methods depend on an additional high-resolution image encoder, which lowers computational efficiency and limits flexibility. GELLA instead encodes the entity segmentation masks into a colormap, which preserves fine-grained mask detail while visual features are extracted from low-resolution images with the CLIP vision encoder in the LMM. Experimental results show that GELLA outperforms or is comparable to existing techniques on three tasks: panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
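The colormap idea can be illustrated with a small sketch. This is not the authors' implementation: the color scheme, the function names (entity_color, encode_colormap, decode_entity), and the use of plain Python lists are all assumptions made for illustration. It only shows the general principle that several binary entity masks can be packed into one color image, with each entity painted in its own color, and recovered again by color lookup.

```python
from typing import List, Tuple

Mask = List[List[int]]        # binary mask, H x W, values 0 or 1
Color = Tuple[int, int, int]  # RGB triple


def entity_color(index: int) -> Color:
    """Map a 1-based entity index to a distinct RGB color.

    Hypothetical scheme: the index is split into three base-256 digits,
    so every index below 2**24 gets a unique color. (0, 0, 0) is
    reserved for background.
    """
    return (index & 0xFF, (index >> 8) & 0xFF, (index >> 16) & 0xFF)


def encode_colormap(masks: List[Mask]) -> List[List[Color]]:
    """Paint each entity mask into one shared color image.

    Assumes a non-empty list of masks that all share the same size.
    Where masks overlap, the later mask overwrites the earlier one.
    """
    height, width = len(masks[0]), len(masks[0][0])
    colormap = [[(0, 0, 0)] * width for _ in range(height)]
    for index, mask in enumerate(masks, start=1):
        color = entity_color(index)
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    colormap[y][x] = color
    return colormap


def decode_entity(colormap: List[List[Color]], index: int) -> Mask:
    """Recover the binary mask of one entity by matching its color."""
    color = entity_color(index)
    return [[1 if pixel == color else 0 for pixel in row] for row in colormap]


# Two small, non-overlapping entity masks on a 2 x 4 image.
mask_a = [[1, 1, 0, 0],
          [1, 1, 0, 0]]
mask_b = [[0, 0, 1, 0],
          [0, 0, 1, 0]]
colormap = encode_colormap([mask_a, mask_b])
```

In this sketch the colormap is a single image regardless of how many entities it holds, which is why it could be passed through one image encoder rather than one pass per mask. How the paper actually chooses colors and feeds the colormap to the model is not specified in the abstract.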