LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang

2023-12-06

Abstract: With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the weak grounding capability of existing large multimodal models (LMMs) in visual chat. Existing LMMs generate reasonable responses from an image and a user instruction, but they struggle with fine-grained image understanding, in particular with tying the phrases in a response to the specific image regions they refer to. Their chat performance also drops significantly when they are asked to ground, mainly because no dedicated dataset for grounded visual chat (GVC) exists. To address these issues, the paper makes the following contributions:

1. **Data Creation**: A high-quality GVC dataset of 150K instances, built from human-annotated object detection data and GPT-4's matching capability.
2. **Network Architecture**: An end-to-end model, LLaVA-Grounding (LLaVA-G), which connects a large multimodal model with a grounding model. It supports object-level and pixel-level grounding and accepts several kinds of visual prompts (such as marks, clicks, boxes, and scribbles).
3. **Benchmarking**: The Grounding-Bench benchmark, which evaluates a model's grounding and chat performance together, along with an automatic evaluation pipeline that uses GPT-4 for assessment.

Through these contributions, the paper aims to improve LMMs on grounded visual chat so that they perform well at chat and grounding simultaneously.
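To make the idea of a grounded response concrete, the following is a minimal sketch and not the paper's data format or evaluation code. It assumes a hypothetical schema in which each phrase of a response carries one bounding box in `(x1, y1, x2, y2)` pixel coordinates, and it scores predictions with the intersection-over-union (IoU) test at a 0.5 threshold that is commonly used on box-grounding benchmarks such as RefCOCO. The names `GroundedPhrase`, `box_iou`, and `grounding_accuracy` are illustrative, and the GPT-4-based chat scoring of Grounding-Bench is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedPhrase:
    """A phrase from a chat response tied to one image region (hypothetical schema)."""
    text: str
    box: Box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    pred: List[GroundedPhrase],
    gold: List[GroundedPhrase],
    thresh: float = 0.5,
) -> float:
    """Fraction of gold phrases whose same-text prediction has IoU >= thresh."""
    if not gold:
        return 0.0
    pred_by_text = {p.text: p.box for p in pred}
    hits = sum(
        1
        for g in gold
        if g.text in pred_by_text and box_iou(pred_by_text[g.text], g.box) >= thresh
    )
    return hits / len(gold)


# Usage: one phrase is localized well, the other misses its region entirely.
gold = [
    GroundedPhrase("a dog", (10, 20, 110, 120)),
    GroundedPhrase("a red ball", (150, 90, 190, 130)),
]
pred = [
    GroundedPhrase("a dog", (12, 18, 108, 125)),
    GroundedPhrase("a red ball", (300, 300, 340, 340)),
]
accuracy = grounding_accuracy(pred, gold)
```

Matching phrases by exact text is a simplification chosen for brevity; a real evaluation would need a more tolerant way to align predicted and reference phrases.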

Learning Visual Grounding from Generative Vision and Language Model

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

LLM4VG: Large Language Models Evaluation for Video Grounding

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

GLaMM: Pixel Grounding Large Multimodal Model

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Learning Comprehensive Visual Grounding for Video Captioning

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Grounded 3D-LLM with Referent Tokens

Video-Guided Curriculum Learning for Spoken Video Grounding

F-LMM: Grounding Frozen Large Multimodal Models

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling