Abstract: Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and Flamingo, have made significant progress in integrating visual and textual modalities, excelling in tasks like visual question answering (VQA), image captioning, and content retrieval. They can generate coherent and contextually relevant descriptions of images. However, they still face challenges in accurately identifying and counting objects and determining their spatial locations, particularly in complex scenes with overlapping or small objects. To address these limitations, we propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images. Our framework improves the MLLM's capacity to handle tasks requiring precise visual descriptions, especially in scenarios with challenging perspectives, such as aerial views or scenes with dense object arrangements. Finally, we conduct extensive experiments on the VG-150 dataset, which focuses on first-person visual understanding, and the AUG dataset, which involves aerial imagery. The results show that our approach consistently outperforms existing MLLMs in VQA tasks: it stands out in recognizing, localizing, and quantifying objects in different spatial contexts and provides more accurate visual descriptions.

What problem does this paper attempt to address?

This paper addresses three weaknesses of multimodal large language models (MLLMs) on visual question answering (VQA) tasks:

1. **Accuracy of small-object recognition and counting**: Existing MLLMs have difficulty recognizing and counting overlapping or small objects in complex scenes. In an image containing many small or overlapping objects, for example, an MLLM may miss objects or miscount them.
2. **Determination of spatial location**: MLLMs struggle to determine where objects are, especially in complex scenes such as top-view (aerial) images or densely arranged objects. They rarely give precise absolute positions (such as "Object A is in the lower-right corner") and more often fall back on relative positions (such as "Object A is above Object B").
3. **Recognition of object relationships**: MLLMs also have trouble recognizing relationships between objects, especially when a question requires a fine-grained answer, such as how many objects belong to a specific category or what specific relationships hold between objects.

To address these problems, the paper proposes a method based on a multimodal retrieval-augmented generation (RAG) framework that introduces structured scene graphs to enhance object recognition, relationship recognition, and spatial understanding. The method aims to improve how MLLMs handle tasks that require precise visual descriptions, especially in scenes with challenging perspectives such as first-person and top-view images. In experiments on the VG-150 and AUG datasets, the method consistently outperforms existing MLLMs at recognizing object categories, quantities, locations, and relationships, with the clearest gains in small-object detection, precise positioning, and relationship recognition. This indicates that the method understands and describes image content more accurately, which in turn improves performance on VQA tasks.
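To make the idea of a structured scene graph concrete, the following is a minimal sketch of how detected objects and relations might be turned into text that an MLLM receives alongside the question. It is not the paper's implementation: the `SceneObject` class, the `absolute_region` and `scene_graph_to_context` functions, the nine-region grid, and the output format are all assumptions made for illustration. It only shows how explicit counts, absolute positions, and relations can be derived from bounding boxes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    # Hypothetical node type: a label plus a pixel box (x_min, y_min, x_max, y_max).
    name: str
    box: Tuple[float, float, float, float]


def absolute_region(box, width, height):
    """Map the centre of a box to one of nine coarse image regions."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Split each axis into thirds; clamp so a centre on the far edge stays in range.
    col = ("left", "center", "right")[min(int(3 * cx / width), 2)]
    row = ("upper", "middle", "lower")[min(int(3 * cy / height), 2)]
    if row == "middle" and col == "center":
        return "center"
    return f"{row}-{col}"


def scene_graph_to_context(objects: List[SceneObject], relations, width, height):
    """Serialize nodes and edges into plain text for the model prompt.

    `relations` is a list of (subject_index, predicate, object_index) triples.
    """
    counts = Counter(obj.name for obj in objects)
    lines = ["Object counts: " + ", ".join(f"{n} x{c}" for n, c in sorted(counts.items()))]
    for i, obj in enumerate(objects):
        lines.append(f"[{i}] {obj.name} at {absolute_region(obj.box, width, height)}")
    for subj, pred, obj_idx in relations:
        lines.append(f"[{subj}] {objects[subj].name} {pred} [{obj_idx}] {objects[obj_idx].name}")
    return "\n".join(lines)


# Made-up example scene on a 640x480 image.
objects = [
    SceneObject("car", (500, 400, 620, 470)),
    SceneObject("car", (10, 10, 60, 40)),
    SceneObject("person", (300, 220, 330, 260)),
]
relations = [(2, "next to", 0)]
context = scene_graph_to_context(objects, relations, width=640, height=480)
print(context)
```

The resulting text could be prepended to a question such as "How many cars are in the image, and where are they?". Because counts and positions are stated explicitly, the model does not have to infer them from pixels alone, which is the gap the three problems above describe.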

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

InfMLLM: A Unified Framework for Visual-Language Tasks

Prompting Large Language Models with Fine-Grained Visual Relations from Scene Graph for Visual Question Answering

MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training

mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

LLMGA: Multimodal Large Language Model based Generation Assistant

Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts

Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models

Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model

Multi-modal Auto-regressive Modeling via Visual Words

Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

Enhancing Advanced Visual Reasoning Ability of Large Language Models