Abstract: The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA). Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate not just recognition of visual elements, but also a deep comprehension of the visual information in conjunction with a vast repository of learned knowledge. To uncover such capabilities of MLMs, particularly the newly introduced GPT-4V and Gemini, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect them to general knowledge; 2) Fine-grained World Knowledge, which tests the models' skill in reasoning out specific knowledge from images, showcasing their proficiency across various specialized fields; 3) Comprehensive Knowledge with Decision-making Rationales, which examines the model's capability to provide logical explanations for its inference, facilitating a deeper analysis from the interpretability perspective. Additionally, we utilize a visual knowledge-enhanced training strategy and a multimodal retrieval-augmented generation approach to enhance MLMs, highlighting the future need for advancements in this research direction. Extensive experiments indicate that: a) GPT-4V demonstrates enhanced explanation generation when using composite images as few-shot examples; b) GPT-4V and other MLMs produce severe hallucinations when dealing with world knowledge; c) visual knowledge-enhanced training and prompting techniques show potential to improve performance. Code: https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper

What problem does this paper attempt to address?

This paper evaluates multimodal large models (MLMs), especially GPT-4V, on knowledge-intensive visual question answering (VQA) tasks. Specifically, the paper aims to:

1. **Evaluate the knowledge understanding and reasoning abilities of MLMs**:
   - Evaluate MLMs along three dimensions: commonsense knowledge, fine-grained world knowledge, and comprehensive knowledge with decision-making rationales.
   - Design a benchmarking framework that covers a wide range of knowledge types and categories, so that the understanding, reasoning, and interpretation abilities of MLMs such as GPT-4V are assessed comprehensively.
2. **Explore the performance differences of MLMs across knowledge domains**:
   - Analyze the understanding and reasoning abilities of MLMs in different knowledge categories (such as commonsense knowledge and fine-grained world knowledge), and reveal how their performance differs between fields.
3. **Identify the limitations of existing MLMs and suggest improvements**:
   - Identify the main problems of current MLMs when dealing with fine-grained world knowledge: insufficient answers due to lack of context, visual illusions, insufficient combination of the visual and knowledge dimensions, and over-reliance on visual cues.
   - Propose a visual knowledge-enhanced training strategy and a multimodal retrieval-augmented generation method to improve the performance of MLMs.

### Specific research contents

1. **Commonsense knowledge evaluation**:
   - Use a subset of the OK-VQA dataset, covering multiple commonsense knowledge categories (such as plants and animals, cooking and food, and science and technology), to evaluate the performance of MLMs in commonsense question answering.
   - The results show that GPT-4V performs well in most commonsense knowledge categories, but still faces challenges in some specific fields (such as vehicles and object materials).
2. **Fine-grained world knowledge evaluation**:
   - Use a subset of the INFOSEEK dataset, covering multiple fine-grained world knowledge categories such as geography, history, and science, to evaluate the performance of MLMs in fine-grained world knowledge question answering.
   - The results indicate that although GPT-4V is generally superior to open-source MLMs, it still struggles with fine-grained world knowledge, especially when detailed background knowledge is required.
3. **Comprehensive knowledge evaluation with decision-making rationales**:
   - Use the A-OKVQA dataset to evaluate the ability of MLMs to provide logical explanations and decision-making rationales.
   - The results show that GPT-4V can generate a relatively detailed reasoning process, but misjudgments or misunderstandings still occur in some cases.

### Main findings

- **MLM performance differs significantly across knowledge domains**: GPT-4V performs better than the open-source MLMs on commonsense knowledge and fine-grained world knowledge, but there is still room for improvement in some specific fields.
- **Fine-grained world knowledge question answering is challenging**: GPT-4V faces four main problems when dealing with fine-grained world knowledge: lack of context, visual illusions, insufficient combination of the visual and knowledge dimensions, and over-reliance on visual cues.
- **Enhanced training and prompting techniques help improve performance**: Using composite images as in-context reference examples can effectively improve the question-answering accuracy of GPT-4V, especially in the few-shot setting.

In general, through a comprehensive evaluation of MLMs such as GPT-4V, this paper reveals their advantages and limitations in knowledge-intensive VQA tasks and provides improvement suggestions for future research.
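
The per-category comparison described above can be illustrated with a minimal sketch. It assumes the standard VQA soft-accuracy score (a prediction earns `min(matches / 3, 1)` against the human answers); this summary does not state which metric the paper actually uses, and the function names, record fields, and sample data below are hypothetical.

```python
from collections import defaultdict


def vqa_soft_accuracy(prediction, human_answers):
    """Soft VQA score: min(number of matching human answers / 3, 1).

    Assumption: plain lowercase/whitespace normalization; real evaluation
    scripts usually normalize punctuation and articles as well.
    """
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def accuracy_by_category(records):
    """Average the soft score separately for each knowledge category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        score = vqa_soft_accuracy(rec["prediction"], rec["answers"])
        totals[rec["category"]] += score
        counts[rec["category"]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}


# Hypothetical model outputs, for illustration only.
sample_records = [
    {"category": "vehicles", "prediction": "Bus",
     "answers": ["bus", "bus", "bus", "truck"]},
    {"category": "vehicles", "prediction": "car",
     "answers": ["bus", "car", "truck"]},
    {"category": "cooking and food", "prediction": "pizza",
     "answers": ["pizza", "pizza", "pie"]},
]

per_category = accuracy_by_category(sample_records)
for category, score in sorted(per_category.items()):
    print(f"{category}: {score:.3f}")
```

Reporting one score per category, rather than a single overall number, is what makes weak areas such as vehicles or object materials visible.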

A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering
