Abstract: Recent advances in multimodal methods have produced models that process diverse data types, including text, audio, and visual content. Models such as GPT-4V, which combine computer vision with advanced language processing, perform well on complex tasks that require understanding textual and visual information together. Prior work has evaluated these Vision Large Language Models (VLLMs) in domains such as object detection and image captioning. However, existing analyses mostly evaluate each modality in isolation and do not examine cross-modal interactions. In particular, it remains unanswered whether these models achieve the same accuracy when given identical task instances in different modalities. In this study, we compare the modalities directly by introducing a novel concept termed cross-modal consistency, and we propose a quantitative evaluation framework based on it. Our experiments, run on a curated collection of parallel vision-language datasets that we developed, reveal a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our findings offer insights into the appropriate use of such models and suggest potential avenues for improving their design.
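The abstract does not spell out how cross-modal consistency is quantified, so the following is only a minimal sketch under stated assumptions, not the framework proposed in the paper. It assumes each task instance has been posed to the model twice, once as text and once as an image, and that each answer has been graded as correct or incorrect. The `PairedResult` record and both function names are hypothetical. The sketch reports per-modality accuracy alongside a simple agreement rate, meaning the fraction of instances on which the two modalities yield the same graded outcome.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PairedResult:
    """Graded outcome of one task instance posed in two modalities (hypothetical record)."""
    instance_id: str
    text_correct: bool   # answer was correct when the task was given as text
    image_correct: bool  # answer was correct when the same task was given as an image


def modality_accuracies(results: Sequence[PairedResult]) -> Tuple[float, float]:
    """Return (text accuracy, image accuracy) over the paired instances."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    text_acc = sum(r.text_correct for r in results) / n
    image_acc = sum(r.image_correct for r in results) / n
    return text_acc, image_acc


def consistency_rate(results: Sequence[PairedResult]) -> float:
    """Fraction of instances whose graded outcome is identical in both modalities.

    Assumption: agreement on the correct/incorrect outcome is used as a proxy
    for consistency; the paper may define the measure differently.
    """
    if not results:
        raise ValueError("results must not be empty")
    agree = sum(r.text_correct == r.image_correct for r in results)
    return agree / len(results)
```

Reporting both numbers is deliberate. Accuracy shows how well each modality performs on its own, while the agreement rate works at the level of individual instances. Two modalities can have equal accuracy and still disagree on which instances they solve.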

Exploring the Capabilities of Large Multimodal Models on Dense Text

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

What Large Language Models Bring to Text-rich VQA?

MMR: Evaluating Reading Ability of Large Multimodal Models

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

A Survey on Multimodal Large Language Models

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Cross-Modal Consistency in Multimodal Large Language Models

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

A Survey on Evaluation of Multimodal Large Language Models

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

A Survey of Multimodal Large Language Model from A Data-centric Perspective

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models