From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang

2024-01-29

Abstract: Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of the gap through the lens of a qualitative study on the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are several representative factors that define the reliability of MLLMs in supporting various downstream applications. To be specific, we evaluate the closed-source GPT-4 and Gemini and 6 open-source LLMs and MLLMs. Overall, we evaluate 230 manually designed cases, where the qualitative results are then summarized into 12 scores (i.e., 4 modalities times 3 properties). In total, we uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, towards more reliable downstream multi-modal applications.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper examines gaps in the generalizability, trustworthiness, and causal reasoning of current multimodal large language models (MLLMs), evaluating them through qualitative case studies across four modalities (text, code, image, and video). The study covers the closed-source GPT-4 and Gemini as well as six open-source LLMs and MLLMs. The authors evaluate 230 manually designed cases and summarize the results into 12 scores (4 modalities times 3 properties) to characterize the capabilities and limitations of these models and to support more reliable downstream multimodal applications. The authors found that although GPT-4 and Gemini perform well in some respects, both they and the open-source models have shortcomings in multilingual comprehension, mathematical and reasoning abilities, domain knowledge application, trustworthiness and security of text and code, understanding of causal relations, and video processing. For example, Gemini outperforms GPT-4 in multilingual translation but performs poorly in mathematical and reasoning tasks. In image understanding, all models have room for improvement in precise localization and information extraction. In video understanding, the models have limited capabilities in handling complex reasoning tasks. Furthermore, the paper highlights the challenges the models face in security and ethics, such as susceptibility to induced errors, inaccurate chemical knowledge, and insufficient identification of harmful information. Overall, this research reveals several key areas in which existing MLLMs need improvement for practical applications.

A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

A Survey on Multimodal Large Language Models

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

A Survey on Evaluation of Multimodal Large Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

A Survey on Benchmarks of Multimodal Large Language Models

MM-LLMs: Recent Advances in MultiModal Large Language Models

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning