Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang
2024-01-18
Abstract: Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The core problem this paper addresses is that the reasoning ability of Multimodal Large Language Models (MLLMs) has not been systematically studied or evaluated. Although MLLMs perform well on a variety of multimodal tasks, how well they actually reason, where they fall short, and how their reasoning can be improved remain unclear. The survey therefore reviews the existing multimodal reasoning evaluation protocols, categorizes and presents the frontiers of MLLMs, introduces recent trends in applying MLLMs to reasoning-intensive tasks, and discusses current practices and future directions.

### Specific problems include:

1. **Definition and evaluation of reasoning ability**:
   - Clearly define the reasoning ability of MLLMs.
   - Introduce the existing reasoning evaluation protocols, with attention to whether they truly reflect a model's reasoning ability.
2. **Summary of the current state of existing MLLMs**:
   - Summarize the current technical architectures, training data, and training stages of MLLMs.
   - Analyze the performance of different MLLMs on various benchmarks.
3. **Multimodal instruction fine-tuning**:
   - Explore how multimodal instruction fine-tuning can improve the reasoning ability of MLLMs.
4. **Reasoning-intensive applications**:
   - Examine the applications of MLLMs in fields such as Embodied AI and tool use.
5. **Analysis of multimodal reasoning benchmark results**:
   - Analyze the performance of MLLMs on multimodal reasoning benchmarks and identify areas that need improvement.
6. **Future research directions**:
   - Provide in-depth insights into the current state of the field and point out directions for future research.

### Importance of the paper

By systematically evaluating and analyzing the reasoning ability of MLLMs, the paper aims to establish a solid foundation for this field and provide guidance for future multimodal reasoning research. This will not only help advance the development of MLLMs but also contribute to the goal of achieving Strong AI or Artificial General Intelligence (AGI).

### Key formulas and concepts

- **Types of reasoning**:
  - **Deductive Reasoning**: Derive conclusions from known premises.
    \[ \text{If } P \rightarrow Q, \text{ and } P \text{ holds, then } Q \text{ can be inferred} \]
  - **Abductive Reasoning**: Infer the most likely cause from the observed results.
  - **Analogical Reasoning**: Transfer knowledge from one instance to another based on similarity.

These types of reasoning are the key capabilities of MLLMs when handling complex tasks, and the paper organizes its in-depth discussion around them.
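
To make the difference between deductive and abductive reasoning concrete, here is a minimal symbolic sketch. It is not taken from the survey and says nothing about how MLLMs reason internally. The function names `deduce` and `abduce`, the rule format, and the toy facts are all hypothetical, chosen only for illustration. `deduce` applies the rule "if P implies Q and P holds, infer Q" until nothing new follows. `abduce` ranks candidate causes by how many observations each one would explain, which is only one simple stand-in for "most likely cause".

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical rule format: (premise P, conclusion Q), read as P -> Q.
Rule = Tuple[str, str]


def deduce(facts: Iterable[str], rules: Iterable[Rule]) -> Set[str]:
    """Deductive reasoning: apply modus ponens until no new fact appears."""
    known = set(facts)
    rule_list = list(rules)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rule_list:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def abduce(observations: Iterable[str], causes: Dict[str, Set[str]]) -> List[str]:
    """Abductive reasoning (toy version): rank causes by explained observations."""
    observed = set(observations)
    scored = [(len(effects & observed), cause) for cause, effects in causes.items()]
    # Highest coverage first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Causes that explain nothing are dropped.
    return [cause for score, cause in scored if score > 0]


if __name__ == "__main__":
    toy_rules = [("rain", "wet_ground"), ("wet_ground", "slippery")]
    print(sorted(deduce({"rain"}, toy_rules)))

    toy_causes = {"rain": {"wet_ground", "clouds"}, "sprinkler": {"wet_ground"}}
    print(abduce({"wet_ground", "clouds"}, toy_causes))
```

Deduction only ever adds conclusions that are guaranteed by the premises, whereas the abductive ranking is a heuristic guess that further evidence can overturn. That gap is one reason the two abilities are usually evaluated separately.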