Abstract: Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning.

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is that the reasoning ability of Multimodal Large Language Models (MLLMs) has not been systematically studied and evaluated. Although MLLMs perform well in a variety of multimodal tasks, the specific performance, limitations, and improvement methods of their reasoning ability remain unclear. Therefore, this review article aims to comprehensively review the existing multimodal reasoning evaluation protocols, classify and present the cutting-edge progress of MLLMs, introduce the latest application trends of MLLMs in reasoning-intensive tasks, and discuss current research practices and future directions.

### Specific problems include:

1. **Definition and evaluation of reasoning ability**:
   - Clearly define the reasoning ability of MLLMs.
   - Introduce the existing reasoning evaluation protocols to ensure that these protocols can truly reflect the model's reasoning ability.
2. **Summary of the current state of existing MLLMs**:
   - Summarize the current technical architectures, training data, and training stages of MLLMs.
   - Analyze the performance of different MLLMs in various benchmark tests.
3. **Multimodal instruction fine-tuning**:
   - Explore how to improve the reasoning ability of MLLMs through multimodal instruction fine-tuning.
4. **Reasoning-intensive applications**:
   - Research the applications of MLLMs in fields such as Embodied AI and tool use.
5. **Analysis of multimodal reasoning benchmark test results**:
   - Analyze the performance of MLLMs in multimodal reasoning benchmark tests and identify areas that need improvement.
6. **Future research directions**:
   - Provide in-depth insights into the current state and point out directions for future research.

### Importance of the paper

By systematically evaluating and analyzing the reasoning ability of MLLMs, this paper hopes to establish a solid foundation for this important field and provide guidance for future multimodal reasoning research. This will not only help promote the development of MLLMs but also contribute to the goal of achieving Strong AI or Artificial General Intelligence (AGI).

### Key formulas and concepts

- **Types of reasoning**:
  - **Deductive Reasoning**: Derive conclusions from known premises.
    \[
    \text{If } P \rightarrow Q, \text{ and } P \text{ holds, then } Q \text{ can be inferred}
    \]
  - **Abductive Reasoning**: Infer the most likely cause from the observed results.
  - **Analogical Reasoning**: Transfer knowledge from one instance to another based on similarity.

These types of reasoning are the key capabilities of MLLMs when handling complex tasks, and this paper conducts in-depth discussions around these capabilities.
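
The deductive rule stated above (modus ponens) can also be written as a one-line formal proof. The following is a minimal Lean sketch for illustration only; it is not taken from the paper, and the names `P`, `Q`, `h`, and `hp` are assumptions chosen here.

```lean
-- Modus ponens: given a proof h of P → Q and a proof hp of P,
-- applying h to hp yields a proof of Q.
example (P Q : Prop) (h : P → Q) (hp : P) : Q := h hp
```

Abductive and analogical reasoning have no comparably simple formal rule, since they produce plausible rather than guaranteed conclusions, so no sketch is attempted for them here.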

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
