Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

DOI: https://doi.org/10.48550/arXiv.2401.06805

2024-01-10

Computation and Language

Abstract: Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning.

What problem does this paper attempt to address?

The paper primarily explores the research progress of Multimodal Large Language Models (MLLMs) in terms of reasoning capabilities and provides a systematic review of the current issues. Specifically:

1. **Research Background and Objectives**: One of the goals of Strong AI or Artificial General Intelligence (AGI) is to possess abstract reasoning abilities. In recent years, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have demonstrated powerful capabilities in various tasks, but the reasoning abilities of these models have not yet been systematically studied.
2. **Review of Reasoning Capabilities**: The paper comprehensively reviews existing multimodal reasoning evaluation protocols, categorizes and explains the cutting-edge developments of MLLMs, introduces the latest application trends of MLLMs in reasoning-intensive tasks, and discusses current practices and future directions.
3. **Types of Reasoning**: The paper focuses on three types of reasoning: deductive reasoning, abductive reasoning, and analogical reasoning, which have wide applications in real-world tasks.
4. **Evaluation Benchmarks**: The paper analyzes current multimodal reasoning benchmark datasets, pointing out that most existing benchmark datasets do not focus on the evaluation of reasoning capabilities, and thus proposes more ideal evaluation standards.
5. **Methods to Improve Reasoning Capabilities**: The paper discusses methods to enhance the reasoning capabilities of LLMs through supervised learning, in-context learning, and prompt engineering, and explores the application of these methods in MLLMs.

In summary, this paper aims to comprehensively summarize the current state of development of multimodal large language models in terms of reasoning capabilities, providing guidance and inspiration for future related research.

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

Towards Reasoning in Large Language Models: A Survey

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

A Survey on Evaluation of Multimodal Large Language Models

A Survey on Multimodal Large Language Models

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

A Survey on Benchmarks of Multimodal Large Language Models

Efficient Multimodal Large Language Models: A Survey

Reasoning with Large Language Models, a Survey

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

LLMs for Relational Reasoning: How Far are We?

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

Large Multimodal Agents: A Survey