Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang
DOI: https://doi.org/10.48550/arXiv.2401.06805
2024-01-10
Computation and Language
Abstract: Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning.
What problem does this paper attempt to address?
The paper surveys research progress on the reasoning capabilities of Multimodal Large Language Models (MLLMs) and systematically reviews the open issues in this area. Specifically:

1. **Research Background and Objectives**: Abstract reasoning ability is one of the goals of Strong AI, or Artificial General Intelligence (AGI). In recent years, Large Language Models (LLMs) and MLLMs have demonstrated powerful capabilities across a variety of tasks, but the reasoning abilities of MLLMs have not yet been systematically studied.
2. **Review of Reasoning Capabilities**: The paper reviews existing evaluation protocols for multimodal reasoning, categorizes and explains the frontier developments of MLLMs, introduces recent trends in applying MLLMs to reasoning-intensive tasks, and discusses current practices and future directions.
3. **Types of Reasoning**: The paper focuses on three types of reasoning that are widely applied in real-world tasks: deductive, abductive, and analogical reasoning.
4. **Evaluation Benchmarks**: The paper analyzes current multimodal reasoning benchmark datasets, points out that most of them do not focus on evaluating reasoning capabilities, and therefore proposes more suitable evaluation standards.
5. **Methods to Improve Reasoning Capabilities**: The paper discusses methods for enhancing the reasoning capabilities of LLMs through supervised learning, in-context learning, and prompt engineering, and explores how these methods apply to MLLMs.

In summary, the paper aims to give a comprehensive account of the current state of multimodal large language models with respect to reasoning capabilities, providing guidance and inspiration for future research.