MM-LLMs: Recent Advances in MultiModal Large Language Models

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, Dong Yu
2024-05-28
Abstract: In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.
Computation and Language
What problem does this paper attempt to address?
This paper surveys recent progress in multimodal large language models (MM-LLMs). These models augment off-the-shelf large language models (LLMs) with cost-effective training strategies so that they can handle multimodal inputs or outputs. MM-LLMs retain the inherent reasoning and decision-making abilities of LLMs while extending them to a diverse range of multimodal tasks. The paper outlines general design formulations for the model architecture and training pipeline, introduces a taxonomy of 126 MM-LLMs, each characterized by its specific formulations, and reviews the performance of selected models on mainstream benchmarks. It also summarizes key training recipes for improving MM-LLMs and explores promising directions for future research. Finally, the authors maintain a website that tracks the latest developments in the field in real time.