Cross-modal Large Language Models: Progress and Prospects

Abstract: Conversational large language models (LLMs), such as ChatGPT, have achieved remarkable advances in in-context learning and reasoning by drawing on massive training data and large-scale model parameters. Building on these breakthroughs in text-based language models, a significant recent technological trend is toward understanding and generating other modalities, such as speech, images, and graphics, which has driven the transition to cross-modal LLMs. With the rapid development of large models, cross-modal LLMs have gradually acquired strong multimodal perception and initial cross-modal cognitive abilities. This article first gives a comprehensive overview of the evolution of cross-modal LLM technology from three perspectives: multimodal large perception models, cross-modal large cognitive models, and distributed agent systems. It then summarizes the relevant evaluation benchmarks. Finally, it discusses the technical challenges that cross-modal LLMs currently face and potential research directions.
Linguistics, Computer Science