Abstract: While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small fraction of parameters (1%) in certain projection layers, which not only enables low-cost training but also facilitates convenient expansion to more potential modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses a gap in current Multimodal Large Language Models (MM-LLMs): they understand multiple modalities at the input end but lack multimodal generation at the output end. Existing models can take in text, images, videos, and audio, yet they generally cannot produce content in multiple modalities, so they fall short of seamless conversion between arbitrary modalities.

The paper proposes NExT-GPT, an end-to-end general-purpose MM-LLM for any-to-any input and output. NExT-GPT accepts input and generates output in arbitrary combinations of text, images, videos, and audio. Because it connects an LLM to existing high-performance encoders and diffusion decoders, it avoids the cost of training the whole system from scratch while still supporting flexible multimodal understanding and generation.

The main contributions include:

1. **End-to-end any-to-any input and output**: NExT-GPT handles input and output across text, images, videos, and audio.
2. **Lightweight alignment learning**: Alignment is learned on both the encoding side and the decoding side, so semantic alignment is reached by tuning only a small number of parameters (about 1%, in certain projection layers).
3. **Cross-modal instruction-tuning dataset**: A high-quality dataset for modality-switching instruction tuning (MosIT) was manually curated, covering complex cross-modal understanding and generation tasks.
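The "small number of parameters" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the component names and every parameter count are hypothetical placeholders chosen so the numbers are easy to follow, not figures from the paper, and it assumes for simplicity that everything other than the projection layers stays frozen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One stage of a hypothetical any-to-any pipeline."""
    name: str
    num_params: int   # placeholder count, NOT taken from the paper
    trainable: bool   # False means the pretrained weights stay frozen


def trainable_fraction(components):
    """Return the share of all parameters that would receive updates."""
    total = sum(c.num_params for c in components)
    if total == 0:
        raise ValueError("pipeline has no parameters")
    tuned = sum(c.num_params for c in components if c.trainable)
    return tuned / total


# Illustrative sizes only (assumption): encoder -> input projection ->
# LLM -> output projection -> diffusion decoders.
PIPELINE = [
    Component("modality encoder", 1_000_000_000, False),
    Component("input projection", 30_000_000, True),
    Component("LLM backbone", 7_000_000_000, False),
    Component("output projection", 70_000_000, True),
    Component("diffusion decoders", 1_900_000_000, False),
]

if __name__ == "__main__":
    print(f"trainable share: {trainable_fraction(PIPELINE):.2%}")
```

With these placeholder sizes, the two projection layers together hold 100 million of 10 billion parameters, which is the order of magnitude the abstract describes. Real counts depend on the specific encoders, LLM, and decoders used.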

NExT-GPT: Any-to-Any Multimodal LLM

X-GACMN: An X-Shaped Generative Adversarial Cross-Modal Network With Hypersphere Embedding

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

MIO: A Foundation Model on Multimodal Tokens

WorldGPT: Empowering LLM as Multimodal World Model

Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI

NeuGPT: Unified multi-modal Neural GPT

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

A Survey on Multimodal Large Language Models

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Spider: Any-to-Many Multimodal LLM

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding