Abstract: Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities 'Text + X' within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, our Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents, and an Any-to-Many Instruction Template designed for producing Xs signal prompts. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates the learning of the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Ultimately, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for the AMMG task in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.

What problem does this paper attempt to address?

The problem this paper addresses is that existing Any-to-Any multimodal large language models (MLLMs) can only generate pairwise modality combinations ("Text + X") in a single response, such as "text + image" or "text + audio", and cannot generate an arbitrary combination of modalities in one response. To overcome this limitation, the authors introduce Spider, an efficient Any-to-Many Modalities Generation (AMMG) framework that can generate arbitrary combinations of modalities, such as "text + image + audio + video".

### Specific Problem Description

1. **Limitations of Existing Models**:
   - Existing Any-to-Any MLLMs are limited to generating pairwise combinations (Text + X), such as Text + Image or Text + Audio.
   - Users need to interact multiple times to obtain content in several modalities, resulting in a disjointed user experience.
2. **Objectives**:
   - Achieve Any-to-Many Modalities Generation (AMMG), that is, generate an arbitrary combination of modalities (Text + Xs) in one response, such as Text + {Image, Audio, Video}.

### Main Contributions of the Spider Model

1. **Proposing a New AMMG Framework**:
   - The Spider framework can generate an arbitrary combination of modalities in one response, improving the user experience.
2. **Designing an Efficient Decoders-Controller**:
   - The Unified Decoder Projector and the TM-Fusion module provide efficient scheduling and control of multiple decoders.
3. **Designing an Any-to-Many Instruction Template**:
   - The template enables the large language model (LLM) to understand multimodal instructions and produce many-modal (Xs) signal prompts, so that AMMG can be performed accurately.
4. **Constructing a New Many-Modal Dataset (TMM dataset)**:
   - The Text-formatted Many-Modal (TMM) dataset is constructed to train Spider and give it the X-to-Xs capability.
   - The trained Spider generates the first pseudo X-to-Xs many-modal dataset, providing data support for future research on the AMMG task.

### Summary

This paper aims to move beyond the existing pairwise multimodal generation paradigm. By proposing the Spider model and its components, it generates arbitrary combinations of modalities in one response, addresses the limitation of existing models to "Text + X" outputs, and provides a new dataset and method to support future research.
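
The difference between "Text + X" and "Text + Xs" can be illustrated with a minimal Python sketch: a single text response carries signal prompts for several modalities, and each prompt is routed to its own decoder. This is not Spider's implementation. The tag format (`<IMAGE>…</IMAGE>` and so on), the function names, and the stub decoders are all hypothetical and chosen only for illustration; the paper's actual instruction template and Decoders-Controller may look quite different.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical tag format for signal prompts (an assumption, not the paper's template).
SIGNAL_PATTERN = re.compile(r"<(IMAGE|AUDIO|VIDEO)>(.*?)</\1>", re.DOTALL)


def split_response(response: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Separate the plain text from the (modality, prompt) signal pairs."""
    signals = [
        (match.group(1).lower(), match.group(2).strip())
        for match in SIGNAL_PATTERN.finditer(response)
    ]
    # Remove the tagged spans, then collapse leftover whitespace.
    text = re.sub(r"\s+", " ", SIGNAL_PATTERN.sub("", response)).strip()
    return text, signals


def make_stub_decoder(modality: str) -> Callable[[str], str]:
    """Return a placeholder decoder; a real system would call a generative model."""
    def decode(prompt: str) -> str:
        return f"[{modality} generated from: {prompt}]"
    return decode


# One stub decoder per modality (hypothetical stand-ins for real decoders).
DECODERS: Dict[str, Callable[[str], str]] = {
    name: make_stub_decoder(name) for name in ("image", "audio", "video")
}


def generate_many(response: str) -> Dict[str, object]:
    """Route every signal prompt in one response to its matching decoder."""
    text, signals = split_response(response)
    outputs = [(modality, DECODERS[modality](prompt)) for modality, prompt in signals]
    return {"text": text, "outputs": outputs}


if __name__ == "__main__":
    demo = (
        "Here is a beach scene. <IMAGE>a sunny beach</IMAGE> "
        "<AUDIO>waves crashing on the shore</AUDIO>"
    )
    print(generate_many(demo))
```

In this sketch a "Text + X" model would correspond to a response containing at most one tagged span, whereas the "Text + Xs" setting allows any number of spans across modalities in the same response.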

Spider: Any-to-Many Multimodal LLM

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

LLMs Meet Multimodal Generation and Editing: A Survey

NExT-GPT: Any-to-Any Multimodal LLM

LLMs Can Evolve Continually on Modality for X-Modal Reasoning

Large Multimodal Agents: A Survey

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

OneLLM: One Framework to Align All Modalities with Language

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

ModaVerse: Efficiently Transforming Modalities with LLMs

MM-LLMs: Recent Advances in MultiModal Large Language Models

Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Large AI Model Empowered Multimodal Semantic Communications

MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models

Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs