Spider: Any-to-Many Multimodal LLM

Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, Song Guo
2024-11-15
Abstract: Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities 'Text + X' within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents, and an Any-to-Many Instruction Template designed for producing Xs signal prompts. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates the learning of the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Ultimately, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for the AMMG task in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a limitation of existing Any-to-Any multimodal large language models (MLLMs): within a single response they can generate only a pairwise combination of modalities, "Text + X", such as Text + Image or Text + Audio. They cannot produce an arbitrary combination of modalities in one response. To overcome this limitation, the authors introduce Spider, an efficient Any-to-Many Modalities Generation (AMMG) framework that can generate arbitrary combinations, "Text + Xs", such as Text + {Image and Audio and Video}.

### Specific Problem Description

1. **Limitations of Existing Models**:
   - Existing MLLMs are limited to generating pairwise modalities (Text + X) within a single response, such as Text + Image or Text + Audio.
   - Users must interact multiple times to obtain content in several modalities, which results in a disjointed user experience.
2. **Objectives**:
   - Achieve Any-to-Many Modalities Generation (AMMG), that is, generate an arbitrary combination of modalities (Text + Xs) in one response, such as Text + {Image and Audio and Video}.

### Main Contributions of the Spider Model

1. **Proposing a New AMMG Framework**:
   - The Spider framework can generate an arbitrary combination of modalities in one response, enhancing the user experience.
2. **Designing an Efficient Decoders-Controller**:
   - Through the Unified Decoder Projector and the TM-Fusion module, multiple decoders are scheduled and controlled efficiently.
3. **Designing an Any-to-Many Instruction Template**:
   - The template enables the LLM to understand multimodal instructions and produce many-modal (Xs) signal prompts, so that AMMG is performed accurately.
4. **Constructing a New Many-Modal Dataset (TMM Dataset)**:
   - The authors constructed the Text-formatted Many-Modal (TMM) dataset to train Spider, giving it the X-to-Xs (Any-to-Many) capability.
   - The trained Spider then generated a pseudo X-to-Xs dataset, described as the first X-to-Xs many-modal dataset, providing data support for future research on the AMMG task.

### Summary

This paper aims to move beyond the existing pairwise multimodal generation paradigm. With Spider and its components, it generates an arbitrary combination of modalities in one response, addresses the limitation of existing models, and provides a new dataset and method for future research.
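
As a rough illustration of what a single "Text + Xs" response could involve, the sketch below splits one LLM output into plain text plus several modality signal prompts and routes each prompt to a matching decoder. It is not Spider's implementation: the `<IMAGE>…</IMAGE>` style tags, the function names `split_response` and `dispatch`, and the stub decoders are all hypothetical, since the actual instruction template and decoder interfaces are not described here.

```python
import re
from typing import Callable, Dict, List, Tuple

# ASSUMPTION: a made-up tag format for signal prompts. The real
# Any-to-Many Instruction Template is not specified in this summary.
SIGNAL_PATTERN = re.compile(r"<(IMAGE|AUDIO|VIDEO)>(.*?)</\1>", re.DOTALL)


def split_response(response: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Separate the plain text from the (modality, prompt) signal pairs."""
    signals = [
        (match.group(1).lower(), match.group(2).strip())
        for match in SIGNAL_PATTERN.finditer(response)
    ]
    # Remove the tagged spans, then collapse leftover whitespace.
    text = " ".join(SIGNAL_PATTERN.sub("", response).split())
    return text, signals


def dispatch(
    signals: List[Tuple[str, str]],
    decoders: Dict[str, Callable[[str], str]],
) -> List[str]:
    """Send every signal prompt to the decoder registered for its modality."""
    return [decoders[modality](prompt) for modality, prompt in signals]


# Stub decoders standing in for real image, audio, and video generators.
DECODERS: Dict[str, Callable[[str], str]] = {
    "image": lambda prompt: f"[image generated from: {prompt}]",
    "audio": lambda prompt: f"[audio generated from: {prompt}]",
    "video": lambda prompt: f"[video generated from: {prompt}]",
}

if __name__ == "__main__":
    response = (
        "A dog runs on the beach. "
        "<IMAGE>a dog on a beach</IMAGE> "
        "<AUDIO>waves and barking</AUDIO>"
    )
    text, signals = split_response(response)
    print(text)
    for output in dispatch(signals, DECODERS):
        print(output)
```

The point of the sketch is only the shape of the problem: one response carries any number of modality prompts, so the controller has to handle a list of signals rather than a single "Text + X" pair.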