Abstract: We present LMFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LMFusion leverages the existing weights of Llama-3 for processing text autoregressively while introducing additional, parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and training only the image-specific modules, LMFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LMFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities. We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.
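
The routing described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the names `make_modules` and `routed_layer` are hypothetical, the feedforward block is reduced to a single ReLU layer, normalization layers and the diffusion objective are omitted, and a plain causal mask is assumed because the abstract does not specify the attention mask.

```python
import math
import random


def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_modules(seed, dim):
    """Hypothetical per-modality weights: Q/K/V projections and a one-matrix FFN."""
    rng = random.Random(seed)
    return {
        name: [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        for name in ("q", "k", "v", "ffn")
    }


def routed_layer(tokens, modalities, modules):
    """Route each token to its modality's weights; attend over the shared sequence."""
    # Modality-specific query/key/value projections.
    q = [matvec(modules[m]["q"], x) for x, m in zip(tokens, modalities)]
    k = [matvec(modules[m]["k"], x) for x, m in zip(tokens, modalities)]
    v = [matvec(modules[m]["v"], x) for x, m in zip(tokens, modalities)]
    scale = math.sqrt(len(tokens[0]))

    outputs = []
    for i, (x, m) in enumerate(zip(tokens, modalities)):
        # Shared self-attention: token i attends to tokens 0..i of either modality
        # (causal mask assumed for simplicity).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / scale for j in range(i + 1)]
        weights = softmax(scores)
        attended = [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(len(x))]
        hidden = [a + b for a, b in zip(x, attended)]  # residual connection
        # Modality-specific feedforward (a single ReLU layer here), with residual.
        ffn_out = [max(0.0, h) for h in matvec(modules[m]["ffn"], hidden)]
        outputs.append([h + f for h, f in zip(hidden, ffn_out)])
    return outputs


dim = 4
rng = random.Random(0)
modalities = ["text", "text", "image", "image", "text"]
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in modalities]

# Same text weights in both runs; only the image weights differ.
text_modules = make_modules(seed=1, dim=dim)
out_a = routed_layer(tokens, modalities, {"text": text_modules, "image": make_modules(2, dim)})
out_b = routed_layer(tokens, modalities, {"text": text_modules, "image": make_modules(3, dim)})
```

Under the assumed causal mask, the two leading text tokens produce identical outputs in `out_a` and `out_b`, since they only ever touch text weights. The final text token differs between the two runs because it attends to image keys and values through the shared attention step.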

### What problem does this paper attempt to address?

This paper addresses how to adapt a pretrained text-only large language model (LLM) into a multimodal generative model that can understand and generate both text and images. Specifically, it explores the following issues:

1. **Adding visual understanding and generation while preserving text performance**: The core research question is how to give a pretrained text-only LLM visual understanding and generation capabilities without degrading its text performance. The authors point out that directly fine-tuning a pretrained text-only LLM on multimodal data leads to a significant decline in its language ability.
2. **Reducing the demand for computational resources**: Training multimodal generative models from scratch requires a large amount of compute, especially when the data spans multiple modalities. For example, training a state-of-the-art text-only LLM such as Llama-3 requires processing more than 15 trillion tokens. The authors therefore propose reusing the compute already invested in text-only LLMs and avoiding retraining on text-only data, which significantly reduces the computational requirements.
3. **Improving the efficiency and performance of multimodal models**: The paper proposes the LMFusion framework, which processes text and image data with modality-specific Transformer modules and lets the two modalities interact through shared self-attention layers. Experimental results show that LMFusion performs well on image understanding and generation tasks while fully preserving the text ability of Llama-3. Compared with multimodal models trained from scratch, it achieves better performance with fewer FLOPs (floating-point operations).

### Main contributions of LMFusion

- **Reuse of computational resources**: LMFusion takes the weights of an existing text-only LLM (such as Llama-3), freezes the text modules, and trains only the image modules, which removes the need to retrain on text data.
- **Preservation and transfer of performance**: It fully preserves the text ability of the pretrained LLM while supporting the learning of image understanding and generation.
- **Efficient multimodal model development**: It provides an efficient and effective method for developing multimodal models and shows significant improvements on image understanding and generation tasks.

Through these contributions, LMFusion offers a new direction for multimodal model development, especially when computational resources are limited, and better balances text and image capabilities.
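
The freeze-and-train split listed above can be sketched as a toy optimizer step. This is only an illustration under stated assumptions: the parameter group names (`text.qkv`, `image.ffn`, and so on), the `sgd_step` helper, and the placeholder gradients are all hypothetical, and plain SGD stands in for whatever optimizer the paper actually uses.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, skipping every parameter group that is frozen."""
    updated = {}
    for name, values in params.items():
        if trainable[name]:
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: copied unchanged
    return updated


# Hypothetical parameter groups, named by the modality they serve.
params = {
    "text.qkv": [0.5, -0.2],
    "text.ffn": [0.1, 0.3],
    "image.qkv": [0.0, 0.4],
    "image.ffn": [-0.3, 0.2],
}
# Only the image-specific groups receive updates; text groups stay frozen.
trainable = {name: name.startswith("image.") for name in params}
grads = {name: [1.0, 1.0] for name in params}  # placeholder gradients

updated = sgd_step(params, grads, trainable)
```

Because the text groups never change, any computation that depends only on them behaves exactly as it did in the original text-only model, which is the intuition behind the preserved language performance.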

LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Multimodal Latent Language Modeling with Next-Token Diffusion

Generating Images with Multimodal Language Models

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

InfMLLM: A Unified Framework for Visual-Language Tasks.

Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

ProFuser: Progressive Fusion of Large Language Models

Multimodal Pretraining from Monolingual to Multilingual

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Multimodal Language Analysis with Recurrent Multistage Fusion

Liquid: Language Models are Scalable Multi-modal Generators

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

From Image to Video, what do we need in multimodal LLMs?

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

ModaVerse: Efficiently Transforming Modalities with LLMs

Knowledge Fusion of Large Language Models