LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
2024-12-27
Abstract: We present LMFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LMFusion leverages existing Llama-3's weights for processing texts autoregressively while introducing additional and parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and only training the image-specific modules, LMFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that LMFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities. We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to address?

This paper addresses how to adapt a pretrained text-only large language model (LLM) into a multimodal generative model that can understand and generate both text and images. It examines three issues:

1. **Adding visual understanding and generation while preserving text performance**
   - The core research question is how to give a pretrained text-only LLM visual understanding and generation capabilities without degrading its text performance. The authors point out that directly fine-tuning a pretrained text-only LLM on multimodal data causes a significant decline in its language ability.
2. **Reducing the demand for computational resources**
   - Training multimodal generative models from scratch requires a large amount of compute, especially when the data spans several modalities. For example, training a state-of-the-art text-only LLM such as Llama-3 requires processing more than 15 trillion tokens. The authors therefore reuse the existing pretrained weights, so training on text data does not have to be repeated and the overall compute requirement drops significantly.
3. **Improving the efficiency and performance of multimodal models**
   - The paper proposes the LMFusion framework. It routes text and image data to modality-specific Transformer modules and lets the two modalities interact through shared self-attention layers. Experiments show that LMFusion performs well on image understanding and generation while preserving Llama-3's language capabilities. Compared with multimodal models pretrained from scratch, it improves image understanding by 20% and image generation by 3.6% while using only 50% of the FLOPs (floating-point operations).
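The modality routing described above can be illustrated with a toy single-head layer. This is a minimal sketch and not the authors' implementation: the names `MODULES`, `make_modules`, and `layer`, the tiny hidden size, the ReLU feedforward, the single shared norm gain per modality, and the purely causal attention mask are all assumptions made for illustration. It shows only the structural idea: each token uses its own modality's projections, normalization, and feedforward weights, while attention runs over the full mixed sequence.

```python
import math
import random

random.seed(0)
D = 4  # hidden size, kept tiny for illustration

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def rms_norm(vec, gain):
    scale = math.sqrt(sum(x * x for x in vec) / len(vec) + 1e-6)
    return [g * x / scale for g, x in zip(gain, vec)]

def make_modules():
    """Private weights for one modality (hypothetical layout)."""
    names = ("q", "k", "v", "o", "ffn_in", "ffn_out")
    mods = {name: rand_matrix(D, D) for name in names}
    mods["gain"] = [1.0] * D  # one norm gain reused for both norms, for brevity
    return mods

# One independent module set per modality.
MODULES = {"text": make_modules(), "image": make_modules()}

def layer(tokens, modalities):
    """tokens: list of D-dim vectors; modalities: parallel list of 'text' or 'image'."""
    # 1. Modality-specific normalization and query/key/value projections.
    projected = []
    for x, m in zip(tokens, modalities):
        mods = MODULES[m]
        h = rms_norm(x, mods["gain"])
        projected.append((matvec(mods["q"], h), matvec(mods["k"], h), matvec(mods["v"], h)))

    # 2. Shared self-attention over the whole mixed sequence.
    #    A causal mask is assumed here for simplicity.
    attended = []
    for i, (q_i, _, _) in enumerate(projected):
        scores = [
            sum(a * b for a, b in zip(q_i, projected[j][1])) / math.sqrt(D)
            for j in range(i + 1)
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended.append(
            [sum(w * projected[j][2][d] for j, w in enumerate(weights)) / total
             for d in range(D)]
        )

    # 3. Modality-specific output projection and feedforward, with residuals.
    outputs = []
    for x, ctx, m in zip(tokens, attended, modalities):
        mods = MODULES[m]
        h = [a + b for a, b in zip(x, matvec(mods["o"], ctx))]
        hidden = [max(0.0, u) for u in matvec(mods["ffn_in"], rms_norm(h, mods["gain"]))]
        outputs.append([a + b for a, b in zip(h, matvec(mods["ffn_out"], hidden))])
    return outputs

# Usage: a three-token sequence with an image token between two text tokens.
example_tokens = [[0.1 * (i + j) for j in range(D)] for i in range(3)]
example_out = layer(example_tokens, ["text", "image", "text"])
```

In this sketch a text-only sequence never touches the image weights, so its outputs are unaffected by anything done to `MODULES["image"]`; the modalities influence each other only through step 2.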
### Main contributions of LMFusion

- **Reuse of existing compute**: LMFusion starts from the weights of an existing text-only LLM (Llama-3), freezes the text-specific modules, and trains only the image-specific modules, so training on text data does not need to be repeated.
- **Preservation and transfer of performance**: It preserves the language capabilities of the pretrained LLM while the model learns image understanding and generation.
- **Efficient multimodal model development**: It offers an efficient and effective recipe for building multimodal models and shows significant improvements on image understanding and generation tasks.

Together, these contributions point to a new direction for multimodal model development, particularly when computational resources are limited, and allow a better balance between text and image capabilities.
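The freezing strategy in the first contribution can be sketched as an optimizer step that skips text parameters. This is an illustrative toy, not the paper's training code: the flat `params` dictionary, the `"text."` / `"image."` naming scheme, the hand-written gradients, and the `sgd_step` helper are all hypothetical.

```python
# Hypothetical parameter store: names are prefixed with the owning modality.
params = {
    "text.ffn": [0.5, -0.2],
    "image.ffn": [0.1, 0.3],
}

def sgd_step(params, grads, lr=0.1, trainable_prefix="image."):
    """Return new parameters, updating only names that start with trainable_prefix."""
    updated = {}
    for name, values in params.items():
        if name.startswith(trainable_prefix):
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: copied through unchanged
    return updated

# Usage: gradients exist for every parameter, but only image weights move.
grads = {"text.ffn": [1.0, 1.0], "image.ffn": [1.0, 1.0]}
new_params = sgd_step(params, grads)
```

Because the text weights are never written to, text-only behaviour is preserved by construction in this toy setup.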