EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu
2024-05-17
Abstract: We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with the BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three challenges facing existing multimodal generation models: data efficiency, high-quality image generation, and extensibility. Specifically:

1. **Data Efficiency**: Existing multimodal models usually rely on large-scale training data to align information across modalities, which increases training cost and limits the model's generalization ability. EasyGen simplifies cross-modal alignment by using BiDiffuser (a bidirectional conditional diffusion model), and so trains efficiently with less data.
2. **High-Quality Image Generation**: Many existing multimodal models generate low-quality images, mainly because information can be lost when the text and image embedding spaces are aligned. EasyGen improves image quality with an adapter that aligns the LLM's text space with BiDiffuser's image space, making full use of the LLM's semantic understanding and reasoning ability.
3. **Extensibility**: Existing multimodal models often focus on understanding multimodal content and lack the ability to generate multimodal responses. EasyGen both understands multimodal inputs and generates multimodal responses containing text and images, which makes it more extensible.

EasyGen combines BiDiffuser with an LLM and addresses these problems through the following components:

- **Bidirectional Conditional Diffusion Model (BiDiffuser)**: UniDiffuser is fine-tuned to focus on the specific image-to-text and text-to-image tasks, which improves its performance on them.
- **Projection Layer**: Connects BiDiffuser to the LLM to enable text generation.
- **Adapter**: Injects the LLM's text representation into BiDiffuser to improve image generation quality.
- **Instruction Tuning**: The LLM is instruction-tuned so that it can understand multimodal tasks and generate appropriate responses.

Together, these components give EasyGen significant improvements in data efficiency, image generation quality, and multimodal generation ability.
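The data flow between the projection layer and the adapter described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Linear` class, the dimensions `DIFFUSER_DIM` and `LLM_DIM`, the helper names, and the choice to prepend projected features to the text embeddings are all hypothetical, and the weights are untrained random values.

```python
import random
from typing import List

Vector = List[float]

# Hypothetical toy sizes; the real embedding widths are not given in the text.
DIFFUSER_DIM = 4
LLM_DIM = 6


class Linear:
    """Dense layer y = W x + b with small random (untrained) weights."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.in_dim = in_dim
        self.out_dim = out_dim
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, x: Vector) -> Vector:
        if len(x) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} values, got {len(x)}")
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


# Text generation path: diffusion-side features -> LLM embedding space.
projection = Linear(DIFFUSER_DIM, LLM_DIM, seed=1)
# Image generation path: LLM text representation -> diffusion conditioning space.
adapter = Linear(LLM_DIM, DIFFUSER_DIM, seed=2)


def build_llm_input(
    image_features: List[Vector], text_embeddings: List[Vector]
) -> List[Vector]:
    """Project image features and prepend them to the text token embeddings.

    Prepending is an assumption made for this sketch.
    """
    visual_prefix = [projection(f) for f in image_features]
    return visual_prefix + text_embeddings


def build_diffuser_condition(llm_hidden_states: List[Vector]) -> List[Vector]:
    """Map LLM hidden states into the space the diffusion model conditions on."""
    return [adapter(h) for h in llm_hidden_states]


if __name__ == "__main__":
    image_features = [[0.1] * DIFFUSER_DIM, [0.2] * DIFFUSER_DIM]
    text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]
    llm_input = build_llm_input(image_features, text_embeddings)
    print(len(llm_input), len(llm_input[0]))
```

In this sketch only the two small linear maps would be trained to bridge the models, which is one way to read the data-efficiency claim; whether the actual layers are single linear maps is not stated in the text.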