Abstract: We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at

What problem does this paper attempt to address?

The paper addresses three challenges facing existing multimodal generation models: data efficiency, high-quality image generation, and extensibility. Specifically:

1. **Data Efficiency**: Existing multimodal models usually rely on large-scale training data to align information across modalities, which raises training cost and limits the model's generalization ability. EasyGen simplifies modality alignment by using BiDiffuser (a bidirectional conditional diffusion model), and so trains efficiently with less data.
2. **High-Quality Image Generation**: Many existing multimodal models generate low-quality images, mainly because information can be lost when the text and image embedding spaces are aligned. EasyGen improves image quality with an adapter that aligns the LLM's text space with BiDiffuser's image space, making full use of the LLM's semantic understanding and reasoning ability.
3. **Extensibility**: Existing multimodal models often focus on understanding multimodal content and lack the ability to generate multimodal responses. EasyGen both understands multimodal inputs and generates multimodal responses containing text and images, which makes it more extensible.

EasyGen combines the bidirectional conditional diffusion model (BiDiffuser) with a large language model (LLM) and addresses these problems through the following components:

- **Bidirectional Conditional Diffusion Model (BiDiffuser)**: Obtained by fine-tuning UniDiffuser to focus on the specific image-to-text and text-to-image tasks, which improves performance on both.
- **Projection Layer**: Connects BiDiffuser to the LLM to enable text generation.
- **Adapter**: Injects the LLM's text representation into BiDiffuser to improve image generation quality.
- **Instruction Tuning**: Tunes the LLM on instructions so that it can understand multimodal tasks and generate appropriate responses.

Through these methods, EasyGen achieves significant improvements in data efficiency, image generation quality, and multimodal generation ability.
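The two learned bridges described above, the projection layer (BiDiffuser to LLM) and the adapter (LLM to BiDiffuser), can be pictured as a pair of mappings between embedding spaces. The following is a minimal sketch of that data flow only, not the paper's implementation: the dimensions, the function names, the use of a single untrained linear map for each bridge, and the choice to prepend the projected embeddings to the prompt are all assumptions made for illustration.

```python
import random


def make_linear(in_dim, out_dim, seed):
    """Return a randomly initialised (out_dim x in_dim) weight matrix and a zero bias."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def apply_linear(layer, vectors):
    """Apply y = W x + b to every vector in a sequence."""
    weight, bias = layer
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# Hypothetical sizes, chosen only to keep the example small.
DIFFUSER_DIM = 8   # assumed width of BiDiffuser's text-side embeddings
LLM_DIM = 16       # assumed width of the LLM's token embeddings

projection = make_linear(DIFFUSER_DIM, LLM_DIM, seed=0)  # BiDiffuser -> LLM
adapter = make_linear(LLM_DIM, DIFFUSER_DIM, seed=1)     # LLM -> BiDiffuser


def image_to_llm_input(diffuser_embeddings, prompt_embeddings):
    """Text generation path: project BiDiffuser outputs and prepend them to the prompt."""
    return apply_linear(projection, diffuser_embeddings) + prompt_embeddings


def llm_to_image_condition(llm_hidden_states):
    """Image generation path: map LLM states into BiDiffuser's conditioning space."""
    return apply_linear(adapter, llm_hidden_states)


# Random stand-ins for real embeddings: 4 image-derived vectors, 3 prompt tokens.
rng = random.Random(42)
image_tokens = [[rng.random() for _ in range(DIFFUSER_DIM)] for _ in range(4)]
prompt_tokens = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]

llm_input = image_to_llm_input(image_tokens, prompt_tokens)
condition = llm_to_image_condition(prompt_tokens)

print(len(llm_input), len(llm_input[0]))   # 7 16: sequence fed to the LLM
print(len(condition), len(condition[0]))   # 3 8: conditioning fed to BiDiffuser
```

In this picture only the two small bridge layers need training to connect the models, which is one way to read the data-efficiency claim; the sketch leaves out the training objectives and the instruction-tuning stage entirely.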

EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Multimodal Latent Language Modeling with Next-Token Diffusion

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Contextualized Diffusion Models for Text-Guided Image and Video Generation

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond

LLMs Meet Multimodal Generation and Editing: A Survey

LLMGA: Multimodal Large Language Model based Generation Assistant

Diffusion Models For Multi-Modal Generative Modeling

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

DiffusionGPT: LLM-Driven Text-to-Image Generation System

Collaborative Diffusion for Multi-Modal Face Generation and Editing