Abstract: The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling the multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even vision-language simultaneous generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.

What problem does this paper attempt to address?

Existing generative models handle only one kind of task at a time: either modality conversion (such as text-to-image) or unconditional generation in a single modality (such as unconditional image generation). They cannot generate signals in several modalities simultaneously, nor do they learn the joint distribution between modalities. To address this, the authors propose UniD3 (Unified Discrete Denoising Diffusion model), a unified multi-modality generation model that performs both "modality conversion" and "multi-modality generation" within one framework. Specifically, UniD3 can generate content conditioned on text, on an image, or on both, and can also generate cross-modality results without any conditioning signal.

### Main contributions

1. **Unified transition matrix**: A dedicated Markov transition matrix is designed for the discrete denoising diffusion model so that the joint distribution of language and image can be estimated. According to the authors, designing the transition matrix around task objectives and data characteristics is a first for discrete diffusion models.
2. **Mutual attention mechanism and fusion embedding layer**: A mutual attention module with a fusion embedding layer is proposed to integrate the modalities, and the unified objective function is modified to provide more concise constraints.
3. **Simultaneous multi-modality generation and modality conversion**: UniD3 is presented as the first model that can handle both unconditional vision-language generation and bidirectional vision-language synthesis.

### Method overview

The core of UniD3 lies in two aspects:

- **Unified diffusion process**: A unified transition matrix captures the global association between different modalities.
- **Denoising function**: A Transformer architecture with a mutual attention mechanism serves as the denoising function, and a fusion embedding layer handles the multi-modality signals.

### Experimental results

Experiments show that UniD3 generates high-quality image-text pairs and achieves performance comparable to the best existing methods in both multi-modality generation and modality conversion tasks.

### Formula summary

- **Transition matrix**:

\[
Q[t-1 \rightarrow t] =
\begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \cdots & 0 \\
\beta_t & \alpha_t + \beta_t & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_t & \gamma_t & \cdots & 1
\end{bmatrix}
\]

where \(\alpha_t\) is the probability of keeping the token, \(\gamma_t\) is the probability of being replaced with the [MASK] token, and \(\beta_t = (1 - \alpha_t - \gamma_t)/K\) is the probability of diffusing to other states. Here \(K\) is the number of non-[MASK] token states, and the last row and column correspond to [MASK], so each column sums to 1.

- **Loss function**:

\[
L_0 = -\mathbb{E}_{q(x_1 \mid x_0)}\left[\log p_\theta(x_{img}^0 \mid x_1, x_{txt}^0) + \log p_\theta(x_{txt}^0 \mid x_1, x_{img}^0)\right]
\]

\[
L_{t-1} = \mathbb{E}_{q(x_t \mid x_0)}\left[D_{KL}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, [p_\theta(x_{img}^{t-1} \mid x_t);\ p_\theta(x_{txt}^{t-1} \mid x_t)]\right)\right]
\]

With these components, UniD3 achieves strong performance in multi-modality generation tasks and suggests new directions for future multi-modality research.
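
To make the transition matrix in the formula summary concrete, here is a minimal NumPy sketch. It is an illustration of the single matrix shown above, not the authors' implementation: the function names `transition_matrix` and `diffuse_step`, the choice of index `K` for [MASK], and the column convention (column `j` holds the distribution over the next state given current state `j`) are all assumptions made for this example.

```python
import numpy as np


def transition_matrix(alpha: float, gamma: float, K: int) -> np.ndarray:
    """Build the (K+1) x (K+1) one-step matrix; index K is assumed to be [MASK]."""
    beta = (1.0 - alpha - gamma) / K
    Q = np.full((K + 1, K + 1), beta)
    idx = np.arange(K)
    Q[idx, idx] = alpha + beta   # keep the current token
    Q[K, :] = gamma              # any ordinary token may be replaced by [MASK]
    Q[:, K] = 0.0                # a [MASK] token never becomes an ordinary token
    Q[K, K] = 1.0                # [MASK] is absorbing
    return Q


def diffuse_step(x_prev: np.ndarray, Q: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample x_t from x_{t-1}: column x of Q is the distribution over next states."""
    n_states = Q.shape[0]
    return np.array([rng.choice(n_states, p=Q[:, x]) for x in x_prev])


if __name__ == "__main__":
    Q = transition_matrix(alpha=0.9, gamma=0.05, K=4)
    rng = np.random.default_rng(0)
    tokens = np.array([0, 1, 2, 3, 4])  # the last entry is already [MASK]
    print(Q.sum(axis=0))                # every column sums to 1
    print(diffuse_step(tokens, Q, rng))
```

Each ordinary column sums to \(\alpha_t + K\beta_t + \gamma_t = 1\), which is why \(\beta_t\) is divided by \(K\). The unified matrix described in the paper covers both image and text tokens; this sketch covers only the single-vocabulary form given in the summary.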

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Multimodal Latent Language Modeling with Next-Token Diffusion

Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond

EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Diffusion Models For Multi-Modal Generative Modeling

Contextualized Diffusion Models for Text-Guided Image and Video Generation

Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

Unified Generative and Discriminative Training for Multi-modal Large Language Models

Collaborative Diffusion for Multi-Modal Face Generation and Editing

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models