Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Minghui Hu, Chuanxia Zheng, Heliang Zheng, Tat-Jen Cham, Chaoyue Wang, Zuopeng Yang, Dacheng Tao, Ponnuthurai N. Suganthan
DOI: https://doi.org/10.48550/arXiv.2211.14842
2022-11-27
Abstract: The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling the multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even vision-language simultaneous generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a limitation of existing generative models: they handle only one kind of task at a time, either modality translation (such as text-to-image) or unconditional generation of a single modality (such as unconditional image generation). They cannot generate several modalities at once, and they do not learn the joint distribution across modalities. To address this, the authors propose UniD3 (Unified Discrete Denoising Diffusion model), a unified multi-modality generation model that performs both "modality translation" and "multi-modality generation" within one framework. Specifically, UniD3 can generate content conditioned on text, on an image, or on both, and it can also generate paired image-text results without any conditioning signal.

### Main contributions

1. **Unified transition matrix**: A dedicated Markov transition matrix is designed for the discrete denoising diffusion model so that the joint distribution of language and image can be estimated. According to the authors, this is the first transition matrix for discrete diffusion models that is designed around the task objective and the characteristics of the data.
2. **Mutual attention mechanism and fused embedding layer**: A mutual attention module with a fused embedding layer is proposed to integrate the modalities, and the unified objective function is modified to provide more concise constraints.
3. **Simultaneous multi-modality generation and modality translation**: UniD3 is presented as the first model that handles both unconditional vision-language generation and bidirectional vision-language synthesis.

### Method overview

The core of UniD3 lies in two aspects:

- **Unified diffusion process**: A unified transition matrix captures the global association between the modalities.
- **Denoising function**: A Transformer with a mutual attention mechanism serves as the denoising function, and a fused embedding layer handles the multi-modality signals.

### Experimental results

Experiments show that UniD3 generates high-quality image-text pairs and performs comparably to the best existing methods in both multi-modality generation and modality translation.

### Formula summary

- **Transition matrix**:

\[
Q[t-1 \rightarrow t] =
\begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \cdots & 0 \\
\beta_t & \alpha_t + \beta_t & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_t & \gamma_t & \cdots & 1
\end{bmatrix}
\]

where \(\alpha_t\) is the probability of keeping the token, \(\gamma_t\) is the probability of replacing it with the [MASK] token, and \(\beta_t = (1 - \alpha_t - \gamma_t)/K\) is the probability of diffusing to each of the other states, with \(K\) the number of non-mask token states. The last column shows that [MASK] is absorbing: once a token is masked, it stays masked.

- **Loss function**:

\[
L_0 = -\mathbb{E}_{q(x_1 \mid x_0)}\left[\log p_\theta(x_{img}^{0} \mid x_1, x_{txt}^{0}) + \log p_\theta(x_{txt}^{0} \mid x_1, x_{img}^{0})\right]
\]

\[
L_{t-1} = \mathbb{E}_{q(x_t \mid x_0)}\left[D_{KL}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, \left[p_\theta(x_{img}^{t-1} \mid x_t);\ p_\theta(x_{txt}^{t-1} \mid x_t)\right]\right)\right]
\]

Together, these components let a single model handle both multi-modality generation and modality translation, and they suggest directions for future multi-modality research.
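To make the transition matrix in the formula summary concrete, the following minimal sketch builds the \((K+1)\times(K+1)\) matrix exactly as written there and applies one forward step to a distribution over token states. It is an illustration under stated assumptions, not the authors' implementation: the function names `mask_and_replace_matrix` and `step`, the choice of `K = 4`, and the values of \(\alpha_t\) and \(\gamma_t\) are all hypothetical, and the sketch uses a single vocabulary rather than attempting the paper's treatment of separate image and text tokens.

```python
from fractions import Fraction


def mask_and_replace_matrix(K, alpha, gamma):
    """Build Q[t-1 -> t] as written in the formula summary.

    Column j is the distribution of x_t given x_{t-1} = j.
    Indices 0..K-1 are ordinary token states; index K is [MASK].
    """
    beta = (1 - alpha - gamma) / K
    size = K + 1
    Q = [[0] * size for _ in range(size)]
    for j in range(K):  # ordinary (non-mask) source states
        for i in range(K):
            Q[i][j] = alpha + beta if i == j else beta
        Q[K][j] = gamma  # probability of being replaced by [MASK]
    Q[K][K] = 1  # [MASK] is absorbing
    return Q


def step(Q, probs):
    """Push a distribution over states one step forward (Q @ probs)."""
    n = len(probs)
    return [sum(Q[i][j] * probs[j] for j in range(n)) for i in range(len(Q))]


# Hypothetical values chosen only for illustration.
K = 4
alpha_t = Fraction(7, 10)
gamma_t = Fraction(1, 10)
Q = mask_and_replace_matrix(K, alpha_t, gamma_t)

# Every column should sum to 1, since each column is a distribution.
column_sums = [sum(Q[i][j] for i in range(K + 1)) for j in range(K + 1)]

# One forward step starting from token 0 with certainty.
one_hot = [1, 0, 0, 0, 0]
print(step(Q, one_hot))
```

With these illustrative values, \(\beta_t = 1/20\), so a token that starts in state 0 stays there with probability \(\alpha_t + \beta_t = 3/4\), moves to each of the other three ordinary states with probability \(1/20\), and becomes [MASK] with probability \(1/10\). Exact fractions are used so that the column sums come out to exactly 1 rather than to a floating-point approximation.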