GlyphDiffusion: Text Generation as Image Generation

Junyi Li, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen
2023-05-08
Abstract: Diffusion models have become a new generative paradigm for text generation. Considering the discrete categorical nature of text, in this paper, we propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation. Our key idea is to render the target text as a glyph image containing visual language content. In this way, conditional text generation can be cast as a glyph image generation task, and it is then natural to apply continuous diffusion models to discrete texts. Specifically, we utilize a cascaded architecture (i.e., a base and a super-resolution diffusion model) to generate high-fidelity glyph images, conditioned on the input text. Furthermore, we design a text grounding module to transform and refine the visual language content from generated glyph images into the final texts. In experiments over four conditional text generation tasks and two classes of metrics (i.e., quality and diversity), GlyphDiffusion can achieve comparable or even better results than several baselines, including pretrained language models. Our model also makes significant improvements compared to the recent diffusion model.
Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of applying diffusion models to conditional text generation in natural language processing. Diffusion models have been very successful at generating continuous data such as images and audio, but applying them to discrete text remains difficult. Existing diffusion-based text generators either define a discrete diffusion process or represent text as continuous embedding vectors. Both approaches have limitations, such as training instability and collapse of the loss function.

To overcome these challenges, the paper proposes GlyphDiffusion, which renders the target text as an image containing visual language content (a glyph image). Conditional text generation thereby becomes a glyph image generation task, so a continuous diffusion model can be applied directly and the issues of the earlier approaches are avoided. GlyphDiffusion consists of the following steps:

1. **Text Rendering**: Render the target text as a glyph image, so that each image contains the visual form of the text content.
2. **Conditional Encoding**: Use a pre-trained language model (such as T5) to encode the input text, capturing its semantic information.
3. **Text-Guided Glyph Image Diffusion**: Employ a cascaded architecture (a base diffusion model followed by a super-resolution diffusion model) to generate high-fidelity glyph images conditioned on the encoded input.
4. **Text Grounding**: Use a text grounding module to transform and refine the visual language content of the generated glyph images into the final text output.

With these steps, GlyphDiffusion achieves results comparable to or better than existing baselines on four conditional text generation tasks, measured by quality metrics (such as BLEU and ROUGE-L) and diversity metrics (such as Distinct and Diverse-4).
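The core idea, that text rendered as an image becomes continuous data a standard diffusion process can act on, can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the 3x5 bitmap `FONT`, the helper names `render_glyph_image`, `alpha_bar`, and `add_noise`, and the linear noise schedule (the common DDPM-style default) are all hypothetical. The sketch covers only rendering and the forward noising step; the learned denoiser, the super-resolution stage, and the text grounding module are omitted.

```python
import math
import random

# Hypothetical 3x5 bitmap font covering only the characters used below.
FONT = {
    "h": ["#.#", "#.#", "###", "#.#", "#.#"],
    "i": ["###", ".#.", ".#.", ".#.", "###"],
    " ": ["...", "...", "...", "...", "..."],
}


def render_glyph_image(text):
    """Render text as a 5-row grid of floats: +1.0 for ink, -1.0 for background."""
    rows = [[] for _ in range(5)]
    for ch in text:
        bitmap = FONT[ch]
        for r in range(5):
            rows[r].extend(1.0 if px == "#" else -1.0 for px in bitmap[r])
            rows[r].append(-1.0)  # one-pixel gap after each character
    return rows


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Product of (1 - beta_s) for s = 1..t under an assumed linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(image, t, rng):
    """Sample x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, where a = alpha_bar(t)."""
    a = alpha_bar(t)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * px + noise * rng.gauss(0.0, 1.0) for px in row]
            for row in image]


rng = random.Random(0)
x0 = render_glyph_image("hi")      # clean glyph image
x_mid = add_noise(x0, 200, rng)    # partially noised
x_end = add_noise(x0, 1000, rng)   # almost pure Gaussian noise
```

Because the pixels are real-valued, Gaussian noise can be added to them directly, with no discrete diffusion process or embedding step. A trained model would learn to reverse this corruption, conditioned on the encoded input text.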