Abstract: Diffusion models have gained significant attention in the realm of image generation due to their exceptional performance. Their success has been recently expanded to text generation via generating all tokens within a sequence concurrently. However, natural language exhibits a far more pronounced sequential dependency in comparison to images, and the majority of existing language models are trained with a left-to-right auto-regressive approach. To account for the inherent sequential characteristic of natural language, we introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved through employing a dynamic number of denoising steps that vary based on token position. This results in tokens on the left undergoing fewer denoising steps than those on the right, thereby enabling them to generate earlier and subsequently influence the generation of tokens on the right. In a series of experiments on various text generation tasks, including text summarization, machine translation, and common sense generation, AR-Diffusion clearly demonstrated its superiority over existing diffusion language models and that it can be $100\times\sim600\times$ faster when achieving comparable results. Our code is available at https://github.com/microsoft/ProphetNet/tree/master/AR-diffusion.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address two main issues in natural language generation tasks:

1. **Sequential Dependency**: Existing diffusion models for text generation typically use a Non-Auto-Regressive (NAR) approach, generating all tokens in the sequence simultaneously. However, natural language has strong sequential dependency, and most existing language models are trained with a left-to-right Auto-Regressive (AR) approach, which lets them capture contextual information and sequential relationships when generating text.
2. **Trade-off Between Decoding Speed and Performance**: Although non-auto-regressive methods decode faster than auto-regressive methods, they sacrifice positional dependency between tokens, which lowers generation quality. Improving decoding speed while maintaining high generation quality is therefore an important research direction.

To address these issues, the paper proposes the **Auto-Regressive Diffusion (AR-Diffusion)** model. AR-Diffusion ensures that the generation of right-side tokens depends on the already generated left-side tokens by using a dynamic number of denoising steps that varies with token position. Specifically, the left-side tokens undergo fewer denoising steps, so they are generated earlier and can influence the generation of right-side tokens. This mechanism retains the advantages of auto-regressive models while also incorporating the efficiency of diffusion models.

### Experimental Results

The paper conducts experiments on multiple text generation tasks, including text summarization, machine translation, and commonsense generation. The results show that AR-Diffusion outperforms existing diffusion language models on these tasks, and that it can decode 100 to 600 times faster than existing models while achieving comparable generation quality. Moreover, even in the extreme case of only two-step decoding, AR-Diffusion still performs well, demonstrating its potential in practical applications.
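The position-dependent denoising idea described above can be illustrated with a toy schedule. The following is a minimal sketch, not the paper's actual schedule: the function names `token_timesteps` and `schedule`, the `lag` parameter, and the linear clipped form are all assumptions made for illustration. At every tick of a shared clock, each token gets its own noise level, and tokens further to the right are held at a higher noise level for longer, so left tokens reach the fully denoised state (timestep 0) first.

```python
def token_timesteps(tick, seq_len, max_t, lag):
    """Per-token noise level at a given clock tick (hypothetical schedule).

    Token n is delayed by n * lag ticks relative to token 0, and its
    timestep is clipped to the range [0, max_t].
    """
    return [min(max_t, max(0, max_t - tick + n * lag)) for n in range(seq_len)]


def schedule(seq_len, max_t, lag):
    """All per-token timesteps from pure noise until every token reaches 0."""
    total_ticks = max_t + (seq_len - 1) * lag
    return [token_timesteps(k, seq_len, max_t, lag) for k in range(total_ticks + 1)]


# Toy example: 4 tokens, 3 noise levels, each token lagging 1 tick behind its left neighbour.
sched = schedule(seq_len=4, max_t=3, lag=1)

# First tick at which each token is fully denoised (timestep 0).
finish_tick = [next(k for k, row in enumerate(sched) if row[n] == 0) for n in range(4)]

for k, row in enumerate(sched):
    print(k, row)
print("finish ticks:", finish_tick)
```

In this toy run the leftmost token is finished at tick 3 and the rightmost at tick 6, so at any moment a token on the right is at least as noisy as every token to its left. A real denoiser would consume these per-token timesteps when predicting the clean sequence; that part is omitted here.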

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation

Emage: Non-Autoregressive Text-to-Image Generation

Diffusion-NAT: Self-Prompting Discrete Diffusion for Non-Autoregressive Text Generation

Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise

Diffusion Models for Non-autoregressive Text Generation: A Survey

Diffusion models in text generation: a survey

Self-conditioned Embedding Diffusion for Text Generation

InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation

A Reparameterized Discrete Diffusion Model for Text Generation

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Utilizing Latent Diffusion Model to Accelerate Sampling Speed and Enhance Text Generation Quality

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Can Diffusion Model Achieve Better Performance in Text Generation? Bridging the Gap Between Training and Inference!

Energy-Based Diffusion Language Models for Text Generation

GlyphDiffusion: Text Generation as Image Generation

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation