Abstract: The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their inferior performance compared to diffusion models. In this paper, we aim to re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies. Specifically, we identify the complexities in properly configuring these strategies and indicate the possible sub-optimality in existing heuristic-driven designs. Recognizing this, we propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework. The resulting method, named AutoNAT, advances the performance boundaries of NATs notably, and is able to perform comparably with the latest diffusion models at a significantly reduced inference cost. The effectiveness of AutoNAT is validated on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M. Our code is available at https://github.com/LeapLabTHU/ImprovedNAT.

What problem does this paper attempt to address?

This paper aims to re-evaluate the potential of non-autoregressive Transformers (NATs) in image generation tasks and propose an automatic optimization method to improve their training and inference strategies. Specifically, the paper addresses the following two main issues:

1. **Trade-off between computational efficiency and generation quality**:
   - Although diffusion models have achieved remarkable success in image generation, their computational complexity is high, resulting in significant latency and energy consumption in practical applications.
   - NATs have a clear advantage in inference speed, but their generation quality is usually not as good as that of diffusion models.
2. **Limitations of existing NAT designs**:
   - Existing NATs rely on heuristically designed strategies during training and generation, and these strategies may not be optimal, thus limiting the performance of NATs.
   - The paper points out that the parallel decoding mechanism of NATs introduces complex configuration challenges, such as how many tokens to decode at each step, which tokens to choose for decoding, and how to sample tokens from the VQ codebook.

To solve these problems, the authors propose a new method called AutoNAT, which searches for the optimal training and generation strategies through an automatic optimization framework. Specifically, AutoNAT improves NATs in the following ways:

- **Unified optimization framework**: Transforms the design problems of training and generation strategies into a unified optimization problem and directly solves for the optimal solution, rather than relying solely on heuristic rules.
- **Alternating optimization algorithm**: Designs optimization sub-problems for training strategies and generation strategies respectively, and gradually improves model performance through alternating optimization.
Through this method, AutoNAT not only significantly improves the generation quality of NATs but also greatly reduces the inference cost, achieving performance comparable to the latest diffusion models on multiple benchmark datasets while increasing the inference speed by about 5 times.

### Formula summary

Some of the key formulas mentioned in the paper are as follows:

- **Re-masking ratio scheduling function** \(r(t)\) and temperature scheduling functions \(\tau_1(t), \tau_2(t)\):
  \[ r(t)=\cos\left(\frac{\pi t}{2T}\right) \]
  \[ \tau_1(t) = 1.0 \]
  \[ \tau_2(t)=\frac{\lambda(T - t + 1)}{T} \]
- **Guidance scale scheduling function** \(s(t)\):
  \[ s(t)=\frac{k t}{T} \]
- **Mask ratio distribution** \(p(r)\):
  \[ p(r)=\frac{2}{\pi}\cdot\frac{1}{\sqrt{1 - r^2}} \]

These formulas control different parameters in the generation and training processes.

### Conclusion

In general, this paper improves the generation quality and inference efficiency of NATs by re-examining their design and proposing an automatic optimization method, enabling them to reach a performance level comparable to that of diffusion models while maintaining high efficiency.
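To make the scheduling functions from the formula summary concrete, the following is a minimal Python sketch that evaluates \(r(t)\), \(\tau_1(t)\), \(\tau_2(t)\), and \(s(t)\) step by step. The function names, the step range \(t = 1, \dots, T\), and the example values of `T`, `lam` (for \(\lambda\)), and `k` are assumptions made for illustration; they are not taken from the paper or its code release.

```python
import math


def remask_ratio(t: int, T: int) -> float:
    """r(t) = cos(pi * t / (2T)): re-masking ratio at step t."""
    return math.cos(math.pi * t / (2 * T))


def tau1(t: int, T: int) -> float:
    """tau_1(t) = 1.0: a constant temperature schedule."""
    return 1.0


def tau2(t: int, T: int, lam: float) -> float:
    """tau_2(t) = lam * (T - t + 1) / T: linearly decaying temperature."""
    return lam * (T - t + 1) / T


def guidance_scale(t: int, T: int, k: float) -> float:
    """s(t) = k * t / T: linearly increasing guidance scale."""
    return k * t / T


def schedule_table(T: int, lam: float, k: float) -> list[dict]:
    """Evaluate every schedule for t = 1..T (step range is an assumption)."""
    return [
        {
            "t": t,
            "r": remask_ratio(t, T),
            "tau1": tau1(t, T),
            "tau2": tau2(t, T, lam),
            "s": guidance_scale(t, T, k),
        }
        for t in range(1, T + 1)
    ]


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of each schedule.
    for row in schedule_table(T=8, lam=1.0, k=3.0):
        print(row)
```

Under these formulas, the re-masking ratio falls from near 1 toward 0 as \(t\) approaches \(T\), \(\tau_2\) decays linearly from \(\lambda\) to \(\lambda / T\), and the guidance scale grows linearly up to \(k\). These are the hand-designed forms listed in the summary, not the strategies found by AutoNAT's search.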

Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

Emage: Non-Autoregressive Text-to-Image Generation

ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation

NAT: Neural Architecture Transformer for Accurate and Compact Architectures

Optimizing Non-Autoregressive Transformers with Contrastive Learning

On the Learning of Non-Autoregressive Transformers.

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model.

Differentiable Neural Architecture Transformation for Reproducible Architecture Improvement

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict Aware Supernet Training

Accelerating Vision Diffusion Transformers with Skip Branches

PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Non-Autoregressive Machine Translation with Auxiliary Regularization

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

Effective Diffusion Transformer Architecture for Image Super-Resolution

DiffiT: Diffusion Vision Transformers for Image Generation

Dynamic Diffusion Transformer

Self-Improvement of Non-autoregressive Model Via Sequence-Level Distillation

StraIT: Non-autoregressive Generation with Stratified Image Transformer

SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model