Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang
2024-06-08
Abstract: The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their inferior performance compared to diffusion models. In this paper, we aim to re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies. Specifically, we identify the complexities in properly configuring these strategies and indicate the possible sub-optimality in existing heuristic-driven designs. Recognizing this, we propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework. The resulting method, named AutoNAT, advances the performance boundaries of NATs notably, and is able to perform comparably with the latest diffusion models at a significantly reduced inference cost. The effectiveness of AutoNAT is validated on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M. Our code is available at https://github.com/LeapLabTHU/ImprovedNAT.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper re-evaluates the potential of non-autoregressive Transformers (NATs) for image generation and proposes an automatic optimization method to improve their training and inference strategies. Specifically, it addresses two main issues:

1. **Trade-off between computational efficiency and generation quality**:
   - Although diffusion models have achieved remarkable success in image generation, their high computational cost leads to significant latency and energy consumption in practical applications.
   - NATs have a clear advantage in inference speed, but their generation quality usually falls short of that of diffusion models.
2. **Limitations of existing NAT designs**:
   - Existing NATs rely on heuristically designed strategies for training and generation. These strategies may be sub-optimal and therefore limit the performance of NATs.
   - The parallel decoding mechanism of NATs introduces complex configuration challenges, such as how many tokens to decode at each step, which tokens to choose for decoding, and how to sample tokens from the VQ codebook.

To solve these problems, the authors propose AutoNAT, which searches for the optimal training and generation strategies through an automatic optimization framework. Specifically, AutoNAT improves NATs in the following ways:

- **Unified optimization framework**: Casts the design of training and generation strategies as a single optimization problem and solves for the optimum directly, rather than relying solely on heuristic rules.
- **Alternating optimization algorithm**: Defines separate optimization sub-problems for the training strategy and the generation strategy, and improves model performance step by step by alternating between them.


Through this method, AutoNAT significantly improves the generation quality of NATs while keeping the inference cost low: it achieves performance comparable to the latest diffusion models on multiple benchmark datasets at about 5 times the inference speed.

### Formula summary

Some of the key formulas mentioned in the paper are listed below, where \(t\) is the current generation step, \(T\) is the total number of steps, and \(\lambda\) and \(k\) are scalar hyperparameters:

- **Re-masking ratio scheduling function** \(r(t)\) and temperature scheduling functions \(\tau_1(t), \tau_2(t)\):
  \[ r(t)=\cos\left(\frac{\pi t}{2T}\right) \]
  \[ \tau_1(t) = 1.0 \]
  \[ \tau_2(t)=\frac{\lambda(T - t + 1)}{T} \]
- **Guidance scale scheduling function** \(s(t)\):
  \[ s(t)=\frac{k t}{T} \]
- **Mask ratio distribution** \(p(r)\), for \(r \in [0, 1)\):
  \[ p(r)=\frac{2}{\pi}\left(1 - r^2\right)^{-1/2} \]

These functions control the generation process (\(r\), \(\tau_1\), \(\tau_2\), \(s\)) and the training process (\(p\)).

### Conclusion

In general, this paper improves the generation quality and inference efficiency of NATs by re-examining their design and proposing an automatic optimization method, enabling them to reach a performance level comparable to that of diffusion models while maintaining high efficiency.
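
To make the generation-time schedules from the formula summary concrete, the following minimal Python sketch evaluates \(r(t)\), \(\tau_1(t)\), \(\tau_2(t)\), and \(s(t)\) at every step. It is an illustration only, not the authors' implementation: the function names, the assumption that steps run from \(t = 1\) to \(t = T\), and the example values of `T`, `lam`, and `k` are all hypothetical choices made here.

```python
import math


def remask_ratio(t: int, T: int) -> float:
    """r(t) = cos(pi * t / (2T)): fraction of tokens re-masked after step t."""
    return math.cos(math.pi * t / (2 * T))


def tau1(t: int) -> float:
    """tau_1(t) = 1.0: constant temperature."""
    return 1.0


def tau2(t: int, T: int, lam: float) -> float:
    """tau_2(t) = lam * (T - t + 1) / T: decays linearly from lam to lam / T."""
    return lam * (T - t + 1) / T


def guidance_scale(t: int, T: int, k: float) -> float:
    """s(t) = k * t / T: grows linearly from k / T to k."""
    return k * t / T


def schedule_table(T: int, lam: float, k: float) -> list[dict]:
    """Evaluate all four schedules for t = 1..T (assumed step range)."""
    return [
        {
            "t": t,
            "r": remask_ratio(t, T),
            "tau1": tau1(t),
            "tau2": tau2(t, T, lam),
            "s": guidance_scale(t, T, k),
        }
        for t in range(1, T + 1)
    ]


if __name__ == "__main__":
    # T, lam, and k are arbitrary illustrative values, not taken from the paper.
    for row in schedule_table(T=8, lam=2.0, k=3.0):
        print(
            f"t={row['t']}  r={row['r']:.3f}  tau1={row['tau1']:.1f}  "
            f"tau2={row['tau2']:.3f}  s={row['s']:.3f}"
        )
```

Under these assumptions the re-masking ratio falls to zero at the last step, so no tokens remain masked, while the guidance scale rises linearly to `k` and \(\tau_2\) decays linearly from `lam`.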