Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

Maciej Kilian, Varun Jampani, Luke Zettlemoyer
2024-05-24
Abstract: Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute-controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next-token prediction is by far the most efficient. Based on our findings, we recommend diffusion for applications targeting image quality and low latency, and next-token prediction when prompt following or throughput is more important.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper examines the trade-offs between compute cost and performance for three Transformer-based image generation approaches: diffusion, masked-token prediction, and next-token prediction. It compares their training and inference efficiency and recommends an approach for different application scenarios.

The main contributions of the paper include:

1. **Direct Comparison**: A direct, compute-controlled comparison of the three approaches, which the authors state had not previously been done.
2. **Computational Budget Analysis**: The scalability of each approach is analyzed as a function of compute budget, measured in FLOPs.
3. **Experimental Findings**:
   - At smaller compute budgets, next-token prediction gives the best image quality, but scaling trends suggest that diffusion eventually matches it.
   - Token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following.
   - On inference compute efficiency, next-token prediction is by far the most efficient of the three.
4. **Recommendations**:
   - Diffusion is recommended for applications targeting image quality and low latency.
   - Next-token prediction is recommended when prompt following or throughput is more important.

Additionally, the study explores the impact of autoencoders on generation results, the effectiveness of different conditional input methods, and the role of training practices such as Exponential Moving Average (EMA). Together, these analyses are intended to help researchers and developers choose an image generation approach for a given compute budget and application.
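
To make the idea of a compute budget measured in FLOPs concrete, here is a minimal sketch of how such a budget is often estimated for a dense Transformer. It assumes the widely used rule of thumb that training costs roughly 6 FLOPs per parameter per token; that approximation, and the function names `training_flops` and `tokens_for_budget`, are illustrative assumptions and are not taken from the paper, whose own FLOP accounting may differ.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Estimate training compute with the 6 * N * D rule of thumb.

    Assumption (not from the paper): about 2 FLOPs per parameter per token
    for the forward pass and about 4 for the backward pass.
    """
    return 6.0 * num_params * num_tokens


def tokens_for_budget(budget_flops: float, num_params: float) -> float:
    """Invert the same rule: how many training tokens fit in a fixed budget."""
    return budget_flops / (6.0 * num_params)


if __name__ == "__main__":
    # Hypothetical model sizes compared at one fixed compute budget.
    budget = 1e20
    for n in (1e8, 1e9):
        print(f"{n:.0e} params -> {tokens_for_budget(budget, n):.2e} tokens")
```

Under a fixed budget, a larger model sees proportionally fewer training tokens, which is why comparisons across approaches are only meaningful when compute, rather than model size or step count alone, is held equal.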