Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

Maciej Kilian, Varun Jampani, Luke Zettlemoyer

2024-05-24

Abstract: Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute-controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next-token prediction is by far the most efficient. Based on our findings, we recommend diffusion for applications targeting image quality and low latency, and next-token prediction when prompt following or throughput is more important.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper explores the trade-offs between computational cost and performance for three Transformer-based image generation methods: diffusion, masked-token prediction, and next-token prediction. The goal is to compare these methods in terms of training and inference efficiency and to recommend the most suitable method for different application scenarios. The main contributions of the paper include:

1. **Direct Comparison**: For the first time, a direct, compute-controlled comparison of these three methods was conducted.
2. **Computational Budget Analysis**: The scalability of each method was analyzed as a function of compute budget (measured in FLOPs).
3. **Experimental Findings**:
   - At smaller compute budgets, next-token prediction provides the best image quality, but as compute increases, diffusion gradually catches up.
   - Token prediction methods (especially next-token prediction) excel at prompt following.
   - In inference compute efficiency, next-token prediction far surpasses the other two methods.
4. **Recommendations**:
   - Diffusion is recommended for applications targeting high image quality and low latency.
   - Next-token prediction is recommended for applications that require good prompt following or high throughput.

Additionally, the study explores the impact of autoencoders on generation results, the effectiveness of different conditioning methods, and the role of training practices such as Exponential Moving Average (EMA). These analyses are intended to help researchers and developers make more informed decisions when choosing an image generation method.
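To make the idea of a compute budget "measured in FLOPs" concrete, here is a minimal sketch. It assumes the widely used rule of thumb that training a dense Transformer costs roughly 6 FLOPs per parameter per training token (forward plus backward pass, attention cost ignored). This approximation, the function names, and the example budget and model sizes are illustrative assumptions, not the paper's accounting method or its experimental settings.

```python
def training_flops(n_params: int, n_tokens: int) -> float:
    """Approximate training cost with the 6 * N * D rule of thumb.

    Assumption: a dense Transformer with attention FLOPs ignored.
    """
    return 6.0 * n_params * n_tokens


def tokens_for_budget(budget_flops: float, n_params: int) -> float:
    """Number of training tokens a fixed FLOP budget pays for."""
    return budget_flops / (6.0 * n_params)


if __name__ == "__main__":
    budget = 1e19  # hypothetical budget, chosen only for illustration
    for n_params in (100_000_000, 400_000_000, 1_600_000_000):
        tokens = tokens_for_budget(budget, n_params)
        print(f"{n_params:>13,d} params -> {tokens:.3e} tokens at {budget:.0e} FLOPs")
```

Under this approximation, holding the budget fixed means a larger model sees proportionally fewer training tokens, which is why comparing methods at matched FLOPs rather than matched model size or step count can change which approach looks best.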

Motion Guided Token Compression for Efficient Masked Video Modeling

Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Token Merging for Fast Stable Diffusion

M2T: Masking Transformers Twice for Faster Decoding

An Image is Worth 32 Tokens for Reconstruction and Generation

Importance-based Token Merging for Diffusion Models

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

Accelerating Diffusion Transformers with Token-wise Feature Caching

MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Lazy Diffusion Transformer for Interactive Image Editing

Cross-view Masked Diffusion Transformers for Person Image Synthesis

Token Caching for Diffusion Transformer Acceleration

FlexDiT: Dynamic Token Density Control for Diffusion Transformer

EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching

Efficiency-optimized Video Diffusion Models

Not All Steps Are Created Equal: Selective Diffusion Distillation for Image Manipulation