Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
2024-10-03
Abstract: The current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing studies, Jacobi decoding, an iterative parallel decoding algorithm, has been used to accelerate the auto-regressive generation and can be executed without training. However, the Jacobi decoding relies on a deterministic criterion to determine the convergence of iterations. Thus, it works for greedy decoding but is incompatible with sampling-based decoding which is crucial for visual quality and diversity in the current auto-regressive text-to-image generation. In this paper, we propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding and allowing the model to generate diverse images. Specifically, SJD facilitates the model to predict multiple tokens at each step and accepts tokens based on the probabilistic criterion, enabling the model to generate images with fewer steps than the conventional next-token-prediction paradigm. We also investigate the token initialization strategies that leverage the spatial locality of visual data to further improve the acceleration ratio under specific scenarios. We conduct experiments for our proposed SJD on multiple auto-regressive text-to-image generation models, showing the effectiveness of model acceleration without sacrificing the visual quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The problems the paper attempts to solve

The paper addresses the slow inference of autoregressive text-to-image generation models. Although current large-scale autoregressive models can generate high-quality, high-resolution images, they require hundreds or even thousands of next-token prediction steps during inference, which consumes a great deal of time. Existing research has used the Jacobi Decoding algorithm to accelerate autoregressive generation, and it can be executed without additional training. However, Jacobi Decoding depends on a deterministic convergence criterion. This makes it suitable for greedy decoding but incompatible with sampling-based decoding, which is crucial for the visual quality and diversity of current autoregressive text-to-image generation.

To overcome this problem, the paper proposes a new training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD). By introducing a probabilistic convergence criterion, SJD accelerates the inference of autoregressive text-to-image generation while preserving the randomness of sampling, so the model can still generate diverse images. Specifically, SJD lets the model predict multiple tokens at each step and accept them according to the probabilistic criterion, which reduces the number of steps compared with the conventional one-token-at-a-time paradigm. The paper also explores token initialization strategies that exploit the spatial locality of visual data to further improve the acceleration ratio.

### Main contributions

1. **Proposing a new probabilistic multi-token decoding algorithm**: Speculative Jacobi Decoding (SJD) accelerates the inference of autoregressive image generation. It solves the problem that existing Jacobi Decoding cannot be applied to modern autoregressive text-to-image generation models that rely on sampling-based decoding.
2. **No additional training required**: Unlike existing speculative decoding methods, SJD does not need to train an additional model to predict draft tokens.
3. **Experimental verification**: Experimental results show that SJD can approximately double the inference speed of multiple autoregressive text-to-image generation models with almost no loss in the quality of the generated images. In some scenarios containing simple patterns, the acceleration ratio can exceed 3 times.

### Related work

- **Autoregressive image generation**: Early works such as PixelCNNs and PixelSNAIL use autoregressive strategies and convolutional neural networks to model image generation in the discrete pixel space. Models such as DALL-E and CogView compress RGB images into image tokens through discrete auto-encoders and use large-scale autoregressive models for prediction. Parti uses a transformer encoder to provide text features for text-to-image generation. LlamaGen, MARS, Chameleon, Anole and Lumina-mGPT have been extended and optimized on different datasets and tasks.
- **Acceleration of image generation models**: Acceleration methods for diffusion models have been widely studied, mainly focusing on shortening the denoising trajectory and reducing the computational complexity. In contrast, there are fewer studies on accelerating autoregressive image generation models, mainly due to the lack of powerful base models. In early research, Jacobi Decoding was applied to PixelCNNs to accelerate inference, but it lacked a careful design for random token sampling, which limits its acceleration effect on modern autoregressive models.
- **Acceleration of language models**: The autoregressive paradigm is very common in language processing, and many works focus on model compression, activation sparsification, quantization, etc. Some studies accelerate inference by predicting multiple tokens in parallel with multiple decoders, but this requires more memory. Speculative sampling methods use a small-scale language model to draft a sequence for a large-scale language model, which then verifies the draft in a single forward pass and samples part of it as part of the final output.

### Method overview

- **Autoregressive text-to-image generation**: The autoregressive text-to-image generation model consists of three components: a discrete image tokenizer, an autoregressive transformer generator, and an image decoder. The autoregressive transformer is the most time-consuming part and is responsible for predicting discrete image tokens according to text prompts.
- **Jacobi Decoding**: Jacobi Decoding treats autoregressive inference as finding the fixed point of a triangular system of nonlinear equations. The algorithm iteratively performs multi-token decoding without fine-tuning or auxiliary modules.
- **Speculative Jacobi Decoding**: By introducing a probabilistic convergence criterion, SJD allows the model to predict multiple tokens at each iteration and accept them according to that criterion.
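
The two decoding ideas in the method overview can be illustrated with a small Python sketch. It is not the paper's implementation. `toy_next_token` is a hypothetical deterministic stand-in for a real model's greedy output, and `speculative_accept` assumes that the probabilistic criterion follows the standard speculative-sampling accept/reject rule, which this summary does not spell out. All names and the vocabulary size are illustrative.

```python
import random
from typing import List, Sequence, Tuple

VOCAB = 16  # hypothetical toy vocabulary size


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for the greedy (argmax) output of a real model."""
    return (3 * sum(prefix) + len(prefix)) % VOCAB


def sequential_decode(prompt: Sequence[int], n: int) -> List[int]:
    """Baseline: conventional next-token prediction, one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: Sequence[int], n: int) -> Tuple[List[int], int]:
    """Greedy Jacobi decoding: refine all n tokens together until a fixed point."""
    draft = [0] * n  # arbitrary initial guess for every position
    iterations = 0
    while True:
        iterations += 1
        # Token i is recomputed from the prompt plus the previous draft[:i].
        # A real transformer would produce all n predictions in one forward pass.
        new = [toy_next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:  # deterministic convergence criterion
            return new, iterations
        draft = new


def speculative_accept(
    token: int,
    p_old: Sequence[float],
    p_new: Sequence[float],
    rng: random.Random,
) -> Tuple[int, bool]:
    """Assumed probabilistic criterion (standard speculative-sampling rule).

    `token` was drafted from `p_old` (previous iteration); `p_new` is the
    distribution at the same position in the current iteration.
    """
    if rng.random() < min(1.0, p_new[token] / p_old[token]):
        return token, True  # draft token accepted
    # Rejected: resample from the normalized positive part of p_new - p_old.
    residual = [max(0.0, a - b) for a, b in zip(p_new, p_old)]
    return rng.choices(range(len(p_new)), weights=residual)[0], False
```

In this toy setting, `jacobi_decode` returns the same tokens as `sequential_decode` and needs at most `n + 1` passes, because each pass fixes at least one more leading token. A speedup only appears when several tokens settle in the same iteration. The deterministic equality check cannot be used when tokens are sampled, which is the gap the probabilistic accept/reject step is meant to fill.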