GIVT: Generative Infinite-Vocabulary Transformers

Michael Tschannen, Cian Eastwood, Fabian Mentzer
2024-07-18
Abstract: We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a $\beta$-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks how to remove the quantization step from generative transformers entirely, so that the model directly generates a sequence of continuous-valued vectors instead of selecting discrete tokens from a finite vocabulary. To this end, it introduces Generative Infinite-Vocabulary Transformers (GIVT), which generate sequences of vectors with real-valued entries without relying on a quantized latent space (such as that of a VQ-VAE).

GIVT achieves this through two simple modifications:

1. **Input layer**: At the input, GIVT replaces the finite-vocabulary lookup table with a linear projection that embeds the input vectors directly.
2. **Output layer**: At the output, GIVT replaces the logits prediction, which is usually mapped to a categorical distribution, with a prediction of the parameters of a multivariate Gaussian mixture model (GMM).

With these modifications, GIVT outperforms VQ-GAN and MaskGIT in image generation and is competitive with recent latent diffusion models at high resolutions. GIVT also performs well on panoptic segmentation and depth estimation.

### Main contributions

1. **Image generation performance**: GIVT outperforms VQ-GAN and its improved variants, as well as MaskGIT, in class-conditional image generation, and is competitive with a strong latent diffusion baseline at high resolutions.
2. **Sampling techniques**: The paper derives variants of standard sampling techniques for the continuous case, such as temperature sampling, beam search, and classifier-free guidance (CFG), and demonstrates their effectiveness.
3. **Representation learning**: GIVT matches or exceeds previous sequential image generation models in representation learning, at a significantly reduced computational cost.
4. **Dense prediction tasks**: GIVT performs well on dense prediction tasks such as panoptic segmentation and monocular depth estimation, with results comparable to the VQ-based UViM method.

### Method overview

1. **VAE training**: First, train a β-VAE with a continuous latent space, a Gaussian encoder, and a Gaussian prior. Given an input image \(x\), the encoder predicts the mean \(\mu\) and covariance \(\sigma\), and a representation \(z\) is sampled from \(\mathcal{N}(\mu,\sigma)\) using the reparameterization trick. The VAE decoder maps the latent sequence back to the image.
2. **GIVT training**: Next, train GIVT to model \(p(z)\), or \(p(z|c)\) when a conditioning signal \(c\) is available (for example, the class label in class-conditional image generation). The latent representation \(z\) is reshaped into a sequence of \(hw\) real-valued vectors of dimension \(d\). Because this sequence is real-valued rather than discrete, GIVT makes two small changes to the standard transformer: a linear layer replaces the embedding lookup table at the input, and the output predicts the parameters of a continuous distribution.
3. **Adapter**: To better match the latent distributions of the VAE and GIVT, a small invertible flow model (the "adapter") maps the VAE latent sequence to a new latent space.
4. **Inference**: At inference time, sample a latent sequence from GIVT and decode it into an image with the VAE decoder. The paper explores several sampling strategies, including temperature sampling, beam search, and distribution-based classifier-free guidance (DB-CFG).

With these components, GIVT performs strongly across several vision tasks, particularly image generation and representation learning.
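
The two modifications described above (a linear input projection and a Gaussian mixture output head) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The function names, the diagonal-covariance mixture, the softplus scale parameterization, and the temperature rule (scaling each component's standard deviation) are all assumptions made for illustration.

```python
import math
import random

def linear_embed(z, W, b):
    """Project a d-dim real-valued latent to the model width.
    Stands in for the token-embedding lookup (hypothetical helper)."""
    return [sum(w * z_j for w, z_j in zip(row, z)) + b_i for row, b_i in zip(W, b)]

def softplus(x):
    # Numerically stable softplus; used here to keep scales positive (assumption).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def gmm_params(out, k, d):
    """Split a raw output vector of size k + 2*k*d into mixture weights,
    per-component means, and per-component diagonal scales."""
    assert len(out) == k + 2 * k * d
    logits = out[:k]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over the k components
    means = [out[k + i * d: k + (i + 1) * d] for i in range(k)]
    off = k + k * d
    scales = [[softplus(v) + 1e-6 for v in out[off + i * d: off + (i + 1) * d]]
              for i in range(k)]
    return weights, means, scales

def gmm_nll(z, weights, means, scales):
    """Negative log-likelihood of z under a diagonal-covariance mixture;
    this plays the role that cross-entropy plays for discrete tokens."""
    log_terms = []
    for w, mu, sigma in zip(weights, means, scales):
        log_p = math.log(w)
        for z_j, mu_j, s_j in zip(z, mu, sigma):
            log_p += (-0.5 * math.log(2 * math.pi) - math.log(s_j)
                      - 0.5 * ((z_j - mu_j) / s_j) ** 2)
        log_terms.append(log_p)
    m = max(log_terms)  # log-sum-exp for numerical stability
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))

def sample_gmm(weights, means, scales, rng, temperature=1.0):
    """Pick a component, then sample from its Gaussian.
    Scaling the standard deviation by `temperature` is an illustrative choice."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(mu_j, temperature * s_j) for mu_j, s_j in zip(means[i], scales[i])]

# Usage: a 2-component mixture over 1-dim latents, built from a raw output vector.
weights, means, scales = gmm_params([0.0, 0.0, 1.0, -1.0, 0.0, 0.0], k=2, d=1)
loss = gmm_nll([0.9], weights, means, scales)
next_latent = sample_gmm(weights, means, scales, random.Random(0), temperature=0.9)
```

In an autoregressive setup, `gmm_params` would be applied to the transformer output at each position, `gmm_nll` would be summed over the sequence during training, and `sample_gmm` would produce the next latent vector at inference before it is fed back through `linear_embed`.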