Abstract: We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a $\beta$-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.

What problem does this paper attempt to address?

The paper addresses how to remove the quantization step from generative transformers entirely, so that the model directly generates a sequence of continuous-valued vectors rather than selecting discrete tokens from a finite vocabulary. Specifically, it introduces Generative Infinite-Vocabulary Transformers (GIVT), which generate sequences of vectors with real-valued entries without relying on a quantized tokenizer such as VQ-VAE. GIVT achieves this through two simple modifications:

1. **Input layer**: At the input, GIVT replaces the finite-vocabulary lookup table with a linear projection that embeds the input vectors directly.
2. **Output layer**: At the output, GIVT replaces the logits prediction, which is usually mapped to a categorical distribution, with a prediction of the parameters of a multivariate Gaussian mixture model (GMM).

With these modifications, GIVT outperforms VQ-GAN and MaskGIT in image generation and is competitive with recent latent diffusion models at high resolutions. GIVT also performs well in panoptic segmentation and depth estimation.

### Main contributions

1. **Image generation performance**: GIVT significantly outperforms VQ-GAN and its variants, as well as MaskGIT, in class-conditional image generation, and is competitive with a strong latent diffusion model baseline at high resolutions.
2. **Sampling techniques**: The paper derives variants of standard sampling techniques for the continuous case, such as temperature sampling, beam search, and classifier-free guidance (CFG), and demonstrates their effectiveness.
3. **Representation learning**: GIVT matches or exceeds previous sequential image generation models in representation learning, at a significantly reduced computational cost.
4. **Dense prediction tasks**: GIVT performs well in dense prediction tasks such as semantic segmentation and monocular depth estimation, with results comparable to the VQ-based UViM method.

### Method overview

1. **VAE training**: First, train a β-VAE with a continuous latent space, using a Gaussian encoder and prior. Given an input image $x$, the encoder predicts the mean $\mu$ and covariance $\sigma$, and the reparameterization trick is used to sample a representation $z$ from $\mathcal{N}(\mu,\sigma)$. The VAE decoder maps the latent sequence back to the image.
2. **GIVT training**: Next, train GIVT to predict $p(z)$, or $p(z|c)$ when a conditioning signal $c$ is available (for example, the class label in class-conditional image generation). The latent representation $z$ is reshaped into a sequence of $hw$ real-valued vectors, each of dimension $d$. Because the sequence elements are real-valued vectors rather than discrete tokens, GIVT makes two small changes to a standard transformer: it replaces the embedding lookup table with a linear layer at the input and predicts the parameters of a continuous distribution at the output.
3. **Adapter**: To better match the latent distributions of the VAE and GIVT, a small invertible flow model (the "adapter") maps the VAE latent sequence to a new latent space.
4. **Inference**: At inference time, sample a latent sequence from GIVT and decode it into an image with the VAE decoder. The paper explores different sampling strategies, such as temperature sampling, beam search, and distribution-based classifier-free guidance (DB-CFG).

With this pipeline, GIVT achieves strong results across multiple visual tasks, particularly image generation and representation learning.
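
The two modifications described above (a linear input projection and a GMM output head) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. All names (`linear`, `split_gmm_params`, `gmm_log_prob`, `sample_gmm`, `std_scale`) are hypothetical, the transformer body is replaced by a single `tanh` placeholder, the mixture is assumed to have diagonal covariance, and `std_scale` is only a rough stand-in for the temperature sampling variant the summary mentions.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map; weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softplus(a):
    """Numerically stable log(1 + exp(a)); used to keep scales positive."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]


def split_gmm_params(raw, k, d):
    """Split a flat vector of length k + 2*k*d into logits, means, scales.

    Assumption: k components, each a d-dimensional diagonal Gaussian.
    """
    assert len(raw) == k + 2 * k * d
    logits = raw[:k]
    means = [raw[k + i * d: k + (i + 1) * d] for i in range(k)]
    off = k + k * d
    scales = [[softplus(a) + 1e-6 for a in raw[off + i * d: off + (i + 1) * d]]
              for i in range(k)]
    return logits, means, scales


def gmm_log_prob(z, logits, means, scales):
    """Log-density of vector z under the diagonal Gaussian mixture."""
    comp = []
    for lw, mu, sd in zip(log_softmax(logits), means, scales):
        ll = sum(-0.5 * ((zi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
                 for zi, m, s in zip(z, mu, sd))
        comp.append(lw + ll)
    top = max(comp)
    return top + math.log(sum(math.exp(c - top) for c in comp))


def sample_gmm(logits, means, scales, rng, std_scale=1.0):
    """Pick a component by its mixture weight, then sample each dimension."""
    weights = [math.exp(lw) for lw in log_softmax(logits)]
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, s * std_scale) for m, s in zip(means[i], scales[i])]


rng = random.Random(0)
d, k, hidden = 2, 3, 4  # latent dim, mixture components, model width (toy sizes)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


w_in, b_in = rand_matrix(hidden, d), [0.0] * hidden
w_out, b_out = rand_matrix(k + 2 * k * d, hidden), [0.0] * (k + 2 * k * d)

z_prev = [0.3, -1.2]                                     # previous real-valued latent vector
h = [math.tanh(v) for v in linear(z_prev, w_in, b_in)]   # placeholder for the transformer
logits, means, scales = split_gmm_params(linear(h, w_out, b_out), k, d)

z_target = [0.1, 0.4]
nll = -gmm_log_prob(z_target, logits, means, scales)     # training loss for one position
z_next = sample_gmm(logits, means, scales, rng)          # one autoregressive sampling step
```

In this sketch, training would minimize `nll` summed over sequence positions, and generation would feed `z_next` back in as the next `z_prev`. Setting `std_scale` below 1 concentrates samples around the component means.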

GIVT: Generative Infinite-Vocabulary Transformers
