Abstract: Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiments demonstrate that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Image Tokenization with High Compression Ratio**: How to tokenize images efficiently at a high compression ratio for generative models, thereby reducing computational cost. Existing VQ-VAE and KL-VAE methods lose significant reconstruction quality as the compression ratio increases.
2. **Improving the Quality of the Latent Space**: How to make the latent representations learned by the tokenizer more discriminative, so that they better support downstream generative tasks. Existing methods learn weaker latent spaces, particularly under discrete quantization and Gaussian constraints.

### Solutions

To address these problems, the authors propose SoftVQ-VAE, a continuous image tokenizer with the following characteristics:

- **Soft Categorical Posterior**: With a soft categorical posterior, each latent token adaptively aggregates multiple learnable codewords, which greatly increases the representation capacity of the latent space:

  \[ q_\phi(z|x)=\text{Softmax}\left(-\frac{\|\hat{z}-C\|^2}{\tau}\right) \]

  where $\hat{z}$ is the encoder output, $C$ is the set of learnable codewords, and $\tau$ is the temperature parameter, which controls the sharpness of the softmax probabilities.
- **Fully Differentiable Design**: SoftVQ-VAE is fully differentiable and does not need the codebook loss or commitment loss used by VQ-VAE. The encoder and the codewords can therefore be optimized directly from the reconstruction loss and other losses.
- **Representation Alignment**: The quality of the latent space is further improved by aligning it with the features of a pre-trained visual encoder. Specifically, each latent token is duplicated and aligned with the image features, which encourages the latent space to capture semantically discriminative features.
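
The soft categorical posterior described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' released implementation: the softmax is assumed to run over the codeword axis, the aggregated latent is assumed to be the posterior-weighted sum of codewords (following the abstract's description of aggregating multiple codewords into each token), and the function names `soft_posterior` and `soft_quantize`, the codebook size, and the latent width are all hypothetical.

```python
import numpy as np

def soft_posterior(z_hat, codebook, tau=1.0):
    """Softmax over negative squared distances between tokens and codewords.

    z_hat:    (num_tokens, dim) encoder outputs
    codebook: (num_codewords, dim) learnable codewords
    """
    # Pairwise squared Euclidean distances, shape (num_tokens, num_codewords).
    diff = z_hat[:, None, :] - codebook[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    logits = -sq_dist / tau
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def soft_quantize(z_hat, codebook, tau=1.0):
    """Assumed aggregation step: each token is a weighted mix of codewords."""
    q = soft_posterior(z_hat, codebook, tau)
    return q @ codebook, q

rng = np.random.default_rng(0)
z_hat = rng.normal(size=(32, 8))     # 32 latent tokens; width 8 is hypothetical
codebook = rng.normal(size=(16, 8))  # 16 codewords; size is hypothetical
z, q = soft_quantize(z_hat, codebook, tau=0.5)
print(z.shape, q.shape)
```

Every operation here is differentiable with respect to both `z_hat` and `codebook`, which is consistent with the claim that no separate codebook or commitment loss is needed. As `tau` shrinks, each row of `q` concentrates on the nearest codeword, so the soft assignment approaches a hard nearest-codeword lookup.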

### Experimental Results

Experiments show that SoftVQ-VAE reaches a high compression ratio while maintaining high-quality reconstruction, and that it significantly improves the efficiency and performance of generative models. For example, inference throughput increases by up to 18x for 256×256 images and up to 55x for 512×512 images, with competitive FID scores of 1.78 and 2.21 respectively for SiT-XL. SoftVQ-VAE also improves training efficiency, reducing the number of training iterations by 2.3x while maintaining comparable performance.

In summary, the paper tackles efficient image tokenization and latent space quality by proposing SoftVQ-VAE, paving the way for more efficient generative models.

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

An Image is Worth 32 Tokens for Reconstruction and Generation

MaskBit: Embedding-free Image Generation via Bit Tokens

Factorized Visual Tokenization and Generation

Image Understanding Makes for A Good Tokenizer for Image Generation

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers

3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation

Image and Video Tokenization with Binary Spherical Quantization

A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Efficient Online Inference of Vision Transformers by Training-Free Tokenization

Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer

Efficient Vision Transformer via Token Merger

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

All in Tokens: Unifying Output Space of Visual Tasks Via Soft Token

Wavelet-Based Image Tokenizer for Vision Transformers

Make A Long Image Short: Adaptive Token Length for Vision Transformers

ImageFolder: Autoregressive Image Generation with Folded Tokens