Abstract: We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100$\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4$\times$ throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.
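The abstract mentions block-wise causal masking only by name. As a rough illustration, the sketch below builds such an attention mask with NumPy. It assumes that each video frame forms one block, and that a token may attend to every token in its own frame and in earlier frames. The function name `blockwise_causal_mask` and the frame and token counts are hypothetical and do not come from the paper.

```python
import numpy as np

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Return a boolean (N, N) mask with N = num_frames * tokens_per_frame.

    mask[i, j] is True when token i is allowed to attend to token j.
    Assumption (not stated in the source): one block per frame, full attention
    inside a block, causal attention across blocks.
    """
    # Frame index of every token, e.g. [0, 0, 1, 1, 2, 2] for 3 frames x 2 tokens.
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # Token i may see token j if j's frame is not later than i's frame.
    return frame_id[:, None] >= frame_id[None, :]

mask = blockwise_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
```

Under this assumption, the mask for a shorter clip is the top-left corner of the mask for a longer one, which is one way to see why such masking is compatible with variable-length input.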

What problem does this paper attempt to address?

The main problems that this paper attempts to solve are the inefficiency, poor scalability, and unsatisfactory compression performance of existing image and video tokenization techniques when dealing with high-dimensional visual data. Specifically:

1. **Efficiency and Scalability**: Traditional vector quantization methods (such as VQ-VAE) require complex architectural adjustments when processing videos, in order to move from spatial convolution to spatio-temporal convolution. This not only increases the computational cost but also leads to sub-optimal quantization results. Moreover, as the codebook size increases, the running time of vector quantization methods grows linearly, and these methods are prone to overfitting on small datasets. This is especially true for video inputs, which require larger codebooks to represent both static visual patterns and dynamic motion patterns.
2. **Compression Performance**: When compressing visual data, existing tokenization methods can achieve a high compression ratio, but often with large distortion, which degrades the reconstruction quality.

To address these challenges, the paper proposes a novel image and video tokenizer based on Vision Transformer and Binary Spherical Quantization (BSQ). BSQ projects high-dimensional visual embeddings onto a low-dimensional hypersphere and then applies binary quantization, yielding a parameter-efficient, scalable, and compact tokenization scheme. This method can not only significantly improve the quality of visual reconstruction but also achieve up to 100-fold data compression while maintaining low distortion.

The main contributions in the paper include:

- **Parameter-Efficiency**: BSQ does not require an explicit codebook, so it is more parameter-efficient.
- **Scalability**: It scales to arbitrary token dimensions and supports variable-length videos as input.
- **Compression Performance**: It can achieve efficient compression of visual data with minimal distortion.
- **Reconstruction Quality**: In image and video reconstruction benchmarks, BSQ-ViT achieves state-of-the-art visual reconstruction quality and also performs well in video compression.

In conclusion, by introducing BSQ, this paper provides a new solution that effectively addresses the deficiencies of existing tokenization techniques in terms of efficiency, scalability, and compression performance.
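The quantization step described above (project, normalize onto a hypersphere, binarize) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the projection matrix `proj` is random rather than learned, the sizes `D = 16` and `L = 4` are arbitrary, and rescaling the signs by `1 / sqrt(L)` so that the code stays on the unit sphere is an assumption of this sketch. The last part checks that taking signs gives the same result as a nearest-neighbour search over an explicit codebook of all `2**L` corners, which is why no codebook needs to be stored.

```python
import itertools
import numpy as np

def bsq_quantize(z: np.ndarray, proj: np.ndarray):
    """Sketch of binary spherical quantization.

    z:    (N, D) high-dimensional embeddings.
    proj: (D, L) projection to L dimensions (hypothetical; random here).
    Returns the unit vectors u, their quantized versions u_hat, and integer token ids.
    """
    L = proj.shape[1]
    v = z @ proj                                        # project down to L dimensions
    u = v / np.linalg.norm(v, axis=-1, keepdims=True)   # place on the unit hypersphere
    bits = u > 0                                        # one bit per dimension (0 maps to -1)
    # Assumption: scale by 1/sqrt(L) so the quantized vector also has unit norm.
    u_hat = np.where(bits, 1.0, -1.0) / np.sqrt(L)
    # Pack the L bits into one integer token id in [0, 2**L).
    token_ids = bits.astype(np.int64) @ (1 << np.arange(L))
    return u, u_hat, token_ids

rng = np.random.default_rng(0)
D, L, N = 16, 4, 8
z = rng.normal(size=(N, D))
proj = rng.normal(size=(D, L))
u, u_hat, token_ids = bsq_quantize(z, proj)

# Explicit codebook with all 2**L corners, built only to check the equivalence.
codebook = np.array(list(itertools.product([-1.0, 1.0], repeat=L))) / np.sqrt(L)
dists = ((u[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
nearest = codebook[np.argmin(dists, axis=1)]
print(np.allclose(nearest, u_hat))
```

The explicit search touches all `2**L` codewords per input, while the sign operation touches only `L` coordinates. That difference is the intuition behind the summary's point that lookup cost grows with codebook size in conventional vector quantization.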

Image and Video Tokenization with Binary Spherical Quantization
