Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
2024-06-12
Abstract: We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100$\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4$\times$ throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.
Computer Vision and Pattern Recognition, Information Theory, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
The main problems this paper attempts to solve are the inefficiency, poor scalability, and unsatisfactory compression performance of existing image and video tokenization techniques on high-dimensional visual data. Specifically:

1. **Efficiency and Scalability**: Traditional vector quantization methods (such as VQ-VAE) require complex architectural adjustments to process videos, because spatial convolutions must be converted to spatio-temporal convolutions. This increases the computational cost and leads to sub-optimal quantization results. Moreover, the running time of vector quantization grows linearly with the codebook size, and the methods become prone to overfitting on small datasets. Video inputs make this worse, because they require larger codebooks to represent both static visual patterns and dynamic motion patterns.
2. **Compression Performance**: When compressing visual data, existing tokenization methods can achieve a high compression ratio, but often at the cost of large distortion, which degrades the reconstruction quality.

To address these challenges, the paper proposes a novel image and video tokenizer based on Vision Transformer and Binary Spherical Quantization (BSQ). BSQ projects high-dimensional visual embeddings onto a low-dimensional hypersphere and then applies binary quantization, yielding a parameter-efficient, scalable, and compact tokenization scheme. The method improves visual reconstruction quality and achieves up to 100-fold data compression while maintaining low distortion.

The main contributions of the paper include:

- **Parameter Efficiency**: BSQ does not require an explicit codebook, so it is more parameter-efficient.
- **Scalability**: It scales to arbitrary token dimensions and supports variable-length videos as input.
- **Compression Performance**: It compresses visual data efficiently with minimal distortion.
- **Reconstruction Quality**: On image and video reconstruction benchmarks, BSQ-ViT achieves state-of-the-art visual reconstruction quality, and it also performs well in video compression.

In conclusion, by introducing BSQ, this paper provides a new solution that addresses the deficiencies of existing tokenization techniques in efficiency, scalability, and compression performance.
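
The project-normalize-binarize step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `bsq_quantize`, the random matrix `proj` (standing in for a learned linear projection), and the `1/sqrt(L)` scaling that keeps the quantized vector on the unit sphere are all assumptions made for illustration. Training details such as gradient handling through the quantizer are left out.

```python
import numpy as np

def bsq_quantize(z, proj, eps=1e-12):
    """Hypothetical sketch of binary spherical quantization.

    z:    (N, d) array of high-dimensional embeddings.
    proj: (d, L) projection matrix (assumed; learned in a real model).
    Returns the quantized unit vectors, the bit codes, and integer token ids.
    """
    L = proj.shape[1]
    v = z @ proj                                           # project to L dimensions
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    u = v / np.maximum(norm, eps)                          # place on the unit hypersphere
    bits = (u > 0).astype(np.int64)                        # one bit per dimension
    u_hat = (2 * bits - 1) / np.sqrt(L)                    # corner at +-1/sqrt(L), unit norm
    index = bits @ (2 ** np.arange(L, dtype=np.int64))     # pack bits into a token id
    return u_hat, bits, index

# Usage with made-up sizes: 4 embeddings of dimension 16, quantized to L = 8 bits.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
proj = rng.normal(size=(16, 8))
u_hat, bits, index = bsq_quantize(z, proj)
print(bits.shape, index.shape)
```

In this sketch the set of possible codes is the 2^L sign patterns, so nothing beyond the projection needs to be stored, which is one way to read the "no explicit codebook" claim.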