Abstract: Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, and it has become fundamental in unsupervised representation learning and latent generative models. However, VQ models are often hindered by representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically reduce the dimensionality of the latent space at the expense of model capacity, which does not fully resolve the core issue. In this study, we conduct a theoretical analysis of representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors is updated through gradient descent. To address this issue, we propose \textbf{SimVQ}, a novel method that reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the \textit{entire linear space} spanned by the codebook, rather than merely updating \textit{the code vector} selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the multiplication of two linear matrices is equivalent to applying a single linear layer, our approach works surprisingly well in resolving the collapse issue in VQ models with just one linear layer. We validate the efficacy of SimVQ through extensive experiments across various modalities, including image and audio data with different model architectures. Our code is available at \url{https://github.com/youngsheen/SimVQ}.
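
The contrast drawn in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the repository above for that). It assumes a plain squared-error quantization loss, a hypothetical basis `B`, and a single linear map `W` initialized to the identity, and it updates only `W`. The function names, sizes, and learning rate are all illustrative assumptions. The sketch compares which code vectors move after one gradient step in each setting.

```python
import numpy as np

def nearest_code(z, codebook):
    """Index of the nearest code vector (squared Euclidean) for each row of z."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def vanilla_step(z, C, lr=0.1):
    """One gradient step on sum_i ||C[idx_i] - z_i||^2 with respect to C.

    Only the rows of C picked by the nearest-neighbour search get a gradient.
    """
    idx = nearest_code(z, C)
    grad = np.zeros_like(C)
    # Accumulate gradients per selected row (several latents may share a code).
    np.add.at(grad, idx, 2.0 * (C[idx] - z))
    return C - lr * grad, idx

def reparam_step(z, B, W, lr=0.1):
    """One gradient step on sum_i ||B[idx_i] @ W - z_i||^2 with respect to W.

    The effective codebook is B @ W, so changing W shifts every code vector.
    Assumption for this sketch: B is held fixed and only W is updated.
    """
    idx = nearest_code(z, B @ W)
    residual = B[idx] @ W - z
    grad_W = 2.0 * B[idx].T @ residual
    return W - lr * grad_W, idx

rng = np.random.default_rng(0)
K, D, N = 8, 4, 5              # hypothetical codebook size, latent dim, batch size
z = rng.normal(size=(N, D))    # stand-in for encoder outputs
C = rng.normal(size=(K, D))    # vanilla codebook, reused as the basis B
W = np.eye(D)                  # identity init, so B @ W equals C at the start

C_new, idx = vanilla_step(z, C)
W_new, idx_r = reparam_step(z, C, W)

# Which code vectors changed after one step in each setting?
vanilla_moved = np.abs(C_new - C).max(axis=1) > 1e-12
reparam_moved = np.abs(C @ W_new - C @ W).max(axis=1) > 1e-12
print("vanilla rows moved:", int(vanilla_moved.sum()), "of", K)
print("reparameterized rows moved:", int(reparam_moved.sum()), "of", K)
```

In the vanilla step at most `N` rows can move, because only selected codes receive a gradient. In the reparameterized step the update to `W` is shared by every row of `B @ W`, so unselected codes move as well.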

Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

Resizing codebook of vector quantization without retraining

Online Clustered Codebook

ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

LG-VQ: Language-Guided Codebook Learning

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

An Efficient Codebook Search Algorithm for Line Spectrum Frequency (LSF) Vector Quantization in Speech Codec

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders

Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

Codebook Sharing in Multi-Stage Vector Quantization

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Robust Semantic Communications with Masked VQ-VAE Enabled Codebook

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Vector Quantization: a Review