Abstract: Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto-regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating in extremely low bandwidth, rooted in the sensitivity of the VQ codebook to noise. This degradation poses significant challenges for several downstream tasks, such as codec-based speech synthesis. To address this issue, we propose a novel VQ method, Normal Distribution-based Vector Quantization (NDVQ), by introducing an explicit margin between the VQ codes via learning a variance. Specifically, our approach involves mapping the waveform to a latent space and quantizing it by selecting the most likely normal distribution, with each codebook entry representing a unique normal distribution defined by its mean and variance. Using these distribution-based VQ codec codes, a decoder reconstructs the input waveform. NDVQ is trained with additional distribution-related losses, alongside reconstruction and discrimination losses. Experiments demonstrate that NDVQ outperforms existing audio compression baselines, such as EnCodec, in terms of audio quality and zero-shot TTS, particularly in very low bandwidth scenarios.

What problem does this paper attempt to address?

This paper addresses the poor perceptual quality and high signal distortion of existing audio codec models under extremely low-bandwidth conditions. In particular, the vector quantization (VQ) codebook is sensitive to noise in the input, which leads to codebook collapse and low codebook utilization. These problems degrade downstream tasks such as codec-based speech synthesis.

To address these challenges, the authors propose a new vector quantization method, Normal Distribution-based Vector Quantization (NDVQ). NDVQ improves the robustness and generalization ability of the model by learning a variance for each code in the codebook, which establishes an explicit margin between codes. Specifically, NDVQ maps the waveform to a latent space and quantizes it by selecting the most likely normal distribution; each codebook entry is a unique normal distribution defined by its mean and variance. The authors report that NDVQ outperforms existing audio compression baselines such as EnCodec in audio quality and zero-shot text-to-speech (TTS), particularly in very low-bandwidth scenarios.

### Main problem summary:

1. **Perceptual quality and signal distortion**: Existing models perform poorly in perceptual quality and signal distortion under extremely low-bandwidth conditions.
2. **Codebook sensitivity to noise**: Traditional VQ methods are sensitive to noise, which easily leads to codebook collapse and low codebook utilization.
3. **Impact on downstream tasks**: These problems degrade tasks such as codec-based speech synthesis.

### Solutions:

- **Introduce Normal Distribution-based Vector Quantization (NDVQ)**: Learn a variance for each code to establish an explicit margin between codes, improving the robustness and generalization ability of the model.
- **Improve the quantization process**: Use the probability density function to select the most likely distribution for each latent vector, then sample from it with the reparameterization technique to obtain the quantization result.
- **Optimize the training objective**: Train with a combination of reconstruction loss, discriminative loss, and a modified codebook loss to maintain performance in low-bandwidth scenarios.

With these changes, NDVQ delivers better audio quality and stronger robustness in extremely low-bandwidth scenarios, and it performs especially well on zero-shot TTS.
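The quantization step described above (pick the most likely normal distribution, then sample with the reparameterization technique) can be illustrated with a minimal sketch. This is not the authors' implementation: the diagonal-Gaussian parameterization via log-variance, the function names `log_density` and `quantize`, the toy two-entry codebook, and the fallback to the mean when no random generator is supplied are all assumptions made for illustration.

```python
import math
import random


def log_density(z, mean, log_var):
    """Log-density of z under a diagonal Gaussian N(mean, exp(log_var)).

    Assumption: each codebook entry stores a mean and a per-dimension
    log-variance; the source only states "mean and variance".
    """
    total = 0.0
    for x, m, lv in zip(z, mean, log_var):
        total += -0.5 * (math.log(2.0 * math.pi) + lv + (x - m) ** 2 / math.exp(lv))
    return total


def quantize(z, codebook, rng=None):
    """Select the most likely codebook distribution for z, then draw from it.

    Returns (index, vector). If rng is None, the mean is returned instead of
    a sample (an assumption for deterministic use, not taken from the source).
    """
    scores = [log_density(z, mean, log_var) for mean, log_var in codebook]
    index = max(range(len(codebook)), key=scores.__getitem__)
    mean, log_var = codebook[index]
    if rng is None:
        return index, list(mean)
    # Reparameterization: sample = mean + std * eps, with eps ~ N(0, 1).
    sample = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
              for m, lv in zip(mean, log_var)]
    return index, sample


# Hypothetical toy codebook: (mean, log-variance) pairs in a 2-D latent space.
codebook = [
    ([0.0, 0.0], [0.0, 0.0]),    # wide entry, unit variance
    ([3.0, 3.0], [-2.0, -2.0]),  # narrow entry, small variance
]

index, mean_out = quantize([0.4, -0.2], codebook)  # nearest to entry 0
index_sampled, sample = quantize([2.9, 3.1], codebook, rng=random.Random(0))
print(index, mean_out, index_sampled, sample)
```

Because the score includes the variance term, an entry with a small variance only wins for latents that lie close to its mean, which is one way to read the "explicit margin" between codes. In a trainable model the same computation would be written with a tensor library so that gradients flow through the means and variances.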

NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization
