NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization

Zhikang Niu, Sanyuan Chen, Long Zhou, Ziyang Ma, Xie Chen, Shujie Liu
2024-09-19
Abstract: Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto-regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating in extremely low bandwidth, rooted in the sensitivity of the VQ codebook to noise. This degradation poses significant challenges for several downstream tasks, such as codec-based speech synthesis. To address this issue, we propose a novel VQ method, Normal Distribution-based Vector Quantization (NDVQ), by introducing an explicit margin between the VQ codes via learning a variance. Specifically, our approach involves mapping the waveform to a latent space and quantizing it by selecting the most likely normal distribution, with each codebook entry representing a unique normal distribution defined by its mean and variance. Using these distribution-based VQ codec codes, a decoder reconstructs the input waveform. NDVQ is trained with additional distribution-related losses, alongside reconstruction and discrimination losses. Experiments demonstrate that NDVQ outperforms existing audio compression baselines, such as EnCodec, in terms of audio quality and zero-shot TTS, particularly in very low bandwidth scenarios.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper targets the poor perceptual quality and signal distortion of existing audio codec models under extremely low-bandwidth conditions. In particular, the vector quantization (VQ) codebook is sensitive to noise in the input data, which leads to codebook collapse and low codebook utilization. These problems seriously affect the quality of downstream tasks such as codec-based speech synthesis. To address them, the authors propose a new vector quantization method, normal distribution-based vector quantization (NDVQ). NDVQ improves the robustness and generalization ability of the model by introducing learnable variances and establishing an explicit safety margin for each code in the codebook. Specifically, NDVQ maps the waveform to a latent space and quantizes it by selecting the most likely normal distribution, where each codebook entry is a unique normal distribution defined by its mean and variance. With this method, NDVQ can significantly improve audio quality in extremely low-bandwidth scenarios and performs well on zero-shot text-to-speech (TTS) tasks.

### Main problem summary:

1. **Perceptual quality and signal distortion**: Existing models perform poorly in terms of perceptual quality and signal distortion under extremely low-bandwidth conditions.
2. **Codebook sensitivity to noise**: Traditional VQ methods are sensitive to noise, which easily leads to codebook collapse and low codebook utilization.
3. **Impact on downstream tasks**: These problems degrade tasks such as codec-based speech synthesis.

### Solutions:

- **Introduce normal distribution-based vector quantization (NDVQ)**: Learn a variance for each code and establish an explicit safety margin around it, improving the robustness and generalization ability of the model.
- **Improve the quantization process**: Use the probability density function to select the most likely distribution for each latent vector, then sample from it with the reparameterization technique to obtain the quantization result.
- **Optimize the training objective**: Train with a combination of reconstruction loss, discriminative loss, and the modified codebook loss to maintain model performance in low-bandwidth scenarios.

Through these innovations, NDVQ shows better audio quality and stronger robustness in extremely low-bandwidth scenarios, and performs especially well on zero-shot TTS tasks.
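
The quantization step described above (pick the most likely normal distribution, then sample via reparameterization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a diagonal covariance per code, a log-variance parameterization, and a single codebook. The names `ndvq_quantize`, `means`, and `log_vars` are hypothetical, and returning the mean when no random generator is passed is a convenience added for this example.

```python
import numpy as np


def ndvq_quantize(z, means, log_vars, rng=None):
    """Illustrative distribution-based quantization (assumed diagonal Gaussians).

    z:        (N, D) latent vectors from an encoder
    means:    (K, D) per-code means
    log_vars: (K, D) per-code log-variances (assumed parameterization)
    rng:      optional np.random.Generator; if None, the selected mean is returned
    """
    diff = z[:, None, :] - means[None, :, :]  # (N, K, D)
    # Log-density of each latent under each code's diagonal Gaussian.
    log_pdf = -0.5 * np.sum(
        log_vars[None, :, :]
        + diff ** 2 / np.exp(log_vars)[None, :, :]
        + np.log(2.0 * np.pi),
        axis=-1,
    )  # (N, K)
    idx = np.argmax(log_pdf, axis=1)  # most likely code per latent
    mu = means[idx]
    if rng is None:
        return idx, mu
    # Reparameterization: mu + sigma * eps, with eps ~ N(0, I).
    sigma = np.exp(0.5 * log_vars[idx])
    eps = rng.standard_normal(mu.shape)
    return idx, mu + sigma * eps


# Hypothetical toy codebook: two codes in a 2-D latent space, unit variance.
means = np.array([[0.0, 0.0], [5.0, 5.0]])
log_vars = np.zeros((2, 2))
z = np.array([[0.1, -0.2], [4.8, 5.3]])

idx, q_mean = ndvq_quantize(z, means, log_vars)
_, q_sample = ndvq_quantize(z, means, log_vars, rng=np.random.default_rng(0))
print(idx, q_mean.shape, q_sample.shape)
```

Because selection uses likelihood rather than Euclidean distance, a code with a large variance can claim a latent that lies closer to the mean of a narrow code. In this sketch, that is how the learned variance shapes the region each code covers.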