Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer

Slava Shechtman, Avihu Dekel
DOI: https://doi.org/10.21437/Interspeech.2024-2366
2024-10-11
Abstract: Discrete audio codecs (or audio tokenizers) have recently regained interest due to the ability of Large Language Models (LLMs) to learn their compressed acoustic representations. Various publicly available trainable discrete tokenizers recently demonstrated impressive results for audio tokenization, yet they mostly require high token rates to gain high-quality reconstruction. In this study, we fine-tuned an open-source general audio RVQGAN model using diverse open-source speech data, considering various recording conditions and quality levels. The resulting wideband (24kHz) speech-only model achieves speech reconstruction that is nearly indistinguishable from PCM (pulse-code modulation) with a rate of 150-300 tokens per second (1500-3000 bps). The evaluation used comprehensive English speech data encompassing different recording conditions, including studio settings. Speech samples are made publicly available at http://ibm.biz/IS24SpeechRVQ. The model is officially released at https://huggingface.co/ibm/DAC.speech.v1.0
Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to reduce the token rate, and therefore the bit rate, of a discrete speech codec (also known as an audio tokenizer) while maintaining high-quality speech reconstruction. Existing discrete audio tokenizers can achieve high-quality audio reconstruction, but they usually require relatively high token rates (for example, more than 600 tokens per second), which makes efficient training of large language models (LLMs) on their output difficult. The goal is therefore to achieve speech reconstruction that is almost indistinguishable from the original pulse-code modulation (PCM) signal at a lower bit rate (1500-3000 bps, that is, 150-300 tokens per second) by fine-tuning an existing RVQGAN model.

### Main contributions

1. **High-quality speech reconstruction at a low bit rate**: The authors achieved high-quality speech reconstruction in a low-bit-rate setting (1500-3000 bps) by fine-tuning the general-purpose audio DAC model on diverse open-source speech datasets.
2. **Extensive evaluation**: Evaluations on multiple English speech datasets demonstrated high-quality reconstruction with the 1.5 kbps model and perceptually transparent reconstruction with the 3 kbps model.
3. **Ablation study**: Detailed ablation experiments explored how training data of different quality levels and recording conditions affects model performance, with evaluations on various test data.

### Method overview

- **Model architecture**: Based on RVQGAN (Residual Vector Quantization with Generative Adversarial Network), a neural network architecture that combines residual vector quantization with a generative adversarial network.
- **Data selection**: Multiple high-quality, medium-quality, and low-quality English speech datasets were used so that the model performs well under different recording conditions.
- **Training details**: Training closely follows the procedure of the original DAC model, except that quantizer dropout is removed during fine-tuning to avoid degrading the performance of the pre-trained model.

### Results

- **Objective evaluation**: Model performance was evaluated with multiple objective metrics (such as mel loss, STFT loss, PESQ, and STOI). The results showed significant improvements on all test sets, especially when using a smaller number of quantization codebooks.
- **Subjective evaluation**: In a MUSHRA test, 16 subjects rated the 4-codebook (3 kbps) and 2-codebook (1.5 kbps) models. The results indicated that the perceptual difference between the output of the retrained 4-codebook system and the original recording was not significant.

### Summary

This paper proposes an improved RVQGAN audio codec that is specifically optimized for speech data and achieves high-quality speech reconstruction at a lower bit rate. The study also emphasizes the importance of balanced training data and verifies through ablation experiments that data selection has a significant impact on model performance.
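
The link between codebook count, token rate, and bit rate quoted above can be made concrete with a minimal sketch of residual vector quantization. This is an illustration only, not the released model's code: the function names are hypothetical, the toy codebooks are made up, and the 75 Hz frame rate and 1024-entry codebook size are assumptions inferred from the quoted figures (150 tokens per second with 2 codebooks, and 10 bits per token) rather than values stated in the text.

```python
import math


def nearest(vec, codebook):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def rvq_encode(vec, codebooks):
    """Quantize one latent frame: each stage codes the residual left by the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the frame as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


def bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second = frames/s * tokens/frame * bits/token."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


if __name__ == "__main__":
    # Toy 2-stage quantizer with 2-dimensional codewords (hypothetical values).
    toy_codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.25, -0.25]],
    ]
    codes = rvq_encode([1.2, 0.8], toy_codebooks)
    print(codes, rvq_decode(codes, toy_codebooks))

    # Assumed 75 frames/s and 1024-entry codebooks (10 bits per token).
    print(bitrate_bps(75, 2, 1024))  # 1500.0
    print(bitrate_bps(75, 4, 1024))  # 3000.0
```

Under these assumptions, halving the number of codebooks halves both the token rate and the bit rate, which is why the 2-codebook and 4-codebook configurations correspond to 1.5 kbps and 3 kbps respectively.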