Abstract: Discrete audio codecs (or audio tokenizers) have recently regained interest due to the ability of Large Language Models (LLMs) to learn their compressed acoustic representations. Various publicly available trainable discrete tokenizers recently demonstrated impressive results for audio tokenization, yet they mostly require high token rates to achieve high-quality reconstruction. In this study, we fine-tuned an open-source general audio RVQGAN model using diverse open-source speech data, considering various recording conditions and quality levels. The resulting wideband (24 kHz) speech-only model achieves speech reconstruction that is nearly indistinguishable from PCM (pulse-code modulation) at a rate of 150-300 tokens per second (1500-3000 bps). The evaluation used comprehensive English speech data encompassing different recording conditions, including studio settings. Speech samples are publicly available at http://ibm.biz/IS24SpeechRVQ. The model is officially released at https://huggingface.co/ibm/DAC.speech.v1.0
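The token rates and bit rates quoted in the abstract can be related with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 16-bit mono PCM reference used for the compression ratio is an assumption, since the abstract does not state a PCM bit depth.

```python
import math

def bits_per_token(bitrate_bps: float, tokens_per_second: float) -> float:
    """Average number of bits carried by each token."""
    return bitrate_bps / tokens_per_second

def bitrate_bps(tokens_per_second: float, codebook_size: int) -> float:
    """Bit rate when every token indexes a codebook of the given size."""
    return tokens_per_second * math.log2(codebook_size)

# Figures quoted in the abstract: 150-300 tokens/s at 1500-3000 bps.
low = bits_per_token(1500, 150)    # 10.0 bits per token
high = bits_per_token(3000, 300)   # 10.0 bits per token

# 10 bits per token is what a 1024-entry codebook would give
# (an inference from the quoted rates, not a figure stated in the abstract).
implied_codebook_size = 2 ** int(low)   # 1024

# Assumption: the reference is 16-bit mono PCM at the stated 24 kHz.
pcm_bps = 24_000 * 16
ratio_at_3kbps = pcm_bps / 3000     # 128.0
ratio_at_1_5kbps = pcm_bps / 1500   # 256.0
```

Under that PCM assumption, the two operating points would correspond to roughly 128x and 256x fewer bits than the uncompressed signal.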

What problem does this paper attempt to address?

This paper aims to reduce the bit rate of a discrete speech codec (also known as an audio tokenizer) while maintaining high-quality speech reconstruction. Existing discrete audio tokenizers can achieve high-quality audio reconstruction, but they usually require a relatively high token rate (for example, more than 600 tokens per second), which makes it harder to train large language models (LLMs) efficiently on their output. The goal is therefore to achieve speech reconstruction that is almost indistinguishable from the original pulse-code modulation (PCM) signal at a lower bit rate (1500-3000 bps, that is, 150-300 tokens per second) by fine-tuning an existing RVQGAN model.

### Main contributions

1. **High-quality speech reconstruction at a low bit rate**: The authors achieved high-quality speech reconstruction in a low-bit-rate setting (1500-3000 bps) by fine-tuning the general-purpose audio DAC model on diverse open-source speech datasets.
2. **Extensive evaluation**: Evaluations were carried out on multiple English speech datasets, demonstrating high-quality reconstruction with the 1.5 kbps model and perceptually transparent reconstruction with the 3 kbps model.
3. **Ablation study**: Detailed ablation experiments explored how training data of different quality levels and recording conditions affects model performance, with evaluations on various test data.

### Method overview

- **Model architecture**: Based on RVQGAN (Residual Vector Quantization with Generative Adversarial Network), a neural network architecture that combines residual vector quantization with a generative adversarial network.
- **Data selection**: Multiple high-quality, medium-quality, and low-quality English speech datasets were used so that the model performs well under different recording conditions.
- **Training details**: Training strictly follows the procedure of the original DAC model, except that quantizer dropout is removed during fine-tuning to avoid degrading the performance of the pre-trained model.

### Results

- **Objective evaluation**: The model was evaluated with multiple objective metrics (such as mel loss, STFT loss, PESQ, and STOI). The results showed significant improvements on all test sets, especially when a smaller number of quantization codebooks is used.
- **Subjective evaluation**: In a MUSHRA test, 16 subjects rated the 4-codebook (3 kbps) and 2-codebook (1.5 kbps) models. The results indicated that the perceptual difference between the output of the retrained 4-codebook system and the original recording was not significant.

### Summary

This paper proposes an improved RVQGAN audio codec that is optimized specifically for speech data and achieves high-quality speech reconstruction at a lower bit rate. The study also emphasizes the importance of balanced training data and verifies through ablation experiments that data selection has a significant impact on model performance.
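To make the residual vector quantization step concrete, here is a minimal NumPy sketch of the general technique. It is not the DAC or RVQGAN implementation: the codebooks are random rather than learned, there is no encoder, decoder, or discriminator, and the names `rvq_encode` and `rvq_decode`, the 8-dimensional frames, and the per-stage scaling are all hypothetical. The 75 frames per second and 1024-entry codebooks are inferred from the figures above (300 tokens per second over 4 codebooks, 10 bits per token).

```python
import numpy as np

def rvq_encode(frames: np.ndarray, codebooks: list) -> np.ndarray:
    """Quantize (T, D) frames into (T, n_codebooks) integer tokens."""
    residual = frames.copy()
    indices = []
    for cb in codebooks:
        # Squared distance from every residual vector to every codeword: (T, K).
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        # The next stage only has to model what this stage missed.
        residual = residual - cb[idx]
    return np.stack(indices, axis=1)

def rvq_decode(tokens: np.ndarray, codebooks: list) -> np.ndarray:
    """Sum the selected codewords of each stage to rebuild the frames."""
    return sum(cb[tokens[:, q]] for q, cb in enumerate(codebooks))

rng = np.random.default_rng(0)
frames = rng.normal(size=(75, 8))  # one second at the inferred 75 frames/s

# Toy random codebooks with shrinking scale (a trained codec learns these).
codebooks = [rng.normal(size=(1024, 8)) * 0.5 ** q for q in range(4)]
for cb in codebooks:
    cb[0] = 0.0  # a zero codeword means a stage can never increase the error

tokens = rvq_encode(frames, codebooks)             # shape (75, 4)
recon_4 = rvq_decode(tokens, codebooks)            # all four stages
recon_2 = rvq_decode(tokens[:, :2], codebooks[:2]) # first two stages only

err_4 = float(((frames - recon_4) ** 2).mean())
err_2 = float(((frames - recon_2) ** 2).mean())
```

Decoding from only the first two token streams halves the token rate at the cost of a coarser reconstruction, which mirrors the trade-off between the 2-codebook (1.5 kbps) and 4-codebook (3 kbps) configurations.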

Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer

High-Fidelity Audio Compression with Improved RVQGAN

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space

Exploring the Benefits of Tokenization of Discrete Acoustic Units

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

RepCodec: A Speech Representation Codec for Speech Tokenization