Continuous Speech Tokenizer in Text To Speech

Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang
2024-10-22
Abstract: The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability; they are convenient for joint training with text and offer good compression efficiency. However, we found that the discrete speech tokenizer still suffers from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MoS). This enhancement is attributed to the continuous speech tokenizer's better information preservation rate across both low and high frequencies in the frequency domain.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses information loss in the discrete speech tokenizers used for Text-to-Speech (TTS). Specifically, a discrete speech tokenizer may lose information in both the low-frequency and high-frequency components of the signal, which affects the quality and continuity of the generated speech. The authors therefore propose a TTS model based on a continuous speech tokenizer, aiming to improve the quality and naturalness of the generated speech by retaining more information.

### Main contributions of the paper:

1. **Proposing a continuous speech tokenizer**: This tokenizer produces continuous speech representations instead of the traditional discrete ones. This helps preserve the information in the speech signal, especially across different frequency ranges.
2. **Constructing a TTS model based on a continuous speech tokenizer**: The model treats TTS as an autoregressive continuous speech token generation task, thereby improving the quality of the generated speech.
3. **Verifying the advantages of the continuous speech tokenizer**: Through experiments, the authors show the advantages of the continuous speech tokenizer in information retention rate and sampling rate robustness, especially for the high-frequency part of the signal.

### Experimental results:

- **WER (Word Error Rate)**: On the LibriTTS dataset, the proposed model outperforms the baseline model on the WER metric.
- **SIM (Speaker Similarity)**: The continuous speech tokenizer also performs better on speaker similarity.
- **EMoS (Estimated Mean Opinion Score)**: Speech generated with the continuous speech tokenizer scores significantly higher than the baseline model on the EMoS metric.
- **CLVP Score**: The continuous speech tokenizer also outperforms the baseline model on the CLVP Score.
- **STOI (Short-Time Objective Intelligibility)**: The continuous speech tokenizer performs better on the STOI metric, reflecting higher clarity of the generated speech.
- **Noisiness, Continuity, Loudness Quality, and Naturalness**: These metrics also indicate that speech generated with the continuous speech tokenizer is better in terms of noise, continuity, loudness, and naturalness.

### Conclusion:

The paper verifies, through experiments, the effectiveness and advantages of the continuous speech tokenizer in the TTS task, especially in information retention and the quality of the generated speech. The authors also discuss directions for future research, including applications and training challenges in multimodal large language models.
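
For readers unfamiliar with the WER metric listed in the experimental results, it is conventionally defined as the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation pipeline. The function name `word_error_rate`, the whitespace tokenization, and the lowercase normalization are all assumptions made for this example; the summary does not say which speech recognizer or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Tokenization (whitespace split) and normalization (lowercasing) are
    illustrative assumptions, not details taken from the paper.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # In a TTS evaluation, the hypothesis would typically come from running a
    # speech recognizer on the synthesized audio; here it is a fixed string.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value means the synthesized speech was transcribed closer to the input text. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.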