Abstract: The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability; they are convenient for joint training with text and offer good compression efficiency. However, we found that discrete speech tokenizers still suffer from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MoS). This enhancement is attributed to the better information preservation rate of the continuous speech tokenizer across both low and high frequencies in the frequency domain.

What problem does this paper attempt to address?

The paper addresses information loss in the discrete speech tokenizers used for Text-to-Speech (TTS). Specifically, a discrete speech tokenizer may lose information in both the low-frequency and high-frequency components of the signal, which degrades the quality and continuity of the generated speech. The authors therefore propose a TTS model based on a continuous speech tokenizer, aiming to improve the quality and naturalness of the generated speech by retaining more information.

### Main contributions of the paper:

1. **Proposing a continuous speech tokenizer**: This tokenizer produces continuous speech representations instead of the traditional discrete ones, which helps preserve the information in the speech signal, especially across different frequency ranges.
2. **Constructing a TTS model based on a continuous speech tokenizer**: The model treats TTS as an autoregressive continuous speech token generation task, thereby improving the quality of the generated speech.
3. **Verifying the advantages of the continuous speech tokenizer**: Through experiments, the authors show the advantages of the continuous speech tokenizer in information retention rate and sampling rate robustness, especially for information retention in the high-frequency part.

### Experimental results:

- **WER (Word Error Rate)**: On the LibriTTS dataset, the proposed model outperforms the baseline model on the WER metric.
- **SIM (Speaker Similarity)**: The continuous speech tokenizer also performs better on speaker similarity.
- **EMoS (Estimated Mean Opinion Score)**: Speech generated with the continuous speech tokenizer scores significantly higher than the baseline model on the EMoS metric.
- **CLVP Score**: The continuous speech tokenizer also outperforms the baseline model on the CLVP Score.
- **STOI (Short-Time Objective Intelligibility)**: The continuous speech tokenizer performs better on the STOI metric, reflecting the higher intelligibility of the generated speech.
- **Noisiness, Continuity, Loudness Quality, and Naturalness**: These metrics also indicate that speech generated with the continuous speech tokenizer performs better in terms of noise, continuity, loudness, and naturalness.

### Conclusion:

The paper experimentally verifies the effectiveness and advantages of the continuous speech tokenizer in the TTS task, especially in terms of information retention and the quality of the generated speech. The authors also discuss directions for future research, including applications and training challenges in multimodal large language models.
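For readers unfamiliar with the first metric in the results, WER is conventionally the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The sketch below is a minimal standard-library illustration of that definition only. The function name `word_error_rate`, the whitespace tokenization, and the lowercasing are assumptions made for this example; the paper's actual evaluation pipeline (ASR model, text normalization) is not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which real evaluations may not use.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One word ("the") is missing from a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.3f}")
```

In a TTS evaluation, the hypothesis would typically be an ASR transcript of the synthesized audio and the reference the input text, so a lower WER suggests more intelligible speech. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.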

Continuous Speech Tokenizer in Text To Speech

A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners

SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Continuous Speech Synthesis using per-token Latent Diffusion

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

dMel: Speech Tokenization made Simple

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

TokSing: Singing Voice Synthesis based on Discrete Tokens

Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters

Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Semi-continuous Segmental Probability Modeling for Continuous Speech Recognition.