Abstract: Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, which is a fast, lightweight, and robust zero-shot text-to-speech system based on mobile devices for the first time. Specifically: 1) leveraging discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt duration from the prompt speech and incorporate text, prompt speech by cross attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in terms of generating speed and speech quality. MobileSpeech achieves RTF of 0.09 on a single A100 GPU and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at https://mobilespeech.github.io/.
What problem does this paper attempt to address?
This paper aims to build a fast, high-quality zero-shot text-to-speech (TTS) synthesis system that runs on mobile devices. Existing zero-shot TTS systems are mainly deployed in the cloud. Although they can generate high-quality speech, they fall short in inference speed, model size, and robustness. The paper therefore proposes MobileSpeech, a fast, lightweight, and robust zero-shot TTS system designed specifically for mobile devices.
### Main Problems and Challenges
1. **Inference Speed**: The real-time factor (RTF) of existing zero-shot TTS systems on mobile devices is usually much higher than that on high-performance GPUs, and the inference speed needs to be significantly improved.
2. **Model Size**: In order to be deployed on mobile or edge devices, the model must be small enough, and the runtime memory footprint should also be as low as possible.
3. **High Similarity and Diversity**: Zero-shot TTS systems need to be able to clone voice timbre and intonation from a prompt of only a few seconds and generate diverse voices from the same text input.
4. **High Quality**: To improve the naturalness of the synthesized speech, the model needs to capture fine details, such as the frequency bins between adjacent harmonics, and needs strong duration modeling capabilities.
5. **Robustness**: The system should minimize the occurrence of missing or repeating words.
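
The real-time factor in item 1 is commonly defined as synthesis time divided by the duration of the generated audio, so a value below 1 means faster than real time. A minimal sketch follows; the function name and the timing numbers are hypothetical, chosen only to illustrate what an RTF of 0.09 means.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: seconds of compute per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical example: 0.9 s of compute to produce 10 s of speech.
rtf = real_time_factor(0.9, 10.0)
print(f"RTF = {rtf:.2f}")
```

In practice the synthesis time would be measured around the full text-to-waveform call on the target device, since the same model can show very different RTF values on a GPU and on a phone.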
### Solutions
To achieve the above goals, the paper proposes a series of innovations:
1. **Speech Mask Decoder (SMD) Module**:
- A parallel speech mask decoder is built on the hierarchical structure of the discrete codec, with weighting mechanisms across the different codec layers during generation.
- A high-level probabilistic mask is introduced to simulate the progression of information flow from less to more during speech generation, bridging the gap between text and speech.
2. **Speaker Prompt Module**:
- Extract fine-grained prompt durations from the prompt speech and incorporate the text and prompt speech into the SMD through a cross-attention mechanism.
- Abandon coarse-grained voice prompt guidance and instead place fine-grained prompt duration values in front of the duration predictor.
3. **Training and Inference Optimization**:
- Use the mean-squared-error (MSE) loss to optimize the duration predictor and duration extractor.
- Use the cross-entropy loss function to optimize the SMD module, considering both the first channel and randomly selected discrete acoustic tokens.
- In the inference stage, generate acoustic tokens for the first channel through a confidence-sampling scheme and use a greedy strategy to generate tokens for the remaining channels.
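
The confidence-sampling step for the first channel can be illustrated with a toy iterative parallel decoder: all positions start masked, and each round commits the most confident predictions while the rest stay masked for the next round. This is a sketch under stated assumptions, not the paper's implementation. The cosine schedule, the `MASK` sentinel, the codebook size of 1024, and `toy_predict` (a random stand-in for the real decoder) are all hypothetical. The greedy decoding of the remaining channels is not shown.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that is still masked


def toy_predict(tokens, rng):
    """Stand-in for the decoder: one (token, confidence) pair per position."""
    return [(rng.randrange(1024), rng.random()) for _ in tokens]


def confidence_decode(length, steps=4, seed=0, predict=toy_predict):
    """Fill `length` masked positions over `steps` parallel rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Assumed cosine schedule: how many positions remain masked
        # after this round (reaches 0 at the final step).
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        preds = predict(tokens, rng)
        # Commit the most confident predictions among masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = max(0, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


first_channel = confidence_decode(length=20, steps=4)
print(first_channel)
```

Because every round predicts all positions in parallel, the number of decoder calls is fixed by `steps` rather than by the sequence length, which is where the speed advantage over autoregressive decoding comes from.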
### Experimental Results
The authors ran experiments on multiple datasets, including English and Mandarin datasets, to verify the effectiveness of MobileSpeech. The results show that MobileSpeech reaches the state of the art in generation speed, speech quality, and robustness.
- **Generation Speed**: MobileSpeech achieved an RTF of 0.09 on a single A100 GPU, which is 11 times faster than VALL-E and 4 times faster than MegaTTS and NaturalSpeech2.
- **Speech Quality**: In the MOS score, MobileSpeech performs well in terms of quality, rhythm, and voice timbre similarity.
- **Robustness**: MobileSpeech performs best in the WER metric and even outperforms VALL-E.
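
Word error rate (WER) is the robustness measure here because missing or repeated words show up directly as deletions or insertions. A minimal sketch of the standard definition, word-level edit distance divided by the number of reference words, is shown below. The function name and example sentences are illustrative only, and real evaluations typically normalize text and transcribe the synthesized audio with an ASR model first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (word missing in hypothesis)
                cur[j - 1] + 1,          # insertion (extra or repeated word)
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# A dropped word and a repeated word each count as one error.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(word_error_rate("the cat sat", "the the cat sat"))
```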
In conclusion, through the design of the efficient SMD module and Speaker Prompt module, the paper has successfully implemented a fast and high-quality zero-shot TTS system on mobile devices.