Abstract: Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, the first fast, lightweight, and robust zero-shot text-to-speech system designed for mobile devices. Specifically: 1) Leveraging a discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt durations from the prompt speech and incorporate the text and prompt speech via cross-attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in terms of generation speed and speech quality. MobileSpeech achieves an RTF of 0.09 on a single A100 GPU, and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at https://mobilespeech.github.io/.
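For reference, the real-time factor (RTF) quoted in the abstract is conventionally the ratio of synthesis time to the duration of the generated audio, so an RTF of 0.09 corresponds to roughly 0.09 s of compute per second of speech. The sketch below shows one way to measure it; the function names and the `synthesize` callable are hypothetical and are not taken from the paper.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text: str, sample_rate: int) -> float:
    """Time one call to a TTS function and return its RTF.

    `synthesize` is a hypothetical callable that maps text to a sequence
    of audio samples; it stands in for any TTS system.
    """
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)
```

An RTF below 1 means the system generates speech faster than it plays back. Measured values depend heavily on hardware, which is why the paper states the GPU alongside the figure.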

What problem does this paper attempt to address?

The problem that this paper attempts to solve is to implement a fast and high-quality zero-shot text-to-speech (TTS) synthesis system on mobile devices. Specifically, existing zero-shot TTS systems are mainly deployed in the cloud. Although these systems can generate high-quality speech, they have deficiencies in inference speed, model size, and robustness. Therefore, the paper proposes MobileSpeech, which is a fast, lightweight, and robust zero-shot TTS system designed specifically for mobile devices.

### Main Problems and Challenges

1. **Inference Speed**: The real-time factor (RTF) of existing zero-shot TTS systems on mobile devices is usually much higher than that on high-performance GPUs, and the inference speed needs to be significantly improved.
2. **Model Size**: In order to be deployed on mobile or edge devices, the model must be small enough, and the runtime memory footprint should also be as low as possible.
3. **High Similarity and Diversity**: Zero-shot TTS systems need to be able to clone voice timbre and intonation with few-second prompts and generate diverse voices with the same text input.
4. **High Quality**: In order to improve the naturalness of the synthesized speech, the model needs to pay attention to details, such as the frequency bins between adjacent harmonics, and needs strong duration modeling capabilities.
5. **Robustness**: The system should minimize the occurrence of missing or repeated words.

### Solutions

To achieve the above goals, the paper proposes a series of innovations:

1. **Speech Mask Codec Decoder (SMD) Module**:
   - Using the hierarchical structure of the discrete codec, a parallel speech mask decoder module is designed.
   - High-level probability masks are introduced to simulate the process of information flow from less to more, in order to bridge the gap between text and speech.
2. **Speaker Prompt Module**:
   - Extract fine-grained prompt durations from the prompt speech and combine the text and prompt speech in the SMD through a cross-attention mechanism.
   - Abandon coarse-grained voice prompt guidance and instead use fine-grained prompt duration values in front of the duration predictor.
3. **Training and Inference Optimization**:
   - Use the mean-squared-error (MSE) loss to optimize the duration predictor and duration extractor.
   - Use the cross-entropy loss function to optimize the SMD module, considering both the first channel and randomly selected discrete acoustic tokens.
   - In the inference stage, generate acoustic tokens for the first channel through a confidence-sampling scheme and use a greedy strategy to generate tokens for the remaining channels.

### Experimental Results

The paper conducted experiments on multiple datasets, including English and Mandarin datasets, to verify the effectiveness of MobileSpeech. The experimental results show that MobileSpeech has reached the state-of-the-art level in terms of generation speed, speech quality, and robustness.

- **Generation Speed**: MobileSpeech achieved an RTF of 0.09 on a single A100 GPU, which is 11 times faster than VALL-E and 4 times faster than MegaTTS and NaturalSpeech2.
- **Speech Quality**: In the MOS score, MobileSpeech performs well in terms of quality, rhythm, and voice timbre similarity.
- **Robustness**: MobileSpeech performs best in the WER metric and even outperforms VALL-E.

In conclusion, through the design of the efficient SMD module and Speaker Prompt module, the paper has successfully implemented a fast and high-quality zero-shot TTS system on mobile devices.
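The inference strategy summarized above (confidence-based sampling for the first codec channel, greedy decoding for the remaining channels) can be illustrated with a toy sketch. This is not the paper's implementation: the cosine unmasking schedule, the number of steps, the `MASK` id, and the random `toy_predict` stand-in for the SMD network are all assumptions made only for illustration.

```python
import math
import random

MASK = -1  # hypothetical id marking a position that is still masked


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the decoder network (assumption, not the real SMD).

    Returns one (token, confidence) pair per position. A real model would
    condition on the text, the speaker prompt, and the unmasked tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def confidence_decode(length, vocab_size, steps=4, seed=0):
    """Iteratively unmask first-channel tokens, most confident first."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Assumed cosine schedule: how many positions stay masked after
        # this step. It reaches zero at the final step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        preds = toy_predict(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions; the rest stay masked
        # and are re-predicted in the next step.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = max(0, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def greedy_channel(scores_per_position):
    """Greedy decoding: take the highest-scoring token at each position."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in scores_per_position]
```

Because every position is predicted in parallel at each step, the number of network calls depends on `steps` rather than on the sequence length, which is the general reason masked parallel decoders tend to be faster than autoregressive ones.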

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

FlashSpeech: Efficient Zero-Shot Speech Synthesis

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

MRMI-TTS: Multi-reference audios and Mutual Information Driven Zero-shot Voice cloning

A GPU-accelerated Real-Time Human Voice Separation Framework for Mobile Phones

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

EfficientSpeech: An On-Device Text to Speech Model

Zero-shot Cross-lingual Voice Transfer for TTS

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens