Towards Zero-Shot Text-To-Speech for Arabic Dialects

Khai Duy Doan, Abdul Waheed, Muhammad Abdul-Mageed
2024-07-07
Abstract: Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; for other languages, however, progress still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS model, an open-source architecture (see https://docs.coqui.ai/en/latest/models/xtts.html, https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f, and https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc). We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance, and the models are capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses zero-shot multi-speaker text-to-speech (ZS-TTS) for Arabic dialects. Specifically:

1. **Dataset Adaptation**: The authors first adapt QASR, an existing large-scale dataset, to suit the needs of Arabic speech synthesis.
2. **Dialect Identification**: The authors employ a set of Arabic dialect identification models to explore how pre-defined dialect labels affect the ZS-TTS model in a multi-dialect setting.
3. **Model Fine-Tuning**: Building on the above, the authors fine-tune the open-source XTTS model, including variants that incorporate the supplementary Arabic dialect labels.
4. **Model Evaluation**: Finally, the authors run both automated and human evaluations on a dataset of 31 unseen speakers and on an in-house dialectal dataset. The results show convincing performance, and the models are capable of generating dialectal speech.

Through these efforts, the paper addresses a research gap in Arabic ZS-TTS and highlights significant potential for improvement in this emerging research direction.
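The dialect-labeling idea in item 2 can be illustrated with a small sketch. The summary does not say how the outputs of the several dialect identification models are combined, so the majority vote below is an assumption rather than the authors' method. The function names (`vote_dialect`, `tag_sample`), the label strings (`"EGY"`, `"LEV"`), and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter
from typing import Optional


def vote_dialect(predictions: list[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the label predicted by more than `min_agreement` of the models.

    Assumption: each dialect identification model contributes one label string.
    Returns None when there are no predictions or no label clears the threshold.
    """
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) > min_agreement else None


def tag_sample(text: str, predictions: list[str]) -> dict:
    """Attach a dialect label to one training sample, or "unknown" if the vote fails."""
    label = vote_dialect(predictions)
    return {"text": text, "dialect": label or "unknown"}


if __name__ == "__main__":
    # Hypothetical labels from three hypothetical models for one utterance.
    print(tag_sample("example transcript", ["EGY", "EGY", "LEV"]))
```

Samples on which the models disagree fall back to `"unknown"`, which keeps noisy labels out of the fine-tuning data at the cost of leaving some utterances unlabeled.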