Abstract: At present, text-to-speech (TTS) systems that are trained on high-quality transcribed speech data using end-to-end neural models can generate speech that is intelligible, natural, and closely resembles human speech. These models are trained on relatively large amounts of professionally recorded single-speaker audio, typically extracted from audiobooks. Because freely available speech corpora of this kind are scarce for Arabic, a large gap exists in Arabic TTS research and development. Most of the existing freely available Arabic speech corpora are not suitable for TTS training, as they contain multi-speaker casual speech with variations in recording conditions and quality, whereas the corpora curated for speech synthesis are generally small and not suitable for training state-of-the-art end-to-end models. In a move towards filling this gap in resources, we present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic. The speech is extracted from a LibriVox audiobook and then processed, segmented, manually transcribed, and annotated. The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40,100 Hz (40.1 kHz). In this paper, we describe the process of corpus creation and provide details of corpus statistics and a comparison with existing resources. Furthermore, we develop two TTS systems based on Grad-TTS and Glow-TTS and illustrate the performance of the resulting systems via subjective and objective evaluations. The corpus will be made publicly available at http://www.clartts.com for research purposes, along with a demo of the baseline TTS systems.

ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus
