Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning

Hawraz A. Ahmad, Tarik A. Rashid
DOI: https://doi.org/10.1016/j.jksuci.2024.102131
IF: 9.006
2024-09-01
Journal of King Saud University - Computer and Information Sciences
Abstract: Text-to-speech (TTS) synthesis is the process of converting natural-language text into speech. A major challenge for speech synthesisers is recognising the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking) and length. Continuous speech features, in contrast, are influenced by the personality and emotions of the artist. A database is maintained to store the synthesised speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, text-to-speech synthesis has been heavily shaped by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of deep-learning-based models and architectures and discusses the datasets used in the TTS process. It also covers the commonly used evaluation metrics. The paper ends with a look at future directions for the field and identifies some deep learning models that give promising results.
computer science, information systems