Abstract: Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on several fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving recent attention is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, and both are often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. It explores a novel approach that uses machine learning, specifically an LSTM neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those who need assistance to speak. A computer that can also convey meaning through a spoken announcement will improve human-to-computer interactions. Applications of such an algorithm may include improving high-definition audio codecs for telephony, renewing old recordings, and lowering barriers to the use of computing. This research deployed a prototype modular platform for digital speech improvement, analyzing and generalizing algorithms into a modular system and using laboratory experiments to optimize combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.

Voice Synthesis Improvement by Machine Learning of Natural Prosody
