Abstract: Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, both of which are often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. The study explores a novel approach to adding prosody: using machine learning, specifically an LSTM neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those who need assistance to speak. A computer that can also convey meaning through a spoken announcement will improve human-to-computer interactions. Applications of such an algorithm may include improving high-definition audio codecs for telephony, renewing old recordings, and lowering barriers to the use of computing. This research deployed a prototype modular platform for digital speech improvement, built by analyzing and generalizing algorithms into a modular system and refined through laboratory experiments to optimize algorithm combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.
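
As a rough illustration of the kind of model the abstract refers to, the sketch below runs a single-layer LSTM over a sequence of per-phoneme feature vectors and reads out two prosodic values per step, one for intonation (a pitch offset) and one for rhythm (a duration scale). It is not the authors' implementation: the class name `ProsodyLSTM`, the feature layout, the layer sizes, and the two-value read-out are all assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ProsodyLSTM:
    """Hypothetical LSTM mapping phoneme features to (pitch, duration) pairs."""

    GATES = ("input", "forget", "candidate", "output")

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        # One weight matrix and one bias vector per gate.
        self.w = {k: make_matrix(n_hidden, n_in + n_hidden, rng) for k in self.GATES}
        self.b = {k: [0.0] * n_hidden for k in self.GATES}
        # Linear read-out to two values: pitch offset and duration scale.
        self.w_out = make_matrix(2, n_hidden, rng)

    def step(self, x, h, c):
        """Advance the cell by one time step and return the new (h, c)."""
        z = x + h  # concatenate the input with the previous hidden state
        pre = {
            k: [a + b for a, b in zip(matvec(self.w[k], z), self.b[k])]
            for k in self.GATES
        }
        i_gate = [sigmoid(v) for v in pre["input"]]
        f_gate = [sigmoid(v) for v in pre["forget"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        o_gate = [sigmoid(v) for v in pre["output"]]
        # The cell state carries context across the utterance.
        c_new = [f * cv + i * g for f, cv, i, g in zip(f_gate, c, i_gate, cand)]
        h_new = [o * math.tanh(cv) for o, cv in zip(o_gate, c_new)]
        return h_new, c_new

    def predict(self, sequence):
        """Return one (pitch_offset, duration_scale) pair per input step."""
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        outputs = []
        for x in sequence:
            h, c = self.step(list(x), h, c)
            pitch, duration = matvec(self.w_out, h)
            outputs.append((pitch, duration))
        return outputs


# Usage: two phonemes, each described by three made-up features.
model = ProsodyLSTM(n_in=3, n_hidden=4)
contour = model.predict([[1.0, 0.0, 0.5], [0.0, 1.0, 0.2]])
```

In a real system the predicted values would be applied to the pitch track and segment durations of the recorded or generated voice, and the weights would be fitted to prosody measured from natural speech.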

Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling

Enhancing audio quality for expressive Neural Text-to-Speech

High Fidelity Speech Synthesis with Adversarial Networks

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

Multi-modal Adversarial Training for Zero-Shot Voice Cloning

Better speech synthesis through scaling

Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Voice Synthesis Improvement by Machine Learning of Natural Prosody

A Synthetic Corpus Generation Method for Neural Vocoder Training

Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control

Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech

Exploring synthetic data for cross-speaker style transfer in style representation based TTS

Text-aware and Context-aware Expressive Audiobook Speech Synthesis