Abstract: Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, and both are often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. The study explores a novel approach that uses machine learning, specifically an LSTM neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech output, to enhance electronic reading applications, and to improve artificial voices for those who need assistance to speak. A computer that can also convey meaning through its spoken announcements will improve human-to-computer interaction. Applications of such an algorithm may include improving high-definition audio codecs for telephony, restoring old recordings, and lowering barriers to the use of computing. This research deployed a prototype modular platform for digital speech improvement, analyzing and generalizing algorithms into a modular system and using laboratory experiments to optimize combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.

Voice Synthesis Improvement by Machine Learning of Natural Prosody
