Abstract: Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, and both are often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. The study explores a novel approach that uses machine learning, specifically an LSTM neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those who need assistance to speak. A computer that can also convey meaning through a spoken announcement will improve human-to-computer interactions. Applications of such an algorithm may include improving high-definition audio codecs for telephony, renewing old recordings, and lowering barriers to the use of computing. This research deployed a prototype modular platform for digital speech improvement: algorithms were analyzed and generalized into a modular system, and laboratory experiments were used to optimize their combinations and their performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.
