Abstract: Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, and both are often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. The study explores a novel approach that uses machine learning, specifically a long short-term memory (LSTM) neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those who need assistance to speak. A computer that can also convey meaning through spoken announcements will improve human-to-computer interactions. Applications of such an algorithm may include improving high-definition audio codecs for telephony, renewing old recordings, and lowering barriers to the use of computing. This research deployed a prototype modular platform for digital speech improvement, analyzing and generalizing algorithms into a modular system and using laboratory experiments to optimize combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.

Exploring expressivity and emotion with artificial voice and speech technologies

Expressivity and Speech Synthesis

Toward Synthesizing Expressive Mandarin Speech

Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis

Visuo-auditory Multimodal Emotional Structure to Improve Human-Robot-Interaction

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

Emotional Talking Agent: System and Evaluation

Vocal emotion of humanoid robots: a study from brain mechanism

Developing conversational Virtual Humans for social emotion elicitation based on large language models

Emotional Storytelling Using Virtual and Robotic Agents

Voice Cloning Using Artificial Intelligence and Machine Learning: A Review

The Acoustically Emotion-Aware Conversational Agent with Speech Emotion Recognition and Empathetic Responses

Emotional Prosody Control for Speech Generation

An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era

EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

Affective Voice Interaction and Artificial Intelligence: A Research Study on the Acoustic Features of Gender and the Emotional States of the PAD Model

Voice Synthesis Improvement by Machine Learning of Natural Prosody

Emo-TTS: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

Expression in Speech: Natural and Synthetic

Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems