Abstract: Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components, intonation and rhythm, and both are often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. The study explores a novel approach that uses machine learning, specifically an LSTM neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those who need assistance to speak. A computer that can also convey meaning through a spoken announcement will improve human-to-computer interaction. Applications of such an algorithm may include improving high-definition audio codecs for telephony, restoring old recordings, and lowering barriers to the use of computing. This research deployed a prototype modular platform for digital speech improvement by analyzing and generalizing algorithms into a modular system, using laboratory experiments to optimize combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.

Voice Synthesis Improvement by Machine Learning of Natural Prosody