Abstract: Deep learning-based speech synthesis has advanced by employing a sequence-to-sequence (seq2seq) structure with an attention mechanism. A seq2seq speech synthesis model pairs an encoder that delivers linguistic features with a decoder that predicts the mel-spectrogram, and it learns the alignment between text and speech through the attention mechanism. The decoder predicts the mel-spectrogram through an autoregressive flow that considers the current input and what it has learned from previous inputs. This is beneficial when processing sequential data, as in speech synthesis. However, the recursive generation of speech typically requires extensive training time and slows synthesis. To overcome these obstacles, we propose a non-autoregressive framework for fully parallel deep convolutional neural speech synthesis. First, we design a new synthesis paradigm that integrates a time-varying metatemplate (TVMT), whose length is modeled with a separate conditional distribution, to prepare the decoder input. The decoding step converts the TVMT into spectral features, which eliminates the autoregressive flow. Second, we propose a structure that uses multiple decoders interconnected by up-down chains with an iterative attention mechanism. The decoder chains distribute the burden of decoding, progressively infusing the information obtained from the training target example into the chains to refine the predicted spectral features at each decoding step. For each decoder, the attention mechanism is applied repeatedly to produce an elaborated alignment between the linguistic features and the TVMT, which is gradually transformed into the spectral features. The proposed architecture substantially improves the synthesis speed, and the resulting speech quality is superior to that of a conventional autoregressive model.
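The non-autoregressive idea in the abstract can be illustrated with a minimal sketch. It is not the paper's architecture: the abstract does not say how the TVMT is constructed, how its length distribution is modeled, or how the up-down chains are wired, so `make_template`, `attend`, and `refine` are hypothetical names, the template is an assumed sinusoidal position feature, the length is simply passed in, and each "decoder" is reduced to one attention step plus a residual update. The only point shown is that every output frame is computed from the template and the encoder output, never from a previously generated frame.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention; each query row is handled independently."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def make_template(length, dim):
    """Assumed stand-in for the TVMT: sinusoidal features indexed by time."""
    return [[math.sin((t + 1) / (10.0 ** (d / dim))) for d in range(dim)]
            for t in range(length)]

def refine(frames, encoder_out, num_decoders=3):
    """Apply attention repeatedly, one pass per (simplified) decoder.

    Each pass re-aligns the current frames with the linguistic features
    and adds the attended context back in, so the template is gradually
    transformed. No frame reads another frame's output.
    """
    for _ in range(num_decoders):
        context = attend(frames, encoder_out, encoder_out)
        frames = [[f + c for f, c in zip(fr, cx)]
                  for fr, cx in zip(frames, context)]
    return frames

# Toy linguistic features for three input tokens (made-up values).
enc = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.7, 0.1, 0.0]]
template = make_template(length=5, dim=4)  # length is given here, not predicted
frames = refine(template, enc)
```

Because no frame depends on another, refining a single template row alone gives the same values as refining it as part of the full sequence, which is what lets all frames be computed at once. An autoregressive decoder would instead need frame t-1 before it could produce frame t.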

Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning

LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus

A generative framework for conversational laughter: Its 'language model' and laughter sound synthesis

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

LaughTalk: Expressive 3D Talking Head Generation with Laughter

Can a robot laugh with you?: Shared laughter generation for empathetic spoken dialogue

Laugh Betrays You? Learning Robust Speaker Representation From Speech Containing Non-Verbal Fragments

A New Perspective on Smiling and Laughter Detection: Intensity Levels Matter

Laughter and smiling facial expression modelling for the generation of virtual affective behavior

Generating Diverse Realistic Laughter for Interactive Art

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

Happy or Evil Laughter? Analysing a Database of Natural Audio Samples

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild

Analysis of Co-Laughter Gesture Relationship on RGB videos in Dyadic Conversation Contex

Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis

Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis