Abstract: We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We, therefore, focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbre and speech disfluency. Specifically, we deal with filled pauses, a major source of speech disfluency, which are known to play an important role in speech generation and communication in psychology and linguistics. To comparatively evaluate personalized filled pause insertion and non-personalized filled pause prediction methods, we developed a speech synthesis method with a non-personalized external filled pause predictor trained with a multi-speaker corpus. The results clarify the position-word entanglement of filled pauses, i.e., the necessity of precisely predicting positions for naturalness and the necessity of precisely predicting words for individuality on the evaluation of synthesized speech.

What problem does this paper attempt to address?

The paper attempts to address the issue of filled pause prediction and insertion in personalized spontaneous speech synthesis. Specifically, the researchers focus on how to naturally insert filled pauses in synthesized speech to enhance its naturalness and personal characteristics. Filled pauses are very common in spontaneous contexts and can convey important information about the speaker's thought process and communication fluency. Therefore, the researchers aim to develop a method that can personalize the prediction and insertion of filled pauses by incorporating linguistic knowledge, thereby achieving more natural and personalized spontaneous speech synthesis.

### Main Research Objectives:

1. **Personalized Spontaneous Speech Synthesis**: Not only to reproduce the individual's voice timbre but also to replicate the individual's speech disfluencies, especially filled pauses.
2. **Comparison of Personalized and Non-Personalized Filled Pause Prediction Methods**: Evaluate the effectiveness of personalized filled pause insertion (using real filled pauses) and non-personalized filled pause prediction (using a prediction model) through experiments.
3. **Exploration of the Impact of Filled Pause Position and Lexicon on Synthesized Speech**: Investigate the different roles of filled pause position and lexicon in terms of naturalness and individuality.

### Research Background:

- **Existing Technology**: Current text-to-speech synthesis technology in reading style has reached near-human levels, but spontaneous speech synthesis remains a challenge because it needs to handle speech disfluencies such as repetitions, rephrasing, and filled pauses.
- **Importance of Filled Pauses**: Filled pauses play an important role in planning speech and smoothing communication in spontaneous contexts, and different speakers have different choices and positions for filled pauses.

### Methodology:

- **Dataset**: Constructed a multi-speaker spontaneous speech corpus annotated with filled pauses.
- **Model**: Developed a spontaneous speech synthesis method based on a sequence-to-sequence (seq2seq) model, combined with an external filled pause prediction model.
- **Experimental Design**: Conducted subjective evaluation experiments to compare the performance of different methods in terms of naturalness, individuality, and listening effort.

### Experimental Results:

- **Preliminary Evaluation**: Inserting filled pauses significantly improved the performance of synthesized speech in terms of naturalness, individuality, and listening effort.
- **Effectiveness of Filled Pause Prediction**: Using a prediction model to insert filled pauses significantly improved the quality of synthesized speech compared to randomly inserting filled pauses.
- **Impact of Filled Pause Position and Lexicon**: Accurately predicting the position of filled pauses is crucial for improving naturalness, while accurately predicting the lexicon of filled pauses is more important for enhancing individuality.

### Conclusion:

The researchers demonstrated through experiments the importance of filled pauses in personalized spontaneous speech synthesis and proposed a filled pause prediction method that incorporates linguistic knowledge. Future work will focus on further improving the basic performance of spontaneous speech synthesis and automatically constructing spontaneous speech corpora.
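To make the distinction between filled pause position and filled pause lexicon concrete, here is a minimal Python sketch. It is not the authors' implementation: the `FilledPause` type, the two functions, the English example words, and the two scores are hypothetical illustrations, and the paper itself relies on subjective listening tests rather than these metrics. The sketch assumes at most one filled pause per position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilledPause:
    """Hypothetical container: one filled pause placed before words[position]."""
    position: int  # index of the word that the filled pause precedes
    word: str      # filled pause lexicon item, e.g. "um" or "uh" (illustrative)


def insert_filled_pauses(words, pauses):
    """Return a new word list with each filled pause inserted before words[position]."""
    by_position = {}
    for pause in pauses:
        by_position.setdefault(pause.position, []).append(pause.word)
    output = []
    for index, word in enumerate(words):
        output.extend(by_position.get(index, []))
        output.append(word)
    # A position equal to len(words) means "after the last word".
    output.extend(by_position.get(len(words), []))
    return output


def position_and_word_scores(reference, predicted):
    """Score position and lexicon separately (assumes one pause per position).

    position_recall: fraction of reference positions that were also predicted.
    word_accuracy:   among matched positions, fraction whose word also matches.
    """
    ref = {p.position: p.word for p in reference}
    pred = {p.position: p.word for p in predicted}
    matched = set(ref) & set(pred)
    position_recall = len(matched) / len(ref) if ref else 0.0
    word_accuracy = (
        sum(ref[i] == pred[i] for i in matched) / len(matched) if matched else 0.0
    )
    return position_recall, word_accuracy


words = ["I", "think", "this", "works"]
reference = [FilledPause(0, "um"), FilledPause(2, "uh")]  # speaker's real pauses
predicted = [FilledPause(0, "uh"), FilledPause(2, "uh")]  # right places, one wrong word

with_pauses = insert_filled_pauses(words, predicted)
scores = position_and_word_scores(reference, predicted)
print(with_pauses)
print(scores)
```

In this toy case every predicted position matches the reference while only one of the two words does. That is the kind of output which, by the findings summarized above, would be expected to hurt individuality more than naturalness.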

Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

Adaptive Text to Speech for Spontaneous Style

PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling

Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis

What makes a good pause? Investigating the turn-holding effects of fillers

The Pause Duration Prediction for Mandarin Text-to-speech System

STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent

Occurrences and Durations of Filled Pauses in Relation to Words and Silent Pauses in Spontaneous Speech

User-Driven Voice Generation and Editing through Latent Space Navigation

Building speech corpus with diverse voice characteristics for its prompt-based representation

Though this be hesitant, yet there is method in ’t: Effects of disfluency patterns in neural speech synthesis for cultural heritage presentations

Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation

How does a spontaneously speaking conversational agent affect user behavior?

An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis

Investigation of Deepfake Voice Detection Using Speech Pause Patterns: Algorithm Development and Validation

Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences

Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems

Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference

Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus

Residual-guided Personalized Speech Synthesis based on Face Image