Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari
2023-09-19
Abstract: We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We, therefore, focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbre and speech disfluency. Specifically, we deal with filled pauses, a major source of speech disfluency, which is known to play an important role in speech generation and communication in psychology and linguistics. To comparatively evaluate personalized filled pause insertion and non-personalized filled pause prediction methods, we developed a speech synthesis method with a non-personalized external filled pause predictor trained with a multi-speaker corpus. The results clarify the position-word entanglement of filled pauses, i.e., the necessity of precisely predicting positions for naturalness and the necessity of precisely predicting words for individuality on the evaluation of synthesized speech.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper attempts to address the issue of filled pause prediction and insertion in personalized spontaneous speech synthesis. Specifically, the researchers focus on how to naturally insert filled pauses in synthesized speech to enhance its naturalness and personal characteristics. Filled pauses are very common in spontaneous contexts and can convey important information about the speaker's thought process and communication fluency. Therefore, the researchers aim to develop a method that can personalize the prediction and insertion of filled pauses by incorporating linguistic knowledge, thereby achieving more natural and personalized spontaneous speech synthesis.

### Main Research Objectives:

1. **Personalized Spontaneous Speech Synthesis**: Not only to reproduce the individual's voice timbre but also to replicate the individual's speech disfluencies, especially filled pauses.
2. **Comparison of Personalized and Non-Personalized Filled Pause Prediction Methods**: Evaluate the effectiveness of personalized filled pause insertion (using real filled pauses) and non-personalized filled pause prediction (using a prediction model) through experiments.
3. **Exploration of the Impact of Filled Pause Position and Lexicon on Synthesized Speech**: Investigate the different roles of filled pause position and lexicon in terms of naturalness and individuality.

### Research Background:

- **Existing Technology**: Current text-to-speech synthesis technology in reading style has reached near-human levels, but spontaneous speech synthesis remains a challenge because it needs to handle speech disfluencies such as repetitions, rephrasing, and filled pauses.
- **Importance of Filled Pauses**: Filled pauses play an important role in planning speech and smoothing communication in spontaneous contexts, and different speakers have different choices and positions for filled pauses.

### Methodology:

- **Dataset**: Constructed a multi-speaker spontaneous speech corpus annotated with filled pauses.
- **Model**: Developed a spontaneous speech synthesis method based on a sequence-to-sequence (seq2seq) model, combined with an external filled pause prediction model.
- **Experimental Design**: Conducted subjective evaluation experiments to compare the performance of different methods in terms of naturalness, individuality, and listening effort.

### Experimental Results:

- **Preliminary Evaluation**: Inserting filled pauses significantly improved the performance of synthesized speech in terms of naturalness, individuality, and listening effort.
- **Effectiveness of Filled Pause Prediction**: Using a prediction model to insert filled pauses significantly improved the quality of synthesized speech compared to randomly inserting filled pauses.
- **Impact of Filled Pause Position and Lexicon**: Accurately predicting the position of filled pauses is crucial for improving naturalness, while accurately predicting the lexicon of filled pauses is more important for enhancing individuality.

### Conclusion:

The researchers demonstrated through experiments the importance of filled pauses in personalized spontaneous speech synthesis and proposed a filled pause prediction method that incorporates linguistic knowledge. Future work will focus on further improving the basic performance of spontaneous speech synthesis and automatically constructing spontaneous speech corpora.
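
### Illustrative Sketch:

The summary treats filled pause position and filled pause word as two separate things a predictor can get right or wrong. The minimal Python sketch below makes that separation concrete. It is not the authors' implementation: the function names, the `{token_index: filled_pause_word}` representation, the English placeholders "um" and "uh", and the two objective scores are all assumptions introduced here for illustration. The paper's own evaluation, as described above, is subjective (naturalness, individuality, listening effort).

```python
from typing import Dict, List, Tuple


def insert_filled_pauses(tokens: List[str], fps: Dict[int, str]) -> List[str]:
    """Insert each filled pause before the token at its index.

    An index equal to len(tokens) places the filled pause at the end.
    (Hypothetical representation, not taken from the paper.)
    """
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i in fps:
            out.append(fps[i])
        out.append(tok)
    if len(tokens) in fps:
        out.append(fps[len(tokens)])
    return out


def position_word_scores(
    ref: Dict[int, str], pred: Dict[int, str]
) -> Tuple[float, float]:
    """Score position and word choice separately.

    Returns (position F1, word accuracy over positions present in both).
    These scores are illustrative assumptions, not the paper's metrics.
    """
    shared = set(ref) & set(pred)
    if not shared:
        return 0.0, 0.0
    precision = len(shared) / len(pred)
    recall = len(shared) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    word_acc = sum(ref[i] == pred[i] for i in shared) / len(shared)
    return f1, word_acc


if __name__ == "__main__":
    tokens = ["I", "think", "this", "works"]
    speaker_fps = {0: "um", 2: "uh"}    # what the speaker actually said
    predicted_fps = {0: "uh", 2: "uh"}  # right positions, one wrong word

    print(insert_filled_pauses(tokens, predicted_fps))
    print(position_word_scores(speaker_fps, predicted_fps))
```

In this toy case the predictor places both filled pauses correctly but picks the wrong word for one of them, so the position score is perfect while the word score is not. Under the findings summarized above, such a prediction would be expected to preserve naturalness better than individuality.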