ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

Zhongxu Wang,Yujia Wang,Mingzhu Li,Hua Huang
DOI: https://doi.org/10.1145/3664647.3681097
2024-01-01
Abstract:We devise an articulatory representation-based text-to-speech (TTS) model, ArtSpeech, an explainable and effective network for human-like speech synthesis, by revisiting the sound production system. Current deep TTS models learn acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulation movement. ArtSpeech, on the contrary, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are utilized to represent airflow forced by articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We also design a multi-dimensional style mapping network to extract speaking styles from the articulatory representations, guided by which variation predictors could predict the final mel spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses using the widely recognized speech corpus, such as LJSpeech and LibriTTS datasets, yielding promising similarity enhancement between the generated results and the target speaker's voice and prosody.
What problem does this paper attempt to address?