NRAdapt: Noise-Robust Adaptive Text to Speech Using Untranscribed Data

Ming Cheng, Shun Lei, Dongyang Dai, Zhiyong Wu, Dading Chong
DOI: https://doi.org/10.1109/ijcnn60899.2024.10651038
2024-01-01
Abstract: Voice cloning, also known as personalized voice synthesis, is a significant branch of speech synthesis. It involves synthesizing speech with the same vocal characteristics as the target speaker. The field mainly faces two challenges: (1) typically only a small amount of voice data from the target speaker is available, so the speaker’s timbre must be cloned from limited data; (2) the target speaker’s voice data is often recorded with non-professional equipment in noisy environments, which poses a significant challenge to the robustness of voice cloning systems. In this paper, we propose a noise-robust voice adaptation method based on speaker-independent bottleneck features. The method has three parts: (1) a disentanglement module based on an autoencoder architecture separates the timbre information from the speech data, yielding speaker-independent bottleneck features that retain environmental information; (2) with the help of this module, the Text2BN (Text to Bottleneck) module is trained on high-quality voice data to learn a mapping from text to clean speaker-independent bottleneck features; (3) the decoder is fine-tuned on noisy target-speaker speech to adapt to the target speaker, and is then cascaded with the Text2BN module to synthesize clean audio. The disentanglement module requires neither text transcriptions nor artificially paired clean/noisy datasets, enabling large-scale pre-training on massive real-world, untranscribed, noisy datasets to further enhance the model’s noise robustness and its ability to disentangle timbre. During the cloning phase, there are no requirements on the recording conditions of the target speaker’s speech data or on text transcriptions, which aligns more closely with real application scenarios. Since our model does not directly work on the noise level, it effectively avoids issues of insufficient robustness to out-of-distribution noise. Experiments demonstrate that the method can effectively use noisy target-speaker speech data for voice cloning, achieving advantages in both speech quality and speaker similarity in the synthesized speech.
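The three stages described in the abstract can be summarized as a small data-flow sketch. This is only an illustration of which module is trained on which kind of data, not the authors' implementation: the names `Stage`, `bn_encoder`, `decoder`, and `text2bn` are hypothetical labels, and the assumption that Text2BN training uses transcribed data follows only from its stated role of mapping text to bottleneck features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage; all names here are illustrative assumptions."""
    name: str
    updated: tuple                  # modules whose weights change in this stage
    data: str                       # kind of speech data the stage consumes
    needs_transcripts: bool
    needs_paired_clean_noisy: bool


STAGES = (
    # (1) Autoencoder-style disentanglement module: pre-trained on
    #     untranscribed, noisy, real-world speech.
    Stage("pretrain_disentanglement", ("bn_encoder", "decoder"),
          "large-scale noisy untranscribed speech", False, False),
    # (2) Text2BN: maps text to clean speaker-independent bottleneck features.
    #     Transcripts are assumed here because the input is text.
    Stage("train_text2bn", ("text2bn",),
          "high-quality speech", True, False),
    # (3) Speaker adaptation: only the decoder is fine-tuned.
    Stage("adapt_decoder", ("decoder",),
          "noisy target-speaker speech", False, False),
)

# Inference cascade: text -> Text2BN -> clean bottleneck features
# -> adapted decoder -> audio.
INFERENCE_CHAIN = ("text2bn", "decoder")


def stages_needing_transcripts(stages=STAGES):
    """Names of the stages that consume text transcriptions."""
    return [s.name for s in stages if s.needs_transcripts]


def modules_updated_for_new_speaker(stages=STAGES):
    """Modules fine-tuned when cloning a new target speaker (last stage)."""
    return list(stages[-1].updated)
```

Under these assumptions, `stages_needing_transcripts()` lists only the Text2BN stage, which mirrors the abstract's claim that neither pre-training nor cloning requires transcribed or paired clean/noisy data.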