Automatic Voice Identification after Speech Resynthesis using PPG

Thibault Gaudier, Marie Tahon, Anthony Larcher, Yannick Estève
2024-08-05
Abstract: Speech resynthesis is a generic task for which we want to synthesize audio with another audio as input, which finds applications for media monitors and journalists. Among different tasks addressed by speech resynthesis, voice conversion preserves the linguistic information while modifying the identity of the speaker, and speech edition preserves the identity of the speaker but some words are modified. In both cases, we need to disentangle speaker and phonetic contents in intermediate representations. Phonetic PosteriorGrams (PPG) are a frame-level probabilistic representation of phonemes, and are usually considered speaker-independent. This paper presents a PPG-based speech resynthesis system. A perceptive evaluation assesses that it produces correct audio quality. Then, we demonstrate that an automatic speaker verification model is not able to recover the source speaker after re-synthesis with PPG, even when the model is trained on synthetic data.
Sound, Artificial Intelligence, Neural and Evolutionary Computing, Audio and Speech Processing, Signal Processing
What problem does this paper attempt to address?
The core question of this paper is whether an Automatic Speaker Verification (ASV) system can still recognize the source speaker after speech has been resynthesized from Phonetic PosteriorGrams (PPG). Specifically, the authors explore how much source-speaker identity information the PPG representation contains, and verify whether PPG effectively hides the identity of the source speaker when speech is resynthesized in the voice of a specific target speaker. To this end, the researchers carried out the following tasks:

1. **Construct a PPG-based speech resynthesis system**: They trained a PPG2Mel network that generates the speech of the target speaker from PPG. The process extracts PPG from natural speech, generates a mel-spectrogram with the PPG2Mel network, and converts it into a time-domain audio signal with a vocoder such as WaveGlow.
2. **Perceptual quality assessment**: To ensure that the audio produced by the PPG2Mel system is good enough for the subsequent analysis, the researchers conducted a subjective quality assessment on four types of audio samples: natural audio, audio processed only by the vocoder, text-based TTS audio, and PPG-based PPG2Mel audio. The results show that, although the vocoder introduces some quality loss, the overall quality of the audio generated by PPG2Mel is acceptable.
3. **Speaker verification experiments**:
   - **Naive ASV model test**: First, a naive ASV model, without any special training, was used to try to identify the source speaker in the audio after PPG resynthesis. The naive model cannot identify the source speaker, indicating that PPG resynthesis hides the identity characteristics of the source speaker.
   - **Informed ASV model test**: Next, the researchers trained an informed ASV model specifically to identify the source speaker in PPG-resynthesized audio. Although this model performs well on natural speech, it still cannot distinguish the source speaker in PPG-resynthesized audio, which further confirms the result of the naive test.

In summary, the main contribution of this paper is to show that PPG-based resynthesis hides the identity information of the source speaker, even from an ASV model trained on synthetic data. This is relevant for application scenarios that require privacy protection or anonymization.
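
To make the speaker verification experiments more concrete, the following is a minimal sketch of how verification trials are commonly scored. It is not the authors' implementation: the use of cosine similarity between speaker embeddings, the equal error rate (EER) as a metric, and the function names `cosine_score` and `equal_error_rate` are all assumptions made for illustration, since the summary above names neither the scoring method nor the metric.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (assumed scoring rule)."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of trials where both utterances share the same source speaker.
    nontarget_scores: scores of trials with different source speakers.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(nontarget_scores >= t)  # false acceptance rate at threshold t
        frr = np.mean(target_scores < t)      # false rejection rate at threshold t
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


# Hypothetical toy scores, not data from the paper.
separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
overlapping = equal_error_rate([0.1, 0.9], [0.1, 0.9])
```

Under these assumptions, an EER close to 0 means the model separates same-speaker trials from different-speaker trials, while an EER close to 0.5 corresponds to chance level, which is the behavior one would expect from a model that cannot recover the source speaker.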