Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization

Sotheara Leang, Anderson Augusma, Eric Castelli, Frédérique Letué, Sethserey Sam, Dominique Vaufreydaz
2024-09-24
Abstract: Human speech conveys prosody, linguistic content, and speaker identity. This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to deal with these speech components. This approach is designed to disentangle these components to specifically target and modify the speaker identity while preserving the linguistic and emotional content. To do so, three separate branches compute embeddings for content, prosody, and speaker identity respectively. During synthesis, the decoder of the proposed architecture takes these embeddings and is conditioned on both speaker and prosody information, which allows it to capture more nuanced emotional states and to adjust the speaker identity more precisely. Findings indicate that this method outperforms most baseline techniques in preserving emotional information. However, it exhibits more limited performance on other voice privacy tasks, emphasizing the need for further improvements.
Computer Vision and Pattern Recognition, Signal Processing
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the key challenge in **speaker anonymization**: protecting the privacy of the speaker's identity while preserving the emotional information and linguistic content of the speech. Specifically, the authors propose a method based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE), combined with prosodic parameters (such as fundamental frequency F0 and energy), to achieve more effective speaker anonymization.

### Main problems:

1. **Protecting speaker privacy**: How to remove or modify the speaker's characteristics in the speech so that the processed signal no longer leaks the speaker's identity.
2. **Preserving emotional and linguistic content**: How to keep the emotional information and linguistic content intact during anonymization, so that the anonymized speech remains natural and intelligible.

### Method overview:

To address the above challenges, the authors propose an end-to-end network architecture that includes the following modules:

- **Content module**: Extracts the content information of the speech through an encoder and vector quantization (VQ).
- **Prosody module**: Extracts and processes prosodic parameters such as fundamental frequency (F0) and energy to improve the model's ability to capture emotional expression.
- **Anonymization module**: Uses the pre-trained ECAPA-TDNN model to generate a pseudo-x-vector that replaces the original x-vector, thereby changing the speaker's characteristics.
- **Decoder module**: Uses the HiFiGAN vocoder to synthesize the final anonymized speech.

### Key innovation points:

- **Combining VQ-VAE and prosodic parameters**: Introducing prosodic parameters (F0 and energy) improves the model's ability to capture emotional expression while also supporting anonymization.
- **Multi-branch structure**: The embeddings for content, prosody, and speaker identity are computed separately, which helps the model disentangle these speech characteristics.
- **Generation of pseudo-x-vector**: A pseudo-x-vector is generated by randomly selecting among the x-vectors farthest from the original, further obscuring the speaker's identity.

### Experimental results:

The experimental results show that this method outperforms most baseline techniques in preserving emotional information, but its performance on other voice privacy tasks is more limited. This indicates that future research needs to further improve the anonymization performance of the model, especially the disentanglement of information when large codebooks are used.

In conclusion, this paper proposes a speaker anonymization method that combines VQ-VAE with prosodic parameters, with the aim of protecting speaker privacy while preserving emotional and linguistic content.
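
The pseudo-x-vector step described above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not state the distance measure, the pool size, or how many candidates are drawn, so the cosine distance, the `n_farthest` and `n_select` parameters, and the averaging of the selected vectors are all assumptions made for illustration. The function names `cosine_distance` and `pseudo_xvector` are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pseudo_xvector(source, pool, n_farthest=2, n_select=1, seed=None):
    """Build a pseudo-x-vector for `source` from a pool of other x-vectors.

    Assumed procedure (illustrative only):
      1. rank the pool by cosine distance from the source x-vector,
      2. keep the `n_farthest` most distant candidates,
      3. randomly pick `n_select` of them and average the picks.
    Returns the averaged vector and the sorted indices that were picked.
    """
    rng = random.Random(seed)
    # Indices of pool vectors, most distant from the source first.
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine_distance(source, pool[i]),
        reverse=True,
    )
    candidates = ranked[:n_farthest]
    chosen = rng.sample(candidates, n_select)
    # Element-wise mean of the chosen x-vectors.
    mean = [
        sum(pool[i][d] for i in chosen) / n_select
        for d in range(len(source))
    ]
    return mean, sorted(chosen)


if __name__ == "__main__":
    # Toy 2-dimensional "x-vectors"; real speaker embeddings are much larger.
    source = [1.0, 0.0]
    pool = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [-1.0, -1.0]]
    vector, picked = pseudo_xvector(source, pool, n_farthest=2, n_select=2)
    print(picked, vector)
```

In a full system, the returned vector would replace the original x-vector as the speaker conditioning passed to the decoder; the random draw among distant candidates is what keeps the mapping from original to pseudo-speaker from being deterministic.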