Abstract: Human speech conveys prosody, linguistic content, and speaker identity. This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to deal with these speech components. This approach is designed to disentangle these components to specifically target and modify the speaker identity while preserving the linguistic and emotional content. To do so, three separate branches compute embeddings for content, prosody, and speaker identity respectively. During synthesis, taking these embeddings, the decoder of the proposed architecture is conditioned on both speaker and prosody information, allowing for capturing more nuanced emotional states and precise adjustments to speaker identification. Findings indicate that this method outperforms most baseline techniques in preserving emotional information. However, it exhibits more limited performance on other voice privacy tasks, emphasizing the need for further improvements.

What problem does this paper attempt to address?

The paper addresses a key challenge in **speaker anonymization**: preserving the emotional information and linguistic content of speech while protecting the speaker's identity. Specifically, the authors propose a method based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE), combined with prosodic parameters (such as fundamental frequency F0 and energy), to achieve more effective speaker anonymization.

### Main problems:

1. **Protecting speaker privacy**: How to ensure that removing or modifying the speaker's characteristics in the speech does not leak the speaker's identity.
2. **Preserving emotional and linguistic content**: How to keep the emotional information and linguistic content of the speech intact during anonymization, so that the anonymized speech remains natural and intelligible.

### Method overview:

To address these challenges, the authors propose an end-to-end network architecture with the following modules:

- **Content module**: Extracts the content information of the speech through an encoder and vector quantization (VQ).
- **Prosody module**: Extracts and processes prosodic parameters such as fundamental frequency (F0) and energy to improve the model's ability to capture emotional expression.
- **Anonymization module**: Uses a pre-trained ECAPA-TDNN model to generate a pseudo-x-vector that replaces the original x-vector, thereby changing the speaker's characteristics.
- **Decoder module**: Uses the HiFiGAN vocoder to synthesize the final anonymized speech.

### Key innovation points:

- **Combining VQ-VAE and prosodic parameters**: Introducing prosodic parameters (F0 and energy) improves the model's ability to capture emotional expression while also strengthening anonymization.
- **Multi-branch structure**: Embeddings for content, prosody, and speaker identity are computed separately, which helps the model disentangle these speech characteristics.
- **Generation of pseudo-x-vector**: A pseudo-x-vector is generated by randomly selecting among the x-vectors farthest from the original, further obscuring the speaker's identity.

### Experimental results:

The experiments show that the method outperforms most baselines at preserving emotional information, but its performance on some privacy protection tasks falls short of other methods. This indicates that future work needs to improve the model's anonymization performance, particularly the disentanglement of information when large codebooks are used.

In conclusion, the paper proposes a new speaker anonymization method that combines VQ-VAE with prosodic parameters, aiming to protect speaker privacy while preserving emotional and linguistic content.
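Two steps from the method overview above can be illustrated with a minimal NumPy sketch: the nearest-codebook lookup at the heart of vector quantization, and the selection of a pseudo-x-vector far from the source speaker. This is not the authors' implementation. The function names, the use of cosine distance, the `n_farthest` parameter, and the idea of a fixed `pool` of candidate x-vectors are all assumptions made for illustration; the summary only says that a far-away x-vector is chosen at random. The trained encoder, the ECAPA-TDNN extractor, and the HiFiGAN decoder are not reproduced here.

```python
import numpy as np


def quantize(frames, codebook):
    """Replace each frame (T, D) by its nearest codebook vector (K, D)."""
    # Squared Euclidean distance between every frame and every code: (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices


def cosine_distance(vector, pool):
    """Cosine distance between one vector (D,) and each row of pool (N, D)."""
    vector_n = vector / np.linalg.norm(vector)
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    return 1.0 - pool_n @ vector_n


def pick_pseudo_xvector(source, pool, n_farthest=3, rng=None):
    """Pick one pool vector at random among the n_farthest farthest from source.

    Assumption: "farthest" is measured with cosine distance, and the random
    choice is uniform over the n_farthest candidates.
    """
    rng = np.random.default_rng() if rng is None else rng
    dists = cosine_distance(source, pool)
    farthest = np.argsort(dists)[-n_farthest:]  # indices of the largest distances
    choice = int(rng.choice(farthest))
    return pool[choice], choice


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins: 5 encoder frames of dimension 4, a codebook of 8 codes,
    # and a pool of 10 speaker vectors of dimension 16.
    frames = rng.normal(size=(5, 4))
    codebook = rng.normal(size=(8, 4))
    _, code_ids = quantize(frames, codebook)
    print("code indices:", code_ids)

    source_xvector = rng.normal(size=16)
    pool = rng.normal(size=(10, 16))
    _, chosen = pick_pseudo_xvector(source_xvector, pool, n_farthest=3, rng=rng)
    print("chosen pool index:", chosen)
```

In a full system, the quantized content frames, the prosody features (F0 and energy), and the selected pseudo-x-vector would together condition the decoder. Raising `n_farthest` adds randomness to the selection while allowing candidates that lie closer to the source speaker.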

Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization
