Abstract: The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker's identity while retaining critical information for subsequent tasks. One approach to achieving this is through voice conversion. While existing methods often emphasize complex architectures and training techniques, our research underscores the importance of loss functions inspired by the human auditory system. Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations. Through objective and subjective evaluations, we demonstrate that a VQVAE-based model, enhanced with our perception-driven losses, surpasses the vanilla model in terms of naturalness, intelligibility, and prosody while maintaining speaker anonymity. These improvements are consistently observed across various datasets, languages, target speakers, and genders.

What problem does this paper attempt to address?

This paper addresses the problem of improving speech quality in voice anonymization. Specifically, the authors use Voice Conversion (VC) to obscure the speaker's identity while preserving the naturalness, intelligibility, and prosody of the speech. Existing methods often rely on complex architectures and training techniques, whereas this paper emphasizes perceptually-driven loss functions that better capture the speech features relevant to the human auditory system.

### Main Contributions of the Paper:

1. **Proposing Perceptually-Driven Loss Functions**: The authors propose two types of perceptually-driven loss functions: a handcrafted feature-based loss and a representation-based loss. These losses introduce an inductive bias toward higher-fidelity speech reconstruction.
2. **Improving Speech Quality**: Through experimental validation, the authors demonstrate that the perceptually-driven loss functions improve the naturalness, intelligibility, and prosody of the converted speech across different datasets, languages, target speakers, and genders while maintaining speaker anonymity.
3. **Model Agnosticism**: The proposed loss functions can be applied to any model. The paper demonstrates them on a Vector Quantized Variational Autoencoder (VQVAE)-based model because VQVAE is relatively simple to train.

### Main Methods:

- **Handcrafted Feature-Based Loss**: A loss computed on formants, the resonance frequencies of the vocal tract, which play a key role in defining vowel characteristics.
- **Representation-Based Loss**: A loss computed on intermediate representations of self-supervised deep learning models, which capture key aspects of speech quality such as timbre, prosody, clarity, and background noise.
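As a rough illustration of how the two losses described above might be combined with a model's existing reconstruction loss, here is a minimal pure-Python sketch. The function names, the use of a mean absolute error, and the weights `w_formant` and `w_rep` are all assumptions made for illustration; the summary does not specify the paper's exact formulation. The formant tracks and layer representations are assumed to have been extracted elsewhere, and a real training setup would compute these terms in an autodiff framework so that gradients can flow back to the model.

```python
from typing import Sequence

Frame = Sequence[float]


def mean_abs_error(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Mean absolute difference between two equally shaped frame sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same dimension")
        total += sum(abs(x - y) for x, y in zip(frame_a, frame_b))
        count += len(frame_a)
    return total / count if count else 0.0


def perceptual_loss(
    recon_loss: float,
    formants_ref: Sequence[Frame],
    formants_out: Sequence[Frame],
    reps_ref: Sequence[Sequence[Frame]],
    reps_out: Sequence[Sequence[Frame]],
    w_formant: float = 1.0,  # hypothetical weight, not taken from the paper
    w_rep: float = 1.0,      # hypothetical weight, not taken from the paper
) -> float:
    """Add a formant term and a representation term to a base loss.

    formants_*: per-frame formant frequencies (e.g. F1, F2) of the
                reference and converted speech.
    reps_*:     one frame sequence per chosen layer of a self-supervised
                speech model, for reference and converted speech.
    """
    if len(reps_ref) != len(reps_out):
        raise ValueError("need the same number of layers for both inputs")
    # Handcrafted feature-based term: distance between formant tracks.
    formant_term = mean_abs_error(formants_ref, formants_out)
    # Representation-based term: distance averaged over the chosen layers.
    layer_terms = [mean_abs_error(r, o) for r, o in zip(reps_ref, reps_out)]
    rep_term = sum(layer_terms) / len(layer_terms) if layer_terms else 0.0
    return recon_loss + w_formant * formant_term + w_rep * rep_term
```

Because the extra terms depend only on the reference and converted signals, not on the model's internals, the same function could in principle wrap the loss of any voice conversion model, which is the sense in which such losses are model-agnostic.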
### Experimental Results:

- **Objective Evaluation**: The models were evaluated with Character Error Rate (CER) and Equal Error Rate (EER) across multiple datasets and scenarios. The results show that the perceptually-driven loss functions improve speech intelligibility while maintaining anonymity.
- **Subjective Evaluation**: User studies assessed the naturalness, prosody retention, intelligibility, and anonymity of the converted speech. Most participants rated the models trained with perceptually-driven loss functions higher on naturalness, prosody retention, and intelligibility.

### Conclusion:

This paper proposes model-agnostic, perceptually-driven loss functions that improve the quality of voice conversion without increasing model complexity. By integrating knowledge related to speech quality, these loss functions enhance the performance of VQVAE models and apply to conversion scenarios across different corpora, genders, accents, and languages.
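The two objective metrics named in the experimental results can be sketched as follows. This is a generic, simplified illustration rather than the paper's evaluation pipeline: CER is computed as the character-level edit distance divided by the reference length, and EER is approximated by sweeping thresholds over the observed verification scores. In practice the hypothesis text would come from a speech recognizer and the scores from a speaker verification system, neither of which is specified in this summary.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def eer(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Approximate Equal Error Rate from verification scores.

    genuine:  scores for same-speaker trials (higher means more similar).
    impostor: scores for different-speaker trials.
    Returns the average of the false acceptance and false rejection rates
    at the observed threshold where the two are closest.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_rate = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate
```

For anonymization, a lower CER on the converted speech indicates better preserved intelligibility, while a higher EER for a verifier trying to link converted speech to the original speaker indicates stronger anonymity.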

Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses

Improving Voice Conversion for Dissimilar Speakers Using Perceptual Losses

Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example

Speaker Anonymization for Personal Information Protection Using Voice Conversion Techniques

V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization

Two-Stage Voice Anonymization for Enhanced Privacy

Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques

Deep Learning-based F0 Synthesis for Speaker Anonymization

Voice Conversion-based Privacy through Adversarial Information Hiding

Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding

Privacy-Utility Balanced Voice De-Identification Using Adversarial Examples

Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard

Evaluation of Speaker Anonymization on Emotional Speech

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation

End-to-end streaming model for low-latency speech anonymization

Preserving spoken content in voice anonymisation with character-level vocoder conditioning

Distinguishable Speaker Anonymization Based on Formant and Fundamental Frequency Scaling