Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses

Suhita Ghosh, Tim Thiele, Frederic Lorbeer, Frank Dreyer, Sebastian Stober
2024-10-21
Abstract: The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker's identity while retaining critical information for subsequent tasks. One approach to achieving this is through voice conversion. While existing methods often emphasize complex architectures and training techniques, our research underscores the importance of loss functions inspired by the human auditory system. Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations. Through objective and subjective evaluations, we demonstrate that a VQVAE-based model, enhanced with our perception-driven losses, surpasses the vanilla model in terms of naturalness, intelligibility, and prosody while maintaining speaker anonymity. These improvements are consistently observed across various datasets, languages, target speakers, and genders.
Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of improving speech quality in speech anonymization. Specifically, the authors use Voice Conversion (VC) to conceal the speaker's identity while preserving the naturalness, intelligibility, and prosody of the speech. Existing methods often rely on complex architectures and training techniques, whereas this paper emphasizes perceptually-driven loss functions, which better capture the speech features that matter to the human auditory system.

### Main Contributions of the Paper:

1. **Proposing Perceptually-Driven Loss Functions**: The authors propose two types of perceptually-driven loss functions: a handcrafted feature-based loss and a representation-based loss. These losses introduce an inductive bias that steers the model toward higher-fidelity speech reconstruction.
2. **Improving Speech Quality**: The experiments show that the perceptually-driven losses improve the naturalness, intelligibility, and prosody of speech across different datasets, languages, target speakers, and genders while maintaining speaker anonymity.
3. **Model Agnosticism**: The proposed loss functions can be applied to any model. The paper demonstrates them on a Vector Quantized Variational Autoencoder (VQVAE)-based model because VQVAEs are relatively simple to train.

### Main Methods:

- **Handcrafted Feature-Based Loss**: Computing a loss on formants, the resonance frequencies of the vocal tract, which play a key role in defining vowel characteristics.
- **Representation-Based Loss**: Using intermediate representations of self-supervised deep learning models to capture key aspects of speech quality, such as timbre, prosody, clarity, and background noise.
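
The two loss terms described above can be sketched as simple distance computations. This is a minimal illustration, not the authors' implementation: it assumes the formant tracks and the intermediate representations have already been extracted by external tools, and the function names, the L1 distance, and the weights `w_formant` and `w_repr` are all hypothetical choices made for the example.

```python
import numpy as np


def formant_loss(formants_ref, formants_recon):
    """Mean absolute difference between two formant tracks.

    Both inputs have shape (frames, n_formants) and hold frequencies in Hz.
    They are assumed to come from an external formant estimator (not shown).
    """
    ref = np.asarray(formants_ref, dtype=float)
    recon = np.asarray(formants_recon, dtype=float)
    return float(np.mean(np.abs(ref - recon)))


def representation_loss(layers_ref, layers_recon):
    """Average L1 distance over matching intermediate layers.

    Each input is a list of arrays of shape (frames, dim), standing in for
    hidden states of a frozen self-supervised speech model (assumption).
    """
    per_layer = [
        np.mean(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))
        for a, b in zip(layers_ref, layers_recon)
    ]
    return float(np.mean(per_layer))


def perception_informed_loss(base_loss, formants_ref, formants_recon,
                             layers_ref, layers_recon,
                             w_formant=1.0, w_repr=1.0):
    """Add the two perceptual terms to an existing model loss.

    base_loss is whatever the underlying model already optimizes; the
    weights are hypothetical hyperparameters, not values from the paper.
    """
    return (base_loss
            + w_formant * formant_loss(formants_ref, formants_recon)
            + w_repr * representation_loss(layers_ref, layers_recon))
```

Because both terms are simply added to an existing objective, the sketch leaves the model architecture untouched, which is what makes this kind of loss model-agnostic. In an actual training setup the same arithmetic would be written in a differentiable framework so that gradients reach the model; the NumPy version only illustrates the computation.
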
### Experimental Results:

- **Objective Evaluation**: The models were evaluated with metrics such as Character Error Rate (CER) and Equal Error Rate (EER) across multiple datasets and scenarios. The results show that the perceptually-driven loss functions improve speech intelligibility while maintaining anonymity.
- **Subjective Evaluation**: User studies assessed the naturalness, prosody retention, intelligibility, and anonymity of the converted speech. Most participants rated the models trained with perceptually-driven loss functions higher in naturalness, prosody retention, and intelligibility.

### Conclusion:

This paper proposes model-agnostic, perceptually-driven loss functions that improve the quality of voice conversion without increasing model complexity. By integrating knowledge related to speech quality, these loss functions enhance the performance of VQVAE models and apply to conversion scenarios across different corpora, genders, accents, and languages.
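
The two objective metrics named in the experimental results, CER and EER, can be computed as in the sketch below. It is a generic illustration rather than the paper's evaluation pipeline: it assumes transcripts come from some speech recognizer and similarity scores from some speaker-verification system, and the function names are hypothetical. In anonymization evaluations, a lower CER is typically read as better intelligibility and a higher EER of the verification system as stronger anonymity.

```python
import numpy as np


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character edits needed to turn hypothesis into reference, per reference character."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Scores are similarities: a trial is accepted when score >= threshold.
    Returns the mean of the false acceptance and false rejection rates at
    the threshold where the two are closest.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = float("inf"), 1.0
    for t in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)  # impostor trials wrongly accepted
        frr = np.mean(genuine < t)    # genuine trials wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)
```

The threshold sweep only visits observed scores, so on small trial lists the result is a coarse approximation; evaluation toolkits usually interpolate between operating points.
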