Abstract: A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.

What problem does this paper attempt to address?

The paper explores how to adapt general disentanglement-based speaker anonymization systems so that they preserve emotional information better while still protecting speaker identity. It proposes two strategies to improve emotion preservation:

1. **Integrating Pre-trained Emotion Encoders**: Emotion embeddings are extracted with a pre-trained emotion encoder and fed into the anonymized speech generation process, which helps retain emotional cues. This improves emotion preservation but slightly weakens privacy protection, because emotional features inevitably contain speaker characteristics.
2. **Emotion Compensation Strategy**: As a post-processing step, this strategy modifies already anonymized speaker embeddings to reintroduce emotional traits that were lost or weakened during speaker embedding anonymization. The authors hypothesize that the latent space of speaker embeddings contains specific directions that can be manipulated to adjust basic emotion types (e.g., happiness). To find these directions, they use Support Vector Machines (SVMs): a separate SVM is trained for each basic emotion (e.g., happiness, anger, neutral, sadness) to classify whether a speaker embedding carries that emotion. During inference, the Orthogonal Householder Neural Network (OHNN) anonymizes the speaker embedding, while an emotion indicator predicts the emotion from the original speaker embedding. The SVM matching the predicted emotion is then selected, and the anonymized speaker embedding is shifted along the normal vector of that SVM's hyperplane to compensate for the weakened emotional attributes.

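
The inference step of the emotion compensation strategy can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes linear SVMs whose hyperplanes `(w, b)` were trained elsewhere, it replaces the emotion indicator with a hard-coded label, and it uses random vectors in place of real speaker embeddings. The names `compensate`, `signed_distance`, `hyperplanes`, and the step size `alpha` are hypothetical.

```python
import numpy as np

EMOTIONS = ("happy", "angry", "neutral", "sad")


def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane w.x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)


def compensate(anon_emb, emotion, hyperplanes, alpha=1.0):
    """Shift an anonymized embedding along the unit normal of the
    hyperplane that belongs to the predicted emotion.

    alpha is a hypothetical step size; the summary does not say how
    far the embedding is moved.
    """
    w, _ = hyperplanes[emotion]
    unit_normal = w / np.linalg.norm(w)
    return anon_emb + alpha * unit_normal


# Toy stand-ins (assumptions): random hyperplanes instead of trained SVMs,
# and a random vector instead of a real anonymized speaker embedding.
rng = np.random.default_rng(0)
dim = 8
hyperplanes = {e: (rng.normal(size=dim), float(rng.normal())) for e in EMOTIONS}
anon_emb = rng.normal(size=dim)

# Stand-in for the emotion indicator applied to the ORIGINAL embedding.
predicted_emotion = "happy"

before = signed_distance(anon_emb, *hyperplanes[predicted_emotion])
compensated = compensate(anon_emb, predicted_emotion, hyperplanes, alpha=0.5)
after = signed_distance(compensated, *hyperplanes[predicted_emotion])
```

Because the shift follows the unit normal, the signed distance to the selected hyperplane grows by exactly `alpha`, which pushes the embedding further onto the side the SVM associates with that emotion. How large a shift is acceptable before speaker privacy or speech quality degrades is a design choice this sketch does not address.
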
Additionally, the paper reviews related work, including the task requirements of the Voice Privacy Challenge 2024, its evaluation metrics, and several baseline speaker anonymization methods. Among them, the Orthogonal Householder Neural Network (OHNN) is chosen as the base system to improve, because it showed a good balance between privacy and utility in previous evaluations. In summary, the paper addresses how to make speaker anonymization systems both hide speaker identity effectively and better retain emotion and other paralinguistic attributes.
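
The summary does not describe the internals of the OHNN. As its name suggests, it builds on Householder reflections, and the sketch below only illustrates the underlying linear-algebra fact: a product of Householder reflections is an orthogonal matrix, so applying it to an embedding preserves the embedding's norm. The reflection vectors here are random placeholders (an assumption); in a trained network they would be learned parameters, and the real system may include components not shown here.

```python
import numpy as np


def householder(v):
    """Householder reflection matrix H = I - 2 v v^T / (v^T v)."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - 2.0 * np.outer(v, v)


rng = np.random.default_rng(0)
dim = 6

# Compose a few reflections; random vectors stand in for learned ones.
W = np.eye(dim)
for _ in range(3):
    W = W @ householder(rng.normal(size=dim))

x = rng.normal(size=dim)  # placeholder for an original speaker embedding
y = W @ x                 # transformed embedding

is_orthogonal = np.allclose(W.T @ W, np.eye(dim))
norm_preserved = np.isclose(np.linalg.norm(x), np.linalg.norm(y))
```
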

Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation
