Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh, Ian Mcloughlin
2024-08-12
Abstract: A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper explores how to adapt a general disentanglement-based speaker anonymization system so that it preserves emotional information better while still protecting the speaker's identity. It proposes two strategies for improving emotion preservation:
1. **Integrating Pre-trained Emotion Encoders**: Emotion embeddings are extracted with a pre-trained emotion encoder and fed into the anonymized speech generation process, which helps retain emotional cues. This improves emotion preservation but slightly weakens privacy protection, because emotion features inevitably carry some speaker characteristics.
2. **Emotion Compensation Strategy**: As a post-processing step, this strategy modifies speaker embeddings that have already been anonymized, reintroducing emotional traits that were lost or weakened during speaker embedding anonymization. The authors hypothesize that the latent space of speaker embeddings contains specific directions that can be manipulated to adjust basic emotion types (e.g., happiness), and they use Support Vector Machines (SVMs) to find these directions. A separate SVM is trained for each basic emotion (e.g., happiness, anger, neutral, sadness) to classify whether a speaker embedding carries that emotion. During inference, the Orthogonal Householder Neural Network (OHNN) anonymizes the speaker embedding, while an emotion indicator predicts the emotion from the original speaker embedding. The SVM matching the predicted emotion is then selected, and the anonymized speaker embedding is shifted along the normal vector of that SVM's hyperplane to compensate for the weakened emotional attributes.
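The compensation step described above can be illustrated with a small sketch. It is not the authors' implementation: the 4-dimensional toy embeddings, the function names (`train_linear_svm`, `compensate`, `make_toy_data`), the step size `alpha`, and the plain-Python hinge-loss trainer are all assumptions made for illustration. The emotion indicator is approximated here by picking the SVM with the highest score, and a fixed vector stands in for the OHNN output. In practice one would more likely fit the linear SVMs with an existing library.

```python
import math
import random

EMOTIONS = ["happy", "angry", "neutral", "sad"]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(X, y, lr=0.05, lam=0.01, epochs=100, seed=0):
    """Fit w, b by SGD on the L2-regularized hinge loss; labels y are +1/-1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * (dot(w, X[i]) + b) < 1.0
            # L2 shrinkage is applied on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # The hinge term only contributes when the margin is violated.
            if violated:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def unit_normal(w):
    """Unit-length normal vector of the hyperplane w.x + b = 0."""
    norm = math.sqrt(dot(w, w))
    return [wj / norm for wj in w]


def compensate(anon_emb, w, alpha):
    """Shift an anonymized embedding by alpha along the SVM normal."""
    n = unit_normal(w)
    return [e + alpha * nj for e, nj in zip(anon_emb, n)]


def make_toy_data(n_per_class=30, seed=0):
    """Toy 'speaker embeddings': one noisy cluster per emotion (assumption)."""
    rng = random.Random(seed)
    X, labels = [], []
    for k, emo in enumerate(EMOTIONS):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 0.3) for _ in EMOTIONS]
            x[k] += 2.0
            X.append(x)
            labels.append(emo)
    return X, labels


# One binary (one-vs-rest) SVM per emotion.
X, labels = make_toy_data()
svms = {
    emo: train_linear_svm(X, [1 if lab == emo else -1 for lab in labels])
    for emo in EMOTIONS
}


def score(emb, emo):
    """Signed SVM decision value of an embedding for one emotion."""
    w, b = svms[emo]
    return dot(w, emb) + b


original = [0.1, 2.1, -0.2, 0.0]    # toy original speaker embedding
anonymized = [0.3, 0.4, 0.2, 0.1]   # stand-in for the anonymizer output

# Stand-in emotion indicator: the emotion whose SVM scores highest
# on the ORIGINAL embedding.
predicted = max(EMOTIONS, key=lambda emo: score(original, emo))

# Compensate the ANONYMIZED embedding along the matched SVM's normal.
compensated = compensate(anonymized, svms[predicted][0], alpha=1.0)
```

Because the shift follows the hyperplane normal, the decision value of the matched SVM rises by `alpha` times the norm of `w`, so the compensated embedding always sits further on the positive side of that emotion's boundary than the anonymized one did. How large `alpha` can be before speaker information leaks back in is a trade-off this sketch does not address.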
Additionally, the paper reviews related work, including the task requirements of the VoicePrivacy Challenge 2024, its evaluation metrics, and several baseline speaker anonymization methods. Among these, OHNN (Orthogonal Householder Neural Network) is chosen as the base system to improve, because it showed a good balance between privacy and utility in previous evaluations. In summary, the paper addresses how to make speaker anonymization systems not only hide speaker identity effectively but also better retain emotion and other paralinguistic attributes.