Abstract: Speech data carries a range of personal information, such as the speaker's identity and emotional state. These attributes can be used for malicious purposes. With the development of virtual assistants, a new generation of privacy threats has emerged. Current studies have addressed the topic of preserving speech privacy. One of them, the VoicePrivacy initiative, aims to promote the development of privacy preservation tools for speech technology. The task selected for the VoicePrivacy 2020 Challenge (VPC) is speaker anonymization. The goal is to hide the source speaker's identity while preserving the linguistic information. The baseline of the VPC makes use of voice conversion. This paper studies the impact of the speaker anonymization baseline system of the VPC on emotional information present in speech utterances. Evaluation is performed following the VPC rules regarding the attackers' knowledge about the anonymization system. Our results show that the VPC baseline system does not suppress speakers' emotions against informed attackers. When comparing anonymized speech to original speech, the emotion recognition performance is degraded by 15% relative on IEMOCAP data, similar to the degradation observed for automatic speech recognition used to evaluate the preservation of the linguistic information.
### Problems the paper attempts to solve
This paper evaluates how the speaker anonymization baseline system of the VoicePrivacy Challenge (VPC) affects the emotional information carried in speech. The main research questions are:
1. **Speaker identity hiding**: How can the original speaker's identity be hidden while the linguistic information is preserved?
2. **Retention or removal of emotional information**: Does anonymization affect the emotional information in the speech, and if so, to what extent?
3. **Robustness against attackers**: How does the anonymization system perform against attackers with different levels of knowledge about it (ignorant attackers and informed attackers)?
#### Background and motivation
With the spread of voice-controlled applications (such as smart speakers), large amounts of voice data are collected, processed, and stored on centralized servers. Voice data carries a great deal of sensitive personal information, such as age, gender, health status, personality traits, socioeconomic status, geographical origin, biometric identity, and emotion. Protecting voice privacy has therefore become crucial.
In addition, recent regulations, such as the General Data Protection Regulation (GDPR) in the European Union, emphasize the protection of personal data. To meet these challenges, the VPC framework provides a set of dedicated protocols, metrics, datasets, and baseline models for evaluating voice privacy protection technologies.
### Research objectives
The research objective of this paper is to evaluate the impact of the VPC baseline speaker anonymization method on emotional speech, in particular:
- **Emotion recognition performance after anonymization**: Compare emotion recognition performance on original speech and on anonymized speech to measure how much emotional information survives anonymization.
- **Performance in different attack scenarios**: Evaluate the robustness and effectiveness of the anonymization system against both ignorant attackers and informed attackers.
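The two attack scenarios can be pictured as different choices of training data for the attacker's emotion recognizer. The sketch below is only an illustration under an assumption that is not spelled out in this summary: an ignorant attacker trains on original speech, while an informed attacker trains on speech processed by the same anonymization system. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackScenario:
    """Hypothetical description of one evaluation condition."""
    name: str
    train_data: str  # data the attacker's emotion recognizer is trained on
    test_data: str   # data the recognizer is applied to


# Assumed mapping of scenarios to data conditions (not taken from the paper's text).
SCENARIOS = [
    AttackScenario("ignorant", train_data="original", test_data="anonymized"),
    AttackScenario("informed", train_data="anonymized", test_data="anonymized"),
]


def is_matched(scenario: AttackScenario) -> bool:
    """True when training and test conditions match, which favors the attacker."""
    return scenario.train_data == scenario.test_data


matched_names = [s.name for s in SCENARIOS if is_matched(s)]
```

Under this assumption, the informed attacker is the stronger one because its recognizer has already seen anonymized speech during training.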
### Method overview
1. **Anonymization framework**: Use the VPC baseline system for speaker anonymization, which is based on x-vectors and voice conversion.
2. **F0 transformation enhancement**: Apply an F0 linear transformation and a random deformation to further modify the fundamental frequency of the anonymized speech.
3. **Attack scenario setting**: Consider both ignorant attackers and informed attackers when evaluating the anonymization system.
4. **Experimental design**: Run experiments on the IEMOCAP dataset to measure the change in emotion recognition and automatic speech recognition (ASR) performance.
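To make the F0 linear transformation in step 2 more concrete, here is a minimal sketch of one common form of it: mean and variance matching in the log-F0 domain. This is an assumption about the general technique, not the paper's exact recipe; the function name, the convention that unvoiced frames are marked with 0, and the target statistics are all hypothetical. The random deformation mentioned in the same step is not covered.

```python
import math
from statistics import mean, pstdev


def linear_f0_transform(f0_hz, target_log_mean, target_log_std):
    """Shift and scale voiced log-F0 values to match target statistics.

    Frames with F0 <= 0 are treated as unvoiced and returned as 0.0.
    """
    voiced_logs = [math.log(f) for f in f0_hz if f > 0]
    if not voiced_logs:
        return [0.0 for _ in f0_hz]

    src_mean = mean(voiced_logs)
    src_std = pstdev(voiced_logs)
    # Avoid division by zero when the source contour is flat.
    scale = target_log_std / src_std if src_std > 0 else 1.0

    converted = []
    for f in f0_hz:
        if f > 0:
            log_f = (math.log(f) - src_mean) * scale + target_log_mean
            converted.append(math.exp(log_f))
        else:
            converted.append(0.0)
    return converted


# Illustrative contour: two voiced frames surrounded by unvoiced frames.
converted = linear_f0_transform([0.0, 100.0, 200.0, 0.0], math.log(150.0), 0.2)
```

A transformation like this only changes the global level and range of the pitch contour. Its shape over time is left intact, which is one plausible reason why F0 modification alone has a limited effect on emotional cues.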
### Main findings
- **Decline in emotion recognition performance**: On anonymized speech, emotion recognition performance is degraded by about 15% relative to original speech, so the baseline does not suppress emotional information, in particular against informed attackers.
- **Decline in ASR performance**: Anonymized speech also degrades automatic speech recognition performance by a similar, slightly smaller amount (about 13%).
- **Limited effect of F0 transformation**: A simple F0 transformation cannot effectively hide emotional information; further research on modifying other parameters (such as duration and energy) is required.
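The percentages above are relative changes, not absolute differences in score. A short sketch of the arithmetic follows; the metric values are purely illustrative and are not results from the paper.

```python
def relative_change(baseline: float, value: float) -> float:
    """Relative change of `value` with respect to `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Hypothetical emotion recognition accuracy on original vs. anonymized speech.
emotion_change = relative_change(0.60, 0.51)  # about -0.15, a 15% relative drop

# Hypothetical ASR word error rate; for an error metric, degradation is an increase.
asr_change = relative_change(0.200, 0.226)  # about +0.13, a 13% relative increase
```

Note the sign convention: for accuracy-like metrics a degradation is negative, while for error rates such as WER a degradation is positive.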
### Conclusion
This study shows that although existing speaker anonymization methods can hide the speaker's identity to a certain extent, they do not remove the emotional information in the speech. Future research needs to explore more effective anonymization techniques that better protect privacy while retaining the useful linguistic information.