Abstract: Neural networks are commonly known to be vulnerable to adversarial attacks mounted through subtle perturbations of the input data. Recent developments in voice-privacy protection have shown positive use cases of the same technique: concealing a speaker's voice attributes with an additive perturbation signal generated by an adversarial network. This paper examines the reversibility property, where an entity generating the adversarial perturbations (e.g., the speaker him/herself) is authorized to remove them and restore the original speech. A similar technique could also be used by an investigator to deanonymize voice-protected speech and restore criminals' identities in security and forensic analysis. In this setting, the perturbation generative module is assumed to be known in the removal process. To this end, joint training of the perturbation generation and removal modules is proposed. Experimental results on the LibriSpeech dataset demonstrate that the subtle perturbations added to the original speech can be predicted from the anonymized speech while still achieving the goal of privacy protection. By removing these perturbations from the anonymized sample, the original speech can be restored. Audio samples can be found at https://voiceprivacy.github.io/Perturbation-Generation-Removal/.


### What problems does this paper attempt to solve?

This paper mainly explores the reversibility of voice privacy protection through generating and removing adversarial perturbations. Specifically, it attempts to solve the following two key problems:

1. **How to ensure the recovery of the original voice while protecting voice privacy**:
   - The paper proposes a method so that the party generating adversarial perturbations can remove these perturbations when necessary, thus recovering the original voice. This feature is crucial for legally authorized entities (such as law-enforcement agencies) to recover speaker information in the voice in specific situations.

2. **Limitations of existing purification techniques**:
   - Existing adversarial perturbation purification techniques (such as adding noise, quantization, and median smoothing) have obvious deficiencies in recovering the original voice. These methods usually introduce residual distortion in the purified voice and perform poorly in downstream tasks such as automatic speech recognition (ASR) and pitch extraction. Moreover, these methods are carried out without knowledge of the perturbation generation process, so they cannot fully recover the original voice.

To solve these problems, the paper proposes a joint training framework in which the perturbation generation module and the perturbation removal module are trained simultaneously. In this way, the removal module can better model the perturbation generation process, thus predicting and removing perturbations more effectively and finally recovering the original voice.

### Formula summary

- **Adversarial perturbation generation formula**:
  \[ x' = x + \epsilon \cdot (n \odot m) \]
  where \(x\) is the original voice, \(x'\) is the adversarial voice, \(\epsilon\) is the attack intensity, \(n\) is the noise vector, \(m\) is the mask vector, and \(\odot\) represents element-wise multiplication.

- **Loss function**:
  - Angular loss:
    \[ L_{\text{angular}} = \frac{z^{T} z'}{\|z\|_{2}\,\|z'\|_{2}} \]
    where \(z\) and \(z'\) are the speaker embedding vectors of the original voice and the adversarial voice respectively.
  - Voice quality loss:
    \[ L_{\text{quality}} = (1 - \alpha)\,\|x' - x\|^{2} + \alpha\,\|m\|^{2} \]
  - Total loss function:
    \[ L_{\text{SSED}} = (1 - \beta)\,L_{\text{angular}} + \beta\,L_{\text{quality}} \]

- **Joint training loss function**:
  \[ L = (1 - \theta)\,L_{\text{SSED}} + \theta\,L_{\text{rpt}} \]
  where \(L_{\text{rpt}}\) is the reverse perturbation loss, defined as:
  \[ L_{\text{rpt}} = (1 - \gamma)\,L_{\text{mask}} + \gamma\,L_{\text{noise}} \]
  Specifically:
  \[ L_{\text{noise}} = \|n + n'\|^{2}, \quad L_{\text{mask}} = \|m - m'\|^{2} \]
  Here \(n'\) and \(m'\) are the noise and mask predicted by the removal module. Because of the plus sign in \(L_{\text{noise}}\), minimizing it drives \(n'\) toward \(-n\), the reverse of the generated noise.

Through these formulas and methods, the paper shows how the original voice, with its quality and content, can be recovered when needed while voice privacy is still protected.
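The formulas above can be traced numerically with a small sketch. This is not the paper's implementation: it uses plain Python lists in place of waveforms and network outputs, and all function names (`perturb`, `restore`, `ssed_loss`, `rpt_loss`, `joint_loss`) are hypothetical. The restoration step \(\hat{x} = x' + \epsilon \cdot (n' \odot m')\) is an assumption inferred from the sign convention in \(L_{\text{noise}}\); the summary itself does not state it.

```python
import math
import random


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(vi * vi for vi in v)


def perturb(x, n, m, eps):
    """Generation: x' = x + eps * (n elementwise-times m)."""
    return [xi + eps * ni * mi for xi, ni, mi in zip(x, n, m)]


def restore(x_adv, n_rev, m_hat, eps):
    """Assumed removal step: add the predicted reverse perturbation.

    n_rev is expected to approximate -n and m_hat to approximate m,
    which is what minimizing L_noise and L_mask encourages.
    """
    return [xi + eps * ni * mi for xi, ni, mi in zip(x_adv, n_rev, m_hat)]


def angular_loss(z, z_adv):
    """Cosine similarity between the original and adversarial embeddings."""
    dot = sum(a * b for a, b in zip(z, z_adv))
    return dot / (math.sqrt(sq_norm(z)) * math.sqrt(sq_norm(z_adv)))


def ssed_loss(x, x_adv, m, z, z_adv, alpha, beta):
    """L_SSED = (1 - beta) * L_angular + beta * L_quality."""
    diff = [a - b for a, b in zip(x_adv, x)]
    quality = (1 - alpha) * sq_norm(diff) + alpha * sq_norm(m)
    return (1 - beta) * angular_loss(z, z_adv) + beta * quality


def rpt_loss(n, m, n_rev, m_hat, gamma):
    """L_rpt = (1 - gamma) * L_mask + gamma * L_noise."""
    l_noise = sq_norm([a + b for a, b in zip(n, n_rev)])
    l_mask = sq_norm([a - b for a, b in zip(m, m_hat)])
    return (1 - gamma) * l_mask + gamma * l_noise


def joint_loss(l_ssed, l_rpt, theta):
    """L = (1 - theta) * L_SSED + theta * L_rpt."""
    return (1 - theta) * l_ssed + theta * l_rpt


if __name__ == "__main__":
    random.seed(0)
    x = [random.uniform(-1.0, 1.0) for _ in range(8)]   # toy "waveform"
    n = [random.gauss(0.0, 1.0) for _ in range(8)]      # toy noise vector
    m = [random.random() for _ in range(8)]             # toy mask in [0, 1)
    eps = 0.01

    x_adv = perturb(x, n, m, eps)
    # An idealized removal module that predicts the reverse noise exactly.
    x_hat = restore(x_adv, [-ni for ni in n], m, eps)
    print("max restoration error:", max(abs(a - b) for a, b in zip(x, x_hat)))
```

With an idealized predictor (\(n' = -n\), \(m' = m\)) the reverse perturbation loss is zero and the restored signal matches the original up to floating-point rounding. In the paper's setting both modules are neural networks trained jointly, so the prediction is only approximate.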

On the Generation and Removal of Speaker Adversarial Perturbation for Voice-Privacy Protection
