On the Generation and Removal of Speaker Adversarial Perturbation for Voice-Privacy Protection

Chenyang Guo, Liping Chen, Zhuhai Li, Kong Aik Lee, Zhen-Hua Ling, Wu Guo
2024-12-12
Abstract: Neural networks are commonly known to be vulnerable to adversarial attacks mounted through subtle perturbations of the input data. Recent developments in voice-privacy protection have shown positive use cases of the same technique: concealing a speaker's voice attributes with an additive perturbation signal generated by an adversarial network. This paper examines the reversibility property, where the entity generating the adversarial perturbations (e.g., the speaker him/herself) is authorized to remove them and restore the original speech. A similar technique could also be used by an investigator to deanonymize voice-protected speech and restore criminals' identities in security and forensic analysis. In this setting, the perturbation generation module is assumed to be known in the removal process. To this end, joint training of the perturbation generation and removal modules is proposed. Experimental results on the LibriSpeech dataset demonstrate that the subtle perturbations added to the original speech can be predicted from the anonymized speech while the goal of privacy protection is still achieved. By removing these perturbations from the anonymized sample, the original speech can be restored. Audio samples can be found at https://voiceprivacy.github.io/Perturbation-Generation-Removal/.
Sound, Machine Learning, Audio and Speech Processing
### What problems does this paper attempt to solve?

This paper explores the reversibility of voice-privacy protection based on generating and removing adversarial perturbations. Specifically, it addresses two key problems:

1. **How to ensure the original voice can be recovered while voice privacy is protected**:
   - The paper proposes a method by which the party that generates the adversarial perturbations can remove them when necessary, thus recovering the original voice. This is crucial for legally authorized entities (such as law-enforcement agencies) that need to recover speaker information from the voice in specific situations.
2. **Limitations of existing purification techniques**:
   - Existing adversarial perturbation purification techniques (such as adding noise, quantization, and median smoothing) have obvious deficiencies in recovering the original voice. They usually leave residual distortion in the purified voice and perform poorly in downstream tasks such as automatic speech recognition (ASR) and pitch extraction. Moreover, they operate without knowledge of the perturbation generation process, so they cannot fully recover the original voice.

To solve these problems, the paper proposes a joint training framework in which the perturbation generation module and the perturbation removal module are trained simultaneously. In this way, the removal module can exploit knowledge of the perturbation generation process, predicting and removing perturbations more effectively and recovering the original voice.

### Formula summary

- **Adversarial perturbation generation formula**:
  \[ x' = x + \epsilon \cdot (n \odot m) \]
  where \(x\) is the original voice, \(x'\) is the adversarial voice, \(\epsilon\) is the attack intensity, \(n\) is the noise vector, \(m\) is the mask vector, and \(\odot\) represents element-wise multiplication.
- **Loss function**:
  - Angular loss:
    \[ L_{\text{angular}} = \frac{z^{T} z'}{\|z\|_{2}\,\|z'\|_{2}} \]
    where \(z\) and \(z'\) are the speaker embedding vectors of the original voice and the adversarial voice, respectively.
  - Voice quality loss:
    \[ L_{\text{quality}} = (1 - \alpha)\,\|x' - x\|^{2} + \alpha\,\|m\|^{2} \]
  - Total loss function:
    \[ L_{\text{SSED}} = (1 - \beta)\,L_{\text{angular}} + \beta\,L_{\text{quality}} \]
- **Joint training loss function**:
  \[ L = (1 - \theta)\,L_{\text{SSED}} + \theta\,L_{\text{rpt}} \]
  where \(L_{\text{rpt}}\) is the reverse perturbation loss, defined as:
  \[ L_{\text{rpt}} = (1 - \gamma)\,L_{\text{mask}} + \gamma\,L_{\text{noise}} \]
  Specifically:
  \[ L_{\text{noise}} = \|n + n'\|^{2}, \quad L_{\text{mask}} = \|m - m'\|^{2} \]

With these formulas and methods, the paper shows how the original voice, including its quality and content, can be recovered when needed while voice privacy is still protected.
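To make the formulas concrete, the following is a minimal NumPy sketch of the perturbation step and the loss terms listed above. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the noise, mask, and speaker embeddings are random arrays standing in for the outputs of trained networks, and the weights passed in are placeholders. The summary does not define \(n'\) and \(m'\); here they are assumed to be the noise and mask predicted by the removal module, and the `restore` step (adding \(\epsilon \cdot (n' \odot m')\) to the adversarial voice) is an inference from the form of \(L_{\text{noise}} = \|n + n'\|^{2}\), which is zero when \(n' = -n\).

```python
import numpy as np


def perturb(x, n, m, eps):
    """Adversarial voice: x' = x + eps * (n ⊙ m)."""
    return x + eps * (n * m)


def angular_loss(z, z_adv):
    """Cosine similarity between the original and adversarial embeddings."""
    return float(z @ z_adv / (np.linalg.norm(z) * np.linalg.norm(z_adv)))


def quality_loss(x, x_adv, m, alpha):
    """(1 - alpha) * ||x' - x||^2 + alpha * ||m||^2."""
    return float((1 - alpha) * np.sum((x_adv - x) ** 2) + alpha * np.sum(m ** 2))


def ssed_loss(l_angular, l_quality, beta):
    """(1 - beta) * L_angular + beta * L_quality."""
    return (1 - beta) * l_angular + beta * l_quality


def reverse_perturbation_loss(n, m, n_pred, m_pred, gamma):
    """(1 - gamma) * ||m - m'||^2 + gamma * ||n + n'||^2."""
    l_noise = np.sum((n + n_pred) ** 2)
    l_mask = np.sum((m - m_pred) ** 2)
    return float((1 - gamma) * l_mask + gamma * l_noise)


def joint_loss(l_ssed, l_rpt, theta):
    """(1 - theta) * L_SSED + theta * L_rpt."""
    return (1 - theta) * l_ssed + theta * l_rpt


def restore(x_adv, n_pred, m_pred, eps):
    """ASSUMPTION: the predicted reverse perturbation is added back to x'."""
    return x_adv + eps * (n_pred * m_pred)


# Toy data standing in for a waveform, generator outputs, and embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)       # original voice (toy)
n = rng.standard_normal(16)       # noise vector (toy)
m = rng.uniform(0.0, 1.0, 16)     # mask vector (toy)
z = rng.standard_normal(8)        # embedding of x (toy)
z_adv = rng.standard_normal(8)    # embedding of x' (toy)
eps = 0.01                        # attack intensity (placeholder value)

x_adv = perturb(x, n, m, eps)

# An ideal removal module would predict n' = -n and m' = m.
n_pred, m_pred = -n, m
x_rec = restore(x_adv, n_pred, m_pred, eps)

l_ssed = ssed_loss(angular_loss(z, z_adv), quality_loss(x, x_adv, m, 0.5), 0.5)
l_rpt = reverse_perturbation_loss(n, m, n_pred, m_pred, 0.5)
total = joint_loss(l_ssed, l_rpt, 0.5)
```

With the ideal predictions used here, the reverse perturbation loss is zero and `x_rec` matches `x` up to floating-point error. In actual training the predictions would come from the removal network, and the joint loss would be minimized over both modules.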