EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models

Wenhan Yao, Zedong Xing, Xiarun Chen, Jia Liu, Yongqiang He, Weiping Wen
2024-09-06
Abstract: Deep speech classification tasks, mainly keyword spotting and speaker verification, play a crucial role in speech-based human-computer interaction. Recently, these technologies have been shown to be vulnerable to backdoor attacks. Existing triggers attack speech samples through noisy disruption and component modification. We suggest that speech backdoor attacks can instead strategically focus on emotion, a higher-level subjective perceptual attribute inherent in speech. Furthermore, we propose that emotional voice conversion technology can serve as the speech backdoor attack trigger, and we call the method EmoAttack. Based on this, we conducted attack experiments on two speech classification tasks, showing that EmoAttack has an effective trigger and achieves a remarkable attack success rate and accuracy variance. Additionally, the ablation experiments found that speech with intensive emotion is more suitable to be targeted for attacks.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how emotional voice conversion can be used to mount backdoor attacks on deep speech classification models, specifically keyword spotting and speaker verification. The authors show that these models are exposed to tampering with the emotional attributes of speech, and they propose a new attack, EmoAttack, which implants a backdoor by using emotional voice conversion (EVC) as the trigger.

### Problem Background

In recent years, speech classification tasks such as keyword spotting and speaker verification have played an important role in human-computer interaction. However, these technologies have been shown to be vulnerable to backdoor attacks. Traditional backdoor attacks usually add noise to speech samples or modify speech components, but such triggers are easy to detect. To make the attack stealthier and more effective, the authors propose attacking through emotion, a high-level subjective perceptual attribute.

### Proposed Method

The authors propose EmoAttack, which uses emotional voice conversion to change the emotional attribute of an utterance from one emotion to another while keeping the other speech components unchanged. In this way, the attacker can cause the target model to make wrong predictions without significantly changing the speech content.

### Main Contributions

1. **Reveal New Security Risks**: The authors point out that tampering with the emotional attributes of speech introduces new security risks.
2. **Propose EVC-Triggered Attack Paradigm**: The authors propose EmoAttack, a backdoor attack method based on emotional voice conversion.
3. **Comparison of Attack Effects of Different Emotions**: Experiments show that the greater the emotional difference, the better the attack works. Strong emotions such as anger and happiness are especially likely to succeed.

### Experimental Results

The authors conducted experiments on two speech classification tasks: keyword spotting (KWS) and speaker verification (SV). The results show that EmoAttack achieves a high attack success rate (ASR) with a low number of poisoned samples (PN), and does not significantly affect the model's normal performance. In addition, evaluation with mean opinion score (MOS) and speech emotion recognition (SER) accuracy indicates that the poisoned samples generated by EmoAttack are of high quality and difficult for humans or machines to detect.

In conclusion, this paper explores how emotional voice conversion can be used to conduct stealthy and effective backdoor attacks on deep speech classification models, and it reveals the security vulnerabilities of these models when emotional attributes are tampered with.
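
To make the poisoning step and the ASR metric described above concrete, here is a minimal Python sketch. It is not the authors' implementation. The names `placeholder_trigger`, `poison_dataset`, and `attack_success_rate` are hypothetical, and `poison_number` is one reading of the summary's "PN". The trigger is a simple gain change standing in for an EVC model, which a real attack would call instead, and the ASR definition used here (the fraction of triggered non-target samples classified as the target label) is the common one, assumed rather than taken from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]
Sample = Tuple[Waveform, int]  # (audio samples, class label)


def placeholder_trigger(waveform: Waveform) -> Waveform:
    """Hypothetical stand-in for an EVC model.

    A real attack would convert the utterance's emotion here; this
    placeholder only scales the amplitude so the sketch stays runnable.
    """
    return [1.5 * x for x in waveform]


def poison_dataset(
    dataset: Sequence[Sample],
    trigger: Callable[[Waveform], Waveform],
    target_label: int,
    poison_number: int,
    seed: int = 0,
) -> Tuple[List[Sample], List[int]]:
    """Apply the trigger to `poison_number` samples and relabel them.

    Only samples whose label differs from `target_label` are eligible.
    Returns the poisoned copy of the dataset and the poisoned indices.
    """
    rng = random.Random(seed)
    candidates = [i for i, (_, y) in enumerate(dataset) if y != target_label]
    chosen = set(rng.sample(candidates, min(poison_number, len(candidates))))
    poisoned = [
        (trigger(w), target_label) if i in chosen else (list(w), y)
        for i, (w, y) in enumerate(dataset)
    ]
    return poisoned, sorted(chosen)


def attack_success_rate(
    predict: Callable[[Waveform], int],
    test_set: Sequence[Sample],
    trigger: Callable[[Waveform], Waveform],
    target_label: int,
) -> float:
    """Fraction of triggered non-target samples predicted as the target."""
    victims = [w for w, y in test_set if y != target_label]
    if not victims:
        return 0.0
    hits = sum(1 for w in victims if predict(trigger(w)) == target_label)
    return hits / len(victims)
```

In use, a model would be trained on the output of `poison_dataset`, and `attack_success_rate` would then be computed on a held-out test set with that model's prediction function passed as `predict`. Clean-data accuracy should be measured separately to check that normal performance is preserved.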