An HASM-Assisted Voice Disguise Scheme for Emotion Recognition of IoT-enabled Voice Interface
Wenjia Chen, Wenjuan Tang, Yan Meng, Yaoxue Zhang
DOI: https://doi.org/10.1109/jiot.2024.3406771
IF: 10.6
2024-01-01
IEEE Internet of Things Journal
Abstract: Voice-enabled devices are becoming increasingly prevalent in the Internet of Things (IoT). Speech emotion recognition (SER), a key technology in modern voice-assisted applications, holds tremendous potential for delivering convenient and intelligent services. Unfortunately, SER service providers may not only analyze the emotions in users’ speech but also examine its content and the users’ voice characteristics, which poses additional privacy risks. Existing real-time voice disguise methods, such as pitch scaling and vocal tract length normalization (VTLN), protect voiceprint privacy but significantly degrade SER accuracy. In this paper, we propose a harmonic amplitude spectrum mapping (HASM)-assisted voice disguise scheme, which disguises the voice to preserve voiceprint privacy while retaining the emotional information in the voice. Specifically, we first conduct an in-depth analysis of the speech features that reflect emotion and find that restoring harmonic amplitude spectrum features after altering the speaker’s voice is crucial for recovering the emotion in speech. Based on this finding, we preprocess the original speech signals with pitch scaling and design an HASM-assisted disguise scheme, based on a mathematical formulation, to restore the emotions. The scheme is validated on the Berlin Emotional Speech Database, the LibriSpeech dataset, and the VCTK dataset. At voiceprint privacy protection levels of 81.86%, 85.42%, and 91.15% on the LibriSpeech dataset and 96.83%, 98.25%, and 98.41% on the VCTK dataset, the acoustic feature-based SER accuracy on disguised speech decreases by only 4.19%, 6.21%, and 9.87%, and the end-to-end SER accuracy decreases by only 3.69%, 7.37%, and 8.86%, respectively, which is superior to other voice disguise methods.
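
The two signal-processing notions named in the abstract, pitch scaling and the harmonic amplitude spectrum, can be illustrated with a minimal NumPy sketch. This is not the paper's HASM mapping, which the abstract does not specify. It assumes a synthetic three-harmonic tone in place of real speech, a known fundamental frequency, and naive resampling as the pitch-scaling method; the function names `harmonic_amplitudes` and `pitch_scale_by_resampling` are hypothetical.

```python
import numpy as np

def harmonic_amplitudes(signal, sample_rate, f0, n_harmonics):
    """Sample the magnitude spectrum at the first n_harmonics multiples of f0."""
    window = np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    amps = []
    for k in range(1, n_harmonics + 1):
        idx = int(np.argmin(np.abs(freqs - k * f0)))
        # Take the local peak in a small neighbourhood of the expected bin.
        lo, hi = max(idx - 2, 0), min(idx + 3, len(spectrum))
        amps.append(spectrum[lo:hi].max())
    return np.array(amps)

def pitch_scale_by_resampling(signal, factor):
    """Naive pitch scaling (assumption): resample with linear interpolation so
    every frequency is multiplied by `factor`. This also shortens or lengthens
    the signal, which practical voice-disguise methods avoid."""
    n_out = int(round(len(signal) / factor))
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)

# Synthetic "voiced" frame: fundamental at 200 Hz plus two weaker harmonics.
sample_rate = 16000
f0 = 200.0
t = np.arange(sample_rate) / sample_rate  # 1 second
original = sum(a * np.sin(2 * np.pi * k * f0 * t)
               for k, a in enumerate([1.0, 0.5, 0.25], start=1))

factor = 1.25  # raise the pitch by 25%
scaled = pitch_scale_by_resampling(original, factor)

amps_original = harmonic_amplitudes(original, sample_rate, f0, 3)
amps_scaled = harmonic_amplitudes(scaled, sample_rate, f0 * factor, 3)
amps_scaled_at_old_f0 = harmonic_amplitudes(scaled, sample_rate, f0, 3)

print("relative harmonic amplitudes, original:", amps_original / amps_original[0])
print("relative harmonic amplitudes, scaled:  ", amps_scaled / amps_scaled[0])
print("scaled signal sampled at the old f0:   ", amps_scaled_at_old_f0)
```

In this toy case the harmonics move to multiples of the new fundamental, so the amplitude found at any fixed frequency changes after pitch scaling. For real speech, any relationship between such changes and SER accuracy is the paper's claim and is not demonstrated by this sketch.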