Fast and Lightweight Voice Replay Attack Detection Via Time-frequency Spectrum Difference

Ruiwen He, Yushi Cheng, Zhicong Zheng, Xiaoyu Ji, Wenyuan Xu
DOI: https://doi.org/10.1109/jiot.2024.3406962
IF: 10.6
2024-01-01
IEEE Internet of Things Journal
Abstract: Because voice and voice interfaces are inherently open, an adversary can spoof a voice recognition system by replaying pre-recorded voice commands from legitimate users, an attack known as the voice replay attack. Existing detection methods mainly either rely on extra hardware to determine the sound source or require excessive computing resources to train a classifier on abundant acoustic features. In this paper, we propose Anti-Replay, a fast and lightweight detection system for voice replay attacks. To overcome the challenge of redundant classification features and complex computation, we first investigate the time-frequency spectrum difference between genuine human voice and replayed audio, which is caused by the non-linear distortion of the attacker’s microphones and speakers. We then design 77 features of 5 types spanning the time and frequency domains and propose a convolutional neural network classifier, SE-ResNet50, for attack detection. Evaluations on the ASVspoof2017, ASVspoof2019, and ASVspoof2021 datasets demonstrate that Anti-Replay achieves an average equal error rate (EER) of 1.36% across the three datasets. Meanwhile, compared with the baseline model CQCC-GMM and the state-of-the-art method Res2Net, Anti-Replay decreases training time by 52.3% and 90.2% and model size by 83.5% and 99.9%, respectively. We have also confirmed that our system is effective in detecting the adaptive replay attack.
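For readers unfamiliar with the equal error rate (EER) reported in the abstract, the following is a minimal sketch of how such a metric can be approximated from detector scores. It is not the authors' evaluation code. The function name `equal_error_rate`, the toy scores, and the convention that a higher score means "more likely genuine" are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(genuine_scores, replay_scores):
    """Approximate the EER from two sets of detector scores.

    Assumed convention (hypothetical, not from the paper): a sample is
    accepted as genuine when its score is >= the decision threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    replay = np.asarray(replay_scores, dtype=float)

    # Candidate thresholds: every distinct score that was observed.
    thresholds = np.unique(np.concatenate([genuine, replay]))

    # False-acceptance rate: fraction of replayed samples accepted.
    far = np.array([(replay >= t).mean() for t in thresholds])
    # False-rejection rate: fraction of genuine samples rejected.
    frr = np.array([(genuine < t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to equal.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    # Toy scores only; these are not results from any ASVspoof dataset.
    genuine = [0.9, 0.8, 0.3, 0.6]
    replay = [0.1, 0.2, 0.7, 0.4]
    print(f"EER = {equal_error_rate(genuine, replay):.2%}")
```

With only a handful of scores the two error curves may not cross exactly, so the sketch averages the two rates at the closest threshold. Official challenge scoring tools may interpolate differently.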