Robust Spoof Speech Detection Based on Multi-Scale Feature Aggregation and Dynamic Convolution.

Haochen Wu, Jie Zhang, Zhentao Zhang, Wenting Zhao, Bin Gu, Wu Guo
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446612
2024-01-01
Abstract: Spoof speech detection (SSD) can help protect an automatic speaker recognition system against malicious attacks. However, the spoof utterances generated by different text-to-speech and voice conversion algorithms are highly diverse, so an SSD system generalizes poorly to unseen spoofing attacks. To address this problem, we integrate multi-scale feature aggregation (MFA) and dynamic convolution operations into the anti-spoofing framework to detect different local and global artifacts of unseen spoofing attacks. The proposed framework mainly contains eight stacked MFA blocks; in each block, the light-Res2Net module captures multi-scale features, and the convolutional kernel is dynamically generated from the local and global statistical information of the inputs. Results on two benchmark datasets (i.e., ADD 2023 Fake Audio Detection and ASVspoof 2021 Logical Access) show the superiority of the proposed method over existing state-of-the-art systems.
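The abstract does not spell out how the kernel is generated, so the following is only a minimal sketch of the general idea behind dynamic convolution, not the authors' implementation. It assumes a common formulation in which several candidate kernels are mixed by softmax attention weights computed from input statistics. The statistics used here (global mean and standard deviation), the linear gate, and all names (`dynamic_kernel`, `gate_weights`, `candidate_kernels`) are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def input_statistics(x):
    """Global mean and standard deviation of a 1-D sequence (assumed statistics)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [mean, math.sqrt(var)]


def dynamic_kernel(x, candidate_kernels, gate_weights):
    """Mix candidate kernels with attention weights that depend on the input.

    gate_weights holds one row per candidate kernel; each row maps the
    statistics [mean, std] to a scalar score (a hypothetical linear gate).
    """
    stats = input_statistics(x)
    scores = [sum(w * s for w, s in zip(row, stats)) for row in gate_weights]
    attn = softmax(scores)
    width = len(candidate_kernels[0])
    kernel = [
        sum(a * k[i] for a, k in zip(attn, candidate_kernels))
        for i in range(width)
    ]
    return kernel, attn


def conv1d_same(x, kernel):
    """Zero-padded 1-D cross-correlation; assumes an odd kernel width."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(x))
    ]


def dynamic_conv1d(x, candidate_kernels, gate_weights):
    """Apply a convolution whose kernel is generated from the input itself."""
    kernel, _ = dynamic_kernel(x, candidate_kernels, gate_weights)
    return conv1d_same(x, kernel)
```

In this sketch, two inputs with different statistics are filtered by different kernels even though the candidate kernels and gate parameters are shared, which is the property that lets a dynamic convolution adapt to varied inputs. In a trained model the candidate kernels and the gate would be learned, and the operation would run over multi-channel feature maps rather than a single list of floats.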