One-Class Neural Network With Directed Statistics Pooling for Spoofing Speech Detection
Guoyuan Lin, Weiqi Luo, Da Luo, Jiwu Huang
DOI: https://doi.org/10.1109/tifs.2024.3352429
IF: 7.231
2024-02-02
IEEE Transactions on Information Forensics and Security
Abstract: Existing deep learning models for spoofing speech detection often generalize poorly to spoofing attacks not seen during training. Class imbalance compounds this problem by biasing learning toward the seen attack samples. To address these challenges, we present an end-to-end model called One-Class Neural Network with Directed Statistics Pooling (OCNet-DSP). The model incorporates a feature cropping operation to attenuate high-frequency components, reducing the risk of overfitting. Exploiting the time-frequency characteristics of speech signals, we introduce a directed statistics pooling layer that extracts more effective features for distinguishing between the bonafide and spoofing classes. We also propose the Threshold One-class Softmax loss, which mitigates class imbalance by reducing the optimization weight of spoofing samples during training. Extensive comparative results show that the proposed model outperforms all existing single models, achieving an equal error rate of 0.44% and a minimum detection cost function of 0.0145 on the ASVspoof 2019 logical access database. The proposed ensemble version, which accommodates speech inputs of varying lengths in each submodel, maintains state-of-the-art performance among reproducible ensemble models. Ablation experiments and a cross-dataset experiment further validate the design choices and effectiveness of the proposed model.
computer science, theory & methods; engineering, electrical & electronic
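
The abstract reports its headline result as an equal error rate (EER), the operating point at which the false acceptance rate on spoofed speech equals the false rejection rate on bonafide speech. The following is a minimal sketch of how that metric can be computed from detection scores; it is not the authors' evaluation code. The function name `equal_error_rate`, the convention that a higher score means "more likely bonafide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a decision threshold.

    Assumed convention (not taken from the paper): a trial is accepted
    as bonafide when its score is >= the threshold.
    """
    # Candidate thresholds: every observed score, plus one value above
    # the maximum so that "reject everything" is also considered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bonafide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented purely for illustration.
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
print(equal_error_rate(bonafide, spoof))
```

With finitely many trials the two error rates may never be exactly equal, so this sketch returns their average at the threshold where they are closest; evaluation toolkits often interpolate between adjacent thresholds instead.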