Investigating Raw Wave Deep Neural Networks for End-to-End Speaker Spoofing Detection
Heinrich Dinkel, Yanmin Qian, Kai Yu
DOI: https://doi.org/10.1109/taslp.2018.2851155
2018-01-01
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Abstract: Recent advances in automatic speaker verification (ASV) have led to an increased interest in securing these systems for real-world applications. Malicious spoofing attempts against ASV systems can lead to serious security breaches. A spoofing attack within the context of ASV is a condition in which a (potentially harmful) person successfully masquerades as another person already known to the ASV system by falsifying or manipulating data. While most previous work focuses on enhanced, spoof-aware features, end-to-end models can be a potential alternative. In this paper, we investigate the training of raw wave front-ends for deep convolutional, long short-term memory (LSTM), and vanilla neural networks, which are analyzed for their suitability toward spoofing detection with regard to the influence of frame size, number of output neurons, and sequence length. A joint convolutional LSTM neural network (CLDNN) is proposed, which outperforms previous attempts on the BTAS2016 dataset (half total error rate, HTER, reduced from 0.82% to 0.19%), making it the current state-of-the-art model for the dataset. We show that end-to-end approaches are appropriate for the important replay detection task and that the proposed model is capable of distinguishing device-invariant spoofing attempts. On the ASVspoof2015 dataset, the end-to-end solution achieves an equal error rate (EER) of 0.00% for the S1-S9 conditions. We show that the end-to-end approach based on a raw waveform input can outperform common cepstral features without the use of context-dependent frame extensions. In addition, a cross-database (domain mismatch) scenario is evaluated, which shows that the proposed CLDNN model trained on the BTAS2016 dataset achieves an EER of 25.7% on the ASVspoof2015 dataset.
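The two metrics quoted in the abstract, HTER and EER, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that higher scores mean "more likely genuine", that a trial is accepted when its score is at or above the threshold, and that the EER is approximated by scanning the observed scores as candidate thresholds. The function names and the example scores are hypothetical.

```python
def far_frr(genuine, spoof, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted as genuine.

    FAR: fraction of spoofed trials that are wrongly accepted.
    FRR: fraction of genuine trials that are wrongly rejected.
    """
    far = sum(s >= threshold for s in spoof) / len(spoof)
    frr = sum(g < threshold for g in genuine) / len(genuine)
    return far, frr


def hter(genuine, spoof, threshold):
    """Half total error rate: the mean of FAR and FRR at a fixed threshold."""
    far, frr = far_frr(genuine, spoof, threshold)
    return 0.5 * (far + frr)


def eer(genuine, spoof):
    """Approximate the equal error rate and the threshold where it occurs.

    Scans every observed score as a candidate threshold and keeps the one
    where FAR and FRR are closest; the reported rate is their mean there.
    """
    best = None
    for t in sorted(set(genuine) | set(spoof)):
        far, frr = far_frr(genuine, spoof, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, 0.5 * (far + frr), t)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    genuine_scores = [0.9, 0.8, 0.7, 0.4]
    spoof_scores = [0.1, 0.2, 0.3, 0.6]
    rate, threshold = eer(genuine_scores, spoof_scores)
    print("EER:", rate, "at threshold", threshold)
    print("HTER at 0.5:", hter(genuine_scores, spoof_scores, 0.5))
```

A common convention, assumed here rather than taken from the abstract, is to choose the threshold on a development set (for example at its EER point) and then report HTER on the test set at that fixed threshold, whereas EER is computed directly on the evaluated set.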