Spoofing Speaker Verification Systems with Deep Multi-speaker Text-to-speech Synthesis

Mingrui Yuan, Zhiyao Duan
DOI: https://doi.org/10.48550/arXiv.1910.13054
2019-10-29
Abstract: This paper proposes a deep multi-speaker text-to-speech (TTS) model for spoofing speaker verification (SV) systems. The proposed model employs one network to synthesize time-downsampled mel-spectrograms from text input and another network to convert them to linear-frequency spectrograms, which are further converted to the time domain using the Griffin-Lim algorithm. Both networks are trained separately under the generative adversarial networks (GAN) framework. Spoofing experiments on two state-of-the-art SV systems (i-vectors and Google's GE2E) show that the proposed system can successfully spoof these systems with a high success rate. Spoofing experiments on anti-spoofing systems (i.e., binary classifiers for discriminating real and synthetic speech) also show a high spoof success rate when such anti-spoofing systems' structures are exposed to the proposed TTS system.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper examines how vulnerable speaker verification (SV) systems are to spoofing attacks. Specifically, it proposes a deep multi-speaker text-to-speech (TTS) model for spoofing current state-of-the-art SV systems. The synthetic speech it generates spoofs these systems with a high success rate, which exposes a significant vulnerability.

### Main Research Questions

1. **Feasibility of spoofing SV systems**: Can synthetic speech generated with deep learning techniques spoof existing SV systems, in particular those based on i-vectors and Google's GE2E?
2. **Evaluation of spoofing effectiveness**: What spoof success rate does the proposed TTS model achieve under different conditions, against both SV systems and anti-spoofing systems?
3. **Impact of disclosing the system structure**: How vulnerable are anti-spoofing systems to TTS spoofing attacks when their structure is disclosed?

### Specific Research Contents

- **Model architecture**: The paper proposes a two-stage TTS model. In the first stage, the Text2Mel network generates time-downsampled mel-spectrograms from text input. In the second stage, the Spectrogram Super-resolution Network (SSRN) converts the mel-spectrograms into linear-frequency spectrograms, which are then converted into a time-domain signal with the Griffin-Lim algorithm.
- **Training method**: The two sub-networks are trained separately under the generative adversarial network (GAN) framework to improve the quality of the synthetic speech and the spoof success rate.
- **Experimental setup**: The TTS model is trained on the VCTK corpus, and spoofing experiments are conducted on two state-of-the-art SV systems (i-vectors and GE2E). The results show that the proposed TTS model achieves a high spoof success rate under black-box conditions.
- **Anti-spoofing system evaluation**: The proposed TTS model is further evaluated against anti-spoofing systems, especially under white-box conditions (i.e., when the structure of the anti-spoofing system is disclosed).

### Main Contributions

1. A multi-speaker TTS spoofing system based on Wasserstein GAN.
2. Comprehensive experiments showing that this system achieves a high spoof success rate against two state-of-the-art SV systems under black-box conditions.
3. An analysis of how TTS spoofing affects anti-spoofing systems, especially the threat that arises when the structures of these systems are not kept secret.

### Conclusion

The paper experimentally verifies that the proposed TTS model can spoof SV systems, and it highlights the vulnerability of current SV systems and anti-spoofing systems to high-quality synthetic speech attacks. Future work will focus on using reinforcement learning to improve spoofing of black-box SV systems and on designing stronger anti-spoofing systems that resist TTS attacks.
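
The spoof success rate used throughout the experiments can be illustrated with a minimal sketch. It assumes a simplified SV back end that accepts an utterance when the cosine similarity between its embedding and the enrolled speaker's embedding meets a fixed threshold. The function names, the toy three-dimensional embeddings, and the threshold value are all hypothetical and are not taken from the paper, whose i-vector and GE2E systems may score trials differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def spoof_success_rate(enrolled, synthetic_embeddings, threshold):
    """Fraction of synthetic utterances the (assumed) SV system accepts.

    An utterance counts as a successful spoof when its similarity to the
    enrolled speaker embedding is at least `threshold`.
    """
    if not synthetic_embeddings:
        raise ValueError("need at least one synthetic utterance")
    accepted = sum(
        1
        for emb in synthetic_embeddings
        if cosine_similarity(enrolled, emb) >= threshold
    )
    return accepted / len(synthetic_embeddings)


# Toy data (hypothetical): one enrolled speaker, four synthetic utterances.
enrolled = [1.0, 0.0, 0.0]
synthetic = [
    [0.9, 0.1, 0.0],  # close to the enrolled speaker
    [0.0, 1.0, 0.0],  # orthogonal, should be rejected
    [0.8, 0.0, 0.6],  # similarity approximately 0.8
    [1.0, 0.0, 0.0],  # identical direction
]

rate = spoof_success_rate(enrolled, synthetic, threshold=0.75)
print(rate)  # 0.75 (three of the four utterances are accepted)
```

In a real evaluation the threshold would typically be fixed in advance on genuine trials (for example at the equal error rate operating point), so that the success rate reflects how the deployed system would behave.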