Synthetic Speech Detection Based on the Temporal Consistency of Speaker Features

Yuxiang Zhang, Zhuo Li, Jingze Lu, Wenchao Wang, Pengyuan Zhang
DOI: https://doi.org/10.1109/lsp.2024.3381890
2024-04-05
IEEE Signal Processing Letters
Abstract: Current synthetic speech detection (SSD) methods perform well on specific datasets but require improvement in interpretability and robustness. One possible reason is the lack of interpretability analysis of synthetic speech defects. In this paper, the flaws in the temporal consistency (TC) of speaker features inherent in the speech synthesis process are analyzed. Differences in the TC of intra-utterance speaker features arise due to limited control over speaker features during speech synthesis. The speech generated by text-to-speech algorithms exhibits higher TC, while the speech generated by voice conversion algorithms yields slightly lower TC compared to bonafide speech. Based on this finding, a new SSD method based on the TC of speaker features is proposed. Modeling the TC of intra-utterance speaker features extracted by a pre-trained ASV system can be used for SSD. The proposed method achieves equal error rates of 0.84%, 3.93%, 12.98% and 24.66% on the ASVspoof 2019 LA, 2021 LA, 2021 DF and IntheWild evaluation datasets, respectively, demonstrating strong interpretability and robustness.
engineering, electrical & electronic
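To make the notion of temporal consistency in the abstract concrete, the following is a minimal sketch of one way such a quantity could be measured. It is not the authors' model: the abstract does not specify how TC is computed or modeled, so the measure here (mean cosine similarity between consecutive segment-level speaker embeddings) and the function name `temporal_consistency` are assumptions for illustration. The embeddings are random stand-ins for the output of a pre-trained ASV system.

```python
import numpy as np


def temporal_consistency(embeddings: np.ndarray) -> float:
    """Mean cosine similarity between consecutive segment embeddings.

    `embeddings` has shape (T, D): T segments of one utterance, each with a
    D-dimensional speaker embedding. This is an illustrative measure only,
    not the TC model used in the paper.
    """
    emb = np.asarray(embeddings, dtype=float)
    if emb.ndim != 2 or emb.shape[0] < 2:
        raise ValueError("need at least two segment embeddings of shape (T, D)")
    # Normalize each segment embedding to unit length (guard against zeros).
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    # Cosine similarity of each segment with the next one.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    return float(sims.mean())


# Hypothetical stand-in data: a fixed "speaker" vector plus per-segment noise.
rng = np.random.default_rng(0)
base = rng.normal(size=192)
stable = base + 0.05 * rng.normal(size=(10, 192))   # embeddings barely change
drifting = base + 1.0 * rng.normal(size=(10, 192))  # embeddings vary a lot

print(temporal_consistency(stable), temporal_consistency(drifting))
```

Under this toy measure, an utterance whose segment embeddings barely change scores close to 1, and one whose embeddings drift scores lower. A detector in the spirit of the abstract would compare such statistics against those of bonafide speech, which the paper reports as lying between text-to-speech output (higher TC) and voice conversion output (slightly lower TC).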