Explore the Use of Self-supervised Pre-trained Acoustic Features on Disguised Speech Detection

Jie Quan, Yingchun Yang
DOI: https://doi.org/10.1007/978-3-030-86608-2_53
2021-01-01
Abstract: Disguised voices are increasingly common in daily life, and it has become both more important and more difficult to identify whether an audio file has been disguised. However, research on such detection generally relies on traditional acoustic features and traditional machine learning classifiers. Our experiments show that these methods have two issues: (1) their accuracy needs improvement, and (2) their generalization performance is poor. We considered two ways to address these issues: one through the features and the other through the classifier. We therefore propose Kekaimalu, a CNN-based method. Unlike previous blind detection methods, Kekaimalu takes speaker-independent phonetic characteristics into account during training. To test the proposed approach, we first confirmed that the vq-wav2vec representation carries clear phonetic information. Next, we observed that an LCNN with layer normalization can further improve differentiation. Finally, we merged statistical moments of traditional acoustic features with phonetic characteristics. Extensive experiments demonstrate detection rates higher than 97%.
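The final step in the abstract, merging statistical moments of traditional acoustic features with phonetic characteristics, can be illustrated with a minimal sketch. The abstract does not say which moments or which fusion scheme the authors use, so everything below is an assumption: the choice of mean, standard deviation, skewness and excess kurtosis, the plain concatenation, and the hypothetical names `statistical_moments` and `fuse`. The acoustic frames stand in for something like per-frame cepstral coefficients, and the phonetic embedding stands in for an utterance-level vector derived from a pre-trained representation.

```python
import math
from typing import List, Sequence


def statistical_moments(frames: Sequence[Sequence[float]]) -> List[float]:
    """Pool a (num_frames x num_coeffs) feature matrix over time.

    For each coefficient, returns mean, standard deviation, skewness and
    excess kurtosis (an assumed set of moments, not taken from the paper).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    num_coeffs = len(frames[0])
    pooled: List[float] = []
    for c in range(num_coeffs):
        column = [frame[c] for frame in frames]
        mean = sum(column) / num_frames
        var = sum((x - mean) ** 2 for x in column) / num_frames
        std = math.sqrt(var)
        if std > 0:
            skew = sum(((x - mean) / std) ** 3 for x in column) / num_frames
            kurt = sum(((x - mean) / std) ** 4 for x in column) / num_frames - 3.0
        else:
            # A constant coefficient has no defined shape; fall back to zeros.
            skew, kurt = 0.0, 0.0
        pooled.extend([mean, std, skew, kurt])
    return pooled


def fuse(acoustic_frames: Sequence[Sequence[float]],
         phonetic_embedding: Sequence[float]) -> List[float]:
    """Concatenate pooled acoustic moments with a phonetic embedding.

    Simple concatenation is an assumption made for illustration.
    """
    return statistical_moments(acoustic_frames) + list(phonetic_embedding)
```

Pooling over time gives a fixed-length vector regardless of utterance duration, which is what lets frame-level acoustic features be joined with an utterance-level phonetic vector before classification.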