A Comparative Study on Physical and Perceptual Features for Deepfake Audio Detection

Menglu Li, Yasaman Ahmadiadli, Xiao-Ping Zhang
DOI: https://doi.org/10.1145/3552466.3556523
2022-01-01
Abstract: With the development of deep learning techniques, audio content synthesis has entered a new era and poses a serious threat to daily life. The ASVspoof Challenge and the ADD Challenge have been launched to motivate the development of Deepfake audio detection algorithms. Current detection models, which consist of a front-end feature extractor and a back-end classifier, rely mainly on physical features rather than on perceptual features, which relate to natural emotion or breathiness. We therefore provide a comprehensive study of 16 physical and perceptual features and evaluate their effectiveness in both Track 1 and Track 2 of the ADD Challenge. Our results show that PLP, a perceptual feature, outperforms all other features in Track 1, while CQCC performs best in Track 2. These experiments demonstrate the significance of perceptual features in detecting Deepfake audio. We also explore the underlying characteristics of the features that distinguish Deepfake audio from real audio, performing a statistical analysis of each feature to show how its distribution differs between real and synthesized audio. This paper offers a potential direction for selecting appropriate feature extraction methods in future detection models.