Temporal Localization of Deepfake Audio Based on Self-Supervised Pretraining Models and Transformer Classifier

Zihan Yan, Hongxia Wang, Mingshan Du, Rui Zhang
DOI: https://doi.org/10.1109/icccbda61447.2024.10569939
2024-01-01
Abstract: As deep learning advances, deepfake audio is becoming increasingly convincing, and tampering with even a short segment of an utterance can change its meaning substantially, posing a serious threat to social security. Locating the tampered regions within an audio clip is more challenging than the binary real-versus-fake classification used in tampered audio detection. To improve localization accuracy, the framework proposed in this paper integrates an audio feature extractor based on a self-supervised pretraining model with a transformer-based back-end classifier. First, a large-scale self-supervised pretraining model, such as BYOL-A or WavLM, is used to learn speech representations. The learned representations are then fed into the transformer back-end classifier, which performs the temporal localization and regression tasks: it classifies each frame and estimates the tampering boundaries in order to detect tampered segments. Experiments demonstrate that our framework performs well on partial forgery detection and localization in challenging environments.
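The final step described in the abstract, turning per-frame decisions into tampered segments, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.5 decision threshold, and the 20 ms frame hop are assumptions chosen for illustration, and the per-frame scores stand in for whatever the back-end classifier outputs.

```python
from typing import List, Sequence, Tuple


def frames_to_segments(scores: Sequence[float],
                       threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames scored as tampered into [start, end) frame ranges.

    `scores` are hypothetical per-frame tampering probabilities; `threshold`
    is an assumed decision threshold, not a value taken from the paper.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(scores):
        tampered = score >= threshold
        if tampered and start is None:
            start = i                      # a tampered run begins here
        elif not tampered and start is not None:
            segments.append((start, i))    # run ended at the previous frame
            start = None
    if start is not None:                  # run extends to the end of the clip
        segments.append((start, len(scores)))
    return segments


def to_seconds(segments: Sequence[Tuple[int, int]],
               hop_s: float = 0.02) -> List[Tuple[float, float]]:
    """Convert frame ranges to time ranges, assuming a fixed frame hop in seconds."""
    return [(start * hop_s, end * hop_s) for start, end in segments]


# Example: frames 2-4 and frame 6 exceed the threshold.
example_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6]
example_segments = frames_to_segments(example_scores)  # [(2, 5), (6, 7)]
example_times = to_seconds(example_segments)
```

In a full system, the boundary estimates produced by the regression task could be used to refine the start and end of each range instead of relying on the threshold alone; that refinement is left out of this sketch.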