FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection

Bo Wang, Yeling Tang, Fei Wei, Zhongjie Ba, Kui Ren
DOI: https://doi.org/10.1109/taslp.2024.3492796
2024-01-01
Abstract: In recent years, the field of audio deepfake detection has witnessed significant advancements. Nonetheless, the majority of solutions have concentrated on high-quality audio, largely overlooking the challenge of low-quality compressed audio in real-world scenarios. Low-quality compressed audio typically suffers from a loss of high-frequency details and time-domain information, which significantly undermines the performance of advanced deepfake detection systems when confronted with such data. In this paper, we introduce a deepfake detection model that employs knowledge distillation across the frequency and time domains. Our approach trains a teacher model on high-quality data and a student model on low-quality compressed data. Subsequently, we implement frequency-domain and time-domain distillation to facilitate the student model's learning of high-frequency information and time-domain details from the teacher model. Experimental evaluations on the ASVspoof 2019 LA and ASVspoof 2021 DF datasets demonstrate the effectiveness of our methodology. On the ASVspoof 2021 DF dataset, which consists of low-quality compressed audio, we achieved an Equal Error Rate (EER) of 2.82%. To our knowledge, this performance is the best among all deepfake voice detection systems tested on the ASVspoof 2021 DF dataset. Additionally, our method proves to be versatile, showing notable performance on high-quality data with an EER of 0.30% on the ASVspoof 2019 LA dataset, closely approaching state-of-the-art results.
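The teacher-student setup described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything below is an assumption made for illustration: features are taken to be arrays of shape (batch, frequency bins, time frames), the frequency-domain term is a mean squared error restricted to the upper frequency bins, the time-domain term is a mean squared error on frame-to-frame differences, and the names `high_band_start`, `alpha`, and `beta` are hypothetical. This is not the authors' implementation.

```python
import numpy as np


def frequency_distillation_loss(student_feat, teacher_feat, high_band_start):
    """Assumed frequency-domain term: MSE on the high-frequency bins only.

    Features have shape (batch, freq_bins, time_frames). Bins at index
    `high_band_start` and above are treated as the "high-frequency" band.
    """
    s = student_feat[:, high_band_start:, :]
    t = teacher_feat[:, high_band_start:, :]
    return float(np.mean((s - t) ** 2))


def time_distillation_loss(student_feat, teacher_feat):
    """Assumed time-domain term: MSE on frame-to-frame differences."""
    ds = np.diff(student_feat, axis=-1)  # temporal deltas of the student
    dt = np.diff(teacher_feat, axis=-1)  # temporal deltas of the teacher
    return float(np.mean((ds - dt) ** 2))


def total_student_loss(cls_loss, student_feat, teacher_feat,
                       high_band_start, alpha=1.0, beta=1.0):
    """Classification loss plus weighted distillation terms (hypothetical weights)."""
    l_freq = frequency_distillation_loss(student_feat, teacher_feat, high_band_start)
    l_time = time_distillation_loss(student_feat, teacher_feat)
    return cls_loss + alpha * l_freq + beta * l_time


# Toy example: the teacher (high-quality input) carries energy in the upper
# two bins that the student (compressed input) has lost.
student = np.zeros((1, 4, 3))
teacher = np.zeros((1, 4, 3))
teacher[:, 2:, :] = 1.0

loss = total_student_loss(0.5, student, teacher, high_band_start=2,
                          alpha=2.0, beta=3.0)
print(loss)
```

In this toy case the frequency term is nonzero because the student is missing the teacher's high-band content, while the time term is zero because neither feature map changes across frames. In a real training loop the teacher features would be held fixed and only the student would be updated.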
What problem does this paper attempt to address?