DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset

Jiawei Du, I-Ming Lin, I-Hsiang Chiu, Xuanjun Chen, Haibin Wu, Wenze Ren, Yu Tsao, Hung-yi Lee, Jyh-Shing Roger Jang
2024-09-13
Abstract: Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human-parity speech by leveraging flow-matching and diffusion models, respectively. Unfortunately, human-level audio synthesis leads to identity misuse and information security issues. Many anti-spoofing models have been developed against deepfake audio. However, the efficacy of current state-of-the-art anti-spoofing models in countering audio synthesized by diffusion and flow-matching based TTS systems remains unknown. In this paper, we propose the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collects deepfake audio generated by advanced diffusion and flow-matching TTS models. Additionally, we reveal that current anti-spoofing models lack sufficient robustness against highly human-like audio generated by diffusion and flow-matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti-spoofing models.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper attempts to address is that current state-of-the-art anti-spoofing models do not reliably detect deepfake audio generated by diffusion models and flow-matching models. Specifically:

1. **Background Issues**:
   - Current zero-shot Text-to-Speech (TTS) systems, such as Voicebox and Seed-TTS, use flow-matching and diffusion models, respectively, to achieve human-like speech synthesis quality.
   - However, this high-quality speech synthesis also brings issues of identity misuse and information security.
   - Although many anti-spoofing models have been developed to combat deepfake audio, their effectiveness against audio generated by diffusion and flow-matching models remains unclear.

2. **Research Objectives**:
   - Propose the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset, which collects deepfake audio generated by these advanced TTS models.
   - Reveal the shortcomings of current anti-spoofing models in handling highly human-like audio and provide a valuable resource for developing more robust anti-spoofing models.

3. **Main Contributions**:
   - Created the DFADD dataset, which includes deepfake audio generated by various mainstream diffusion and flow-matching TTS models.
   - Experimentally evaluated current state-of-the-art anti-spoofing models on this audio, finding that they have significant difficulty detecting audio generated by diffusion and flow-matching models.
   - Anti-spoofing models trained on the DFADD dataset showed significant improvement in detecting synthetic speech, particularly in unseen scenarios, with an average Equal Error Rate (EER) reduction of over 47%.

In summary, this paper aims to close the gap in existing anti-spoofing models when dealing with audio generated by new TTS technologies and provides data support for developing more effective anti-spoofing models.
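
For readers unfamiliar with the Equal Error Rate (EER) metric cited above, the following is a minimal sketch of how it is commonly computed from detector scores. It is not taken from the paper: the function name `compute_eer`, the convention that higher scores mean "more likely bona fide", the threshold sweep over observed scores, and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a higher score means the detector believes the audio is
    bona fide (real). Returns (eer, threshold).
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score seen in either class.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide audio scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed audio scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates meet;
    # with finite data we take the closest point and average the two rates.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = float((far[idx] + frr[idx]) / 2.0)
    return eer, float(thresholds[idx])


if __name__ == "__main__":
    # Hypothetical toy scores, not results from the paper.
    real = [0.9, 0.8, 0.4, 0.6]
    fake = [0.1, 0.5, 0.3, 0.7]
    eer, threshold = compute_eer(real, fake)
    print(f"EER = {eer:.2%}")  # EER = 25.00%
```

A lower EER means the detector separates bona fide and spoofed audio more cleanly, so a reported EER reduction corresponds to the two score distributions overlapping less after training on the new data.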