Abstract: Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human-parity speech by leveraging flow-matching and diffusion models, respectively. Unfortunately, human-level audio synthesis can lead to identity misuse and information security issues. Many anti-spoofing models have been developed against deepfake audio. However, the efficacy of current state-of-the-art anti-spoofing models in countering audio synthesized by diffusion and flow-matching based TTS systems remains unknown. In this paper, we propose the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collects deepfake audio generated by advanced diffusion and flow-matching TTS models. Additionally, we reveal that current anti-spoofing models lack sufficient robustness against highly human-like audio generated by diffusion and flow-matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti-spoofing models.

What problem does this paper attempt to address?

The main problem this paper attempts to address is that current state-of-the-art anti-spoofing models are not adequate for detecting deepfake audio generated by diffusion models and flow-matching models. Specifically:

1. **Background Issues**:
   - Current zero-shot Text-to-Speech (TTS) systems, such as Voicebox and Seed-TTS, use flow-matching and diffusion models respectively to achieve human-like speech synthesis quality.
   - This high-quality speech synthesis also raises the risk of identity misuse and information security problems.
   - Although many anti-spoofing models have been developed to combat deepfake audio, their effectiveness against audio generated by diffusion and flow-matching models remains unclear.

2. **Research Objectives**:
   - Propose the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset, which collects deepfake audio generated by these advanced TTS models.
   - Reveal the shortcomings of current anti-spoofing models in handling highly human-like audio and provide a valuable resource for developing more robust anti-spoofing models.

3. **Main Contributions**:
   - Created the DFADD dataset, which includes deepfake audio generated by various mainstream diffusion and flow-matching TTS models.
   - Experimentally evaluated current state-of-the-art anti-spoofing models on this audio and found that they have significant difficulty detecting audio generated by diffusion and flow-matching models.
   - Anti-spoofing models trained on the DFADD dataset showed significant improvement in detecting synthetic speech, particularly in unseen scenarios, with an average Equal Error Rate (EER) reduction of over 47%.

In summary, this paper aims to close the gap that existing anti-spoofing models show when faced with audio generated by new TTS technologies, and it provides data support for developing more effective anti-spoofing models.
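The Equal Error Rate cited above is the operating point at which a detector's false acceptance rate (spoofed audio accepted as real) equals its false rejection rate (real audio rejected). The following is a minimal sketch of how it can be approximated from raw detector scores. It is not the paper's evaluation code: the function names, the "higher score means bona fide" convention, and the toy scores are all assumptions made for illustration. The summary also does not say whether the 47% figure is a relative or an absolute reduction; the `relative_reduction` helper shows the relative reading only as one possible interpretation.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Assumption: a higher score means "more likely bona fide".
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction (hypothetical helper, one reading of 'reduction')."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores, invented for illustration only.
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.75]
print(equal_error_rate(bonafide, spoof))
print(relative_reduction(0.20, 0.10))
```

With finite score lists the two error rates rarely cross exactly, so this sketch reports the midpoint at the closest threshold; evaluation toolkits typically interpolate along the ROC curve instead.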

DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset
