ADD 2023: Towards Audio Deepfake Detection and Analysis in the Wild

Jiangyan Yi, Chu Yuan Zhang, Jianhua Tao, Chenglong Wang, Xinrui Yan, Yong Ren, Hao Gu, Junzuo Zhou
2024-09-12
Abstract: The growing prominence of the field of audio deepfake detection is driven by its wide range of applications, notably in protecting the public from potential fraud and other malicious activities, prompting the need for greater attention and research in this area. The ADD 2023 challenge goes beyond binary real/fake classification by emulating real-world scenarios, such as the identification of manipulated intervals in partially fake audio and determining the source responsible for generating any fake audio, both with real-life implications, notably in audio forensics, law enforcement, and construction of reliable and trustworthy evidence. To further foster research in this area, in this article, we describe the dataset that was used in the fake game, manipulation region location and deepfake algorithm recognition tracks of the challenge. We also focus on the analysis of the technical methodologies by the top-performing participants in each task and note the commonalities and differences in their approaches. Finally, we discuss the current technical limitations as identified through the technical analysis, and provide a roadmap for future research directions. The dataset is available for download at http://addchallenge.cn/downloadADD2023.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The main goal of this paper is to address several key issues in audio deepfake detection and to advance the related technologies. Specifically:

1. **Multi-task Challenge Design**: The paper introduces the ADD 2023 challenge, which aims to go beyond traditional binary classification (real/fake) methods to further enhance the capability of audio deepfake detection. The challenge includes three main tracks:
   - **Audio Forgery Game (FG)**: Divided into generation tasks (FG-G) and detection tasks (FG-D), simulating the game process of attack and defense.
   - **Manipulation Region Localization (RL)**: Identifying the specific time segments that have been tampered with in partially forged audio.
   - **Deepfake Algorithm Recognition (AR)**: Determining the algorithm used to generate specific forged audio.
2. **Dataset Design**: To better simulate real-world scenarios, the paper details the datasets used for each track. These datasets include not only real speech samples but also various forged audio samples, covering different generation techniques and environmental conditions. For example:
   - **Generation Task (FG-G)** uses the AISHELL-3 dataset for training, ensuring high-quality and realistic generated audio.
   - **Detection Task (FG-D)** includes forged audio samples from various generation models (such as HiFiGAN, LPCNet, etc.).
   - **Manipulation Region Localization (RL)** dataset simulates partial tampering by splicing real recordings or forged audio.
   - **Deepfake Algorithm Recognition (AR)** dataset includes audio samples generated by known and unknown algorithms to test the model's recognition ability under different conditions.
3. **Evaluation Metrics**: The paper also defines evaluation metrics for each track, such as Deception Success Rate (DSR), Equal Error Rate (EER), Sentence-level Accuracy (As), and Segment-level F1 Score (F1s), to comprehensively assess the performance of the participating systems.
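To make two of the metrics named above more concrete, the following is a minimal sketch of how an Equal Error Rate and a segment-level F1 score are commonly computed. It is not the challenge's official scoring code. The function names `compute_eer` and `segment_f1` are hypothetical, and the conventions used here (higher scores mean "more likely genuine"; label 1 marks a fake segment and is treated as the positive class) are assumptions rather than details taken from the paper.

```python
def compute_eer(genuine_scores, fake_scores):
    """Approximate the Equal Error Rate by sweeping observed scores as thresholds.

    Assumption: a higher score means the utterance is more likely genuine.
    An utterance is accepted as genuine when its score >= threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(fake_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: fake utterances accepted as genuine.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        # False rejection rate: genuine utterances rejected as fake.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def segment_f1(ref_labels, hyp_labels):
    """F1 over fixed-length segments of one or more utterances.

    Assumption: label 1 = fake segment (positive class), 0 = genuine segment.
    """
    if len(ref_labels) != len(hyp_labels):
        raise ValueError("reference and hypothesis must have the same length")
    pairs = list(zip(ref_labels, hyp_labels))
    tp = sum(r == 1 and h == 1 for r, h in pairs)
    fp = sum(r == 0 and h == 1 for r, h in pairs)
    fn = sum(r == 1 and h == 0 for r, h in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy scores: one genuine utterance (0.4) falls below one fake utterance (0.6).
    print(compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
    # Toy segment labels: the predicted fake region is shifted by one segment.
    print(segment_f1([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]))
```

Sweeping only the observed scores keeps the sketch dependency-free; a production scorer would typically interpolate between thresholds on the ROC curve instead.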
In summary, this paper aims to advance audio deepfake detection by proposing a series of challenging tasks and carefully designed datasets, with the goal of making detection systems more reliable and effective in practical applications.