DiffSSD: A Diffusion-Based Dataset For Speech Forensics

Kratika Bhagtani, Amit Kumar Singh Yadav, Paolo Bestagini, Edward J. Delp
2024-10-02
Abstract: Diffusion-based speech generators are ubiquitous. These methods can generate very high quality synthetic speech and several recent incidents report their malicious use. To counter such misuse, synthetic speech detectors have been developed. Many of these detectors are trained on datasets which do not include diffusion-based synthesizers. In this paper, we demonstrate that existing detectors trained on one such dataset, ASVspoof2019, do not perform well in detecting synthetic speech from recent diffusion-based synthesizers. We propose the Diffusion-Based Synthetic Speech Dataset (DiffSSD), a dataset consisting of about 200 hours of labeled speech, including synthetic speech generated by 8 diffusion-based open-source and 2 commercial generators. We also examine the performance of existing synthetic speech detectors on DiffSSD in both closed-set and open-set scenarios. The results highlight the importance of this dataset in detecting synthetic speech generated from recent open-source and commercial speech generators.
Audio and Speech Processing, Computer Vision and Pattern Recognition, Multimedia, Sound
What problem does this paper attempt to address?
This paper addresses the poor performance of existing synthetic speech detectors on high-quality synthetic speech generated by diffusion models. Most existing detectors are trained on datasets, such as ASVspoof2019, that contain speech from traditional generation methods (such as RNN, HMM, Transformer, and GAN). As a result, these detectors struggle to detect synthetic speech produced by recent diffusion models. To address this, the authors propose a new dataset, the Diffusion-Based Synthetic Speech Dataset (DiffSSD). It contains approximately 200 hours of labeled speech, including synthetic speech generated by 8 open-source diffusion-based generators and 2 commercial generators. The authors use DiffSSD to evaluate existing synthetic speech detectors in both closed-set and open-set scenarios. The problem is described in more detail below:

1. **Background and Motivation**:
   - Advances in speech synthesis make it easy to generate high-quality synthetic speech, which also enables malicious uses such as fraud and the spread of false information.
   - Existing synthetic speech detectors are mainly trained on speech from traditional generation methods and detect speech from recent diffusion models poorly.
2. **Research Objectives**:
   - Propose a new dataset, DiffSSD, containing high-quality synthetic speech generated by diffusion models.
   - Evaluate existing synthetic speech detectors on DiffSSD to reveal their limitations in detecting recent synthetic speech.
   - Explore how to improve synthetic speech detectors so that they can effectively detect speech generated by diffusion models.
3. **Solutions**:
   - Construct the DiffSSD dataset, which contains approximately 200 hours of labeled speech and covers synthetic speech generated by multiple diffusion models.
   - Retrain and evaluate existing synthetic speech detectors and verify their performance on DiffSSD.
   - Analyze how different detection methods perform in closed-set and open-set scenarios to provide a reference for future research.

Through these measures, the authors aim to improve the generalization and accuracy of synthetic speech detectors against increasingly sophisticated synthetic speech.
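To make the closed-set versus open-set distinction concrete, here is a minimal Python sketch of how such a split could be built. It assumes the common reading of the two terms: closed-set test speech comes from generators also seen during training, while open-set test speech comes from generators held out of training entirely. The function `split_closed_open`, the `test_every` parameter, and the generator names (`gen_a`, `gen_b`, `gen_c`) are hypothetical and chosen for illustration; this is not the paper's actual partitioning.

```python
from typing import List, Set, Tuple

# Hypothetical sample record: (utterance_id, source), where source is either
# a generator name or "bonafide" for real speech.
Sample = Tuple[str, str]


def split_closed_open(
    samples: List[Sample],
    held_out: Set[str],
    test_every: int = 5,
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Split samples into (train, closed_test, open_test).

    - open_test: every sample from a held-out generator (never seen in training).
    - closed_test: every `test_every`-th remaining sample (sources seen in training).
    - train: everything else.
    """
    train: List[Sample] = []
    closed_test: List[Sample] = []
    open_test: List[Sample] = []
    for index, sample in enumerate(samples):
        _, source = sample
        if source in held_out:
            open_test.append(sample)
        elif index % test_every == 0:
            closed_test.append(sample)
        else:
            train.append(sample)
    return train, closed_test, open_test


if __name__ == "__main__":
    # Toy data with placeholder generator names (assumptions, not from the paper).
    sources = ["gen_a", "gen_b", "gen_c", "bonafide"] * 5
    toy = [(f"utt{i}", src) for i, src in enumerate(sources)]
    train, closed_test, open_test = split_closed_open(toy, held_out={"gen_c"})
    print(len(train), len(closed_test), len(open_test))
```

In practice an open-set evaluation would also need bona fide speech in the test set; in this sketch that could be taken from the `bonafide` portion of `closed_test`.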