DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

Amit Kumar Singh Yadav, Kratika Bhagtani, Ziyue Xiang, Paolo Bestagini, Stefano Tubaro, Edward J. Delp
2023-07-29
Abstract: Tools to generate high quality synthetic speech signal that is perceptually indistinguishable from speech recorded from human speakers are easily available. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods as a black box without providing reasoning for the decisions they make. This limits the interpretability of these approaches. In this paper, we propose Disentangled Spectrogram Variational Auto Encoder (DSVAE) which is a two staged trained variational autoencoder that processes spectrograms of speech using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map to highlight the spectrogram regions that discriminate synthetic and bona fide human speech signals. We evaluated the representations obtained from DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (>98%) on detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers. We also visualize the representation obtained from DSVAE for 17 different speech synthesizers and verify that they are indeed interpretable and discriminate bona fide and synthetic speech from each of the synthesizers.
Sound, Computer Vision and Pattern Recognition, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses synthetic speech detection and, in particular, the lack of interpretability in existing detectors. Many current detectors reach high accuracy but use deep learning as a black box and give no reasoning for their decisions, which limits their transparency. The paper proposes the Disentangled Spectrogram Variational Auto Encoder (DSVAE), a variational autoencoder trained in two stages that processes speech spectrograms with disentangled representation learning to produce interpretable representations separating bona fide human speech from synthetic speech. DSVAE also creates an activation map that highlights the spectrogram regions discriminating the two classes. On the ASVspoof2019 dataset, DSVAE detects synthetic speech with accuracy above 98% for 6 known synthesizers and 10 of the 11 unknown synthesizers; the exception is one particularly challenging synthesizer, A17. Visualizing the representations for 17 different speech synthesizers confirms that they are interpretable and separate bona fide from synthetic speech for each synthesizer. The method also performs well in practical scenarios, such as detecting synthetic speech uploaded to social platforms and resisting simple attacks (e.g., removing silent regions).
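For readers unfamiliar with variational autoencoders, the following minimal sketch shows the two pieces of a generic VAE training objective: the reparameterized latent sample and the KL term against a standard normal prior. It is not the DSVAE architecture or its two-stage training procedure, which this summary does not specify. The function names, the squared-error reconstruction term, and the (batch, frequency, time) spectrogram layout are all assumptions made for illustration.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (generic VAE sampling step)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims, one value per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


def negative_elbo(x, x_recon, mu, log_var):
    """Squared-error reconstruction term plus KL term, averaged over the batch.

    Assumes x and x_recon are spectrogram batches shaped (batch, freq, time);
    the squared-error choice is an illustrative assumption, not the paper's loss.
    """
    recon = np.sum((x - x_recon) ** 2, axis=tuple(range(1, x.ndim)))
    return float(np.mean(recon + kl_to_standard_normal(mu, log_var)))
```

In a typical setup an encoder network would produce `mu` and `log_var` from a spectrogram, a decoder would produce `x_recon` from the sampled `z`, and the quantity returned by `negative_elbo` would be minimized during training.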