Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer

Amit Kumar Singh Yadav, Ziyue Xiang, Kratika Bhagtani, Paolo Bestagini, Stefano Tubaro, Edward J. Delp
2024-02-22
Abstract: Many deep learning synthetic speech generation tools are readily available. Synthetic speech has been used for financial fraud, impersonation of people, and spreading misinformation. For this reason, forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset, and their performance drops substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper we propose Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time-domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset. Our experiments show that PS3DT performs well on the ASVspoof2019 dataset compared to other approaches that use spectrograms for synthetic speech detection. We also investigate the generalization performance of PS3DT on the In-the-Wild dataset. PS3DT generalizes better than several existing methods when detecting synthetic speech from an out-of-distribution dataset. We also evaluate the robustness of PS3DT in detecting telephone-quality synthetic speech and synthetic speech shared on social platforms (compressed speech). PS3DT is robust to compression and can detect telephone-quality synthetic speech better than several existing methods.
Sound, Computer Vision and Pattern Recognition, Machine Learning, Audio and Speech Processing, Signal Processing
What problem does this paper attempt to address?
This paper addresses the detection of synthetic speech, that is, speech generated by a model rather than spoken by a human. With advances in deep learning, the audio quality of synthetic speech has become very close to that of real human speech. This is useful in fields such as voice assistants, education, and advertising, but synthetic speech has also been used maliciously for fraud, impersonation, and spreading misinformation. Existing detection methods often overfit on a single dataset and perform poorly in practical settings, such as compressed speech shared on social media platforms.

To address this, the paper proposes the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT). The method converts a time-domain speech signal into a mel-spectrogram and processes the spectrogram in patches using a transformer neural network.

The experiments reported in the paper show that PS3DT outperforms other spectrogram-based detection methods on the ASVspoof2019 dataset and generalizes better than several existing methods to an out-of-distribution dataset (In-the-Wild). PS3DT is also robust to compression, as applied to speech shared on social platforms, and detects telephone-quality synthetic speech better than several existing methods.

In summary, the paper targets two challenges in synthetic speech detection: generalization across datasets, and robustness to compressed and telephone-quality speech. Its transformer-based, patch-wise processing of mel-spectrograms improves detection performance under both conditions.
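
To make "processing a mel-spectrogram in patches" concrete, the sketch below splits a 2-D spectrogram into non-overlapping patches and flattens each one into a vector, which is the usual form of input tokens for a transformer. It is only an illustration, not the paper's implementation: the function name `patchify`, the 16x16 patch size, the 80x400 spectrogram shape, and the choice to drop incomplete edge patches are all assumptions, and random data stands in for a real mel-spectrogram.

```python
import numpy as np


def patchify(spec: np.ndarray, patch_h: int = 16, patch_w: int = 16) -> np.ndarray:
    """Split a 2-D spectrogram (n_mels, n_frames) into non-overlapping patches.

    Returns an array of shape (num_patches, patch_h * patch_w), one flattened
    patch per row, ordered row by row across the spectrogram.
    """
    n_mels, n_frames = spec.shape
    rows = n_mels // patch_h
    cols = n_frames // patch_w
    # Assumption: trailing bins/frames that do not fill a whole patch are
    # dropped; padding would be an equally reasonable choice.
    spec = spec[: rows * patch_h, : cols * patch_w]
    # (rows, patch_h, cols, patch_w) -> (rows, cols, patch_h, patch_w)
    patches = spec.reshape(rows, patch_h, cols, patch_w).transpose(0, 2, 1, 3)
    return patches.reshape(rows * cols, patch_h * patch_w)


# Hypothetical input: 80 mel bins x 400 frames of random values.
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 400))
tokens = patchify(mel)
print(tokens.shape)  # (125, 256)
```

In a full detector, each flattened patch would typically be linearly projected to an embedding and combined with positional information before entering the transformer; those stages are omitted here.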