Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer

Amit Kumar Singh Yadav, Ziyue Xiang, Kratika Bhagtani, Paolo Bestagini, Stefano Tubaro, Edward J. Delp

2024-02-22

Abstract: Many deep learning synthetic speech generation tools are readily available. Synthetic speech has been used for financial fraud, impersonation, and the spread of misinformation. For this reason, forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset, and their performance drops substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper we propose the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time-domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset. Our experiments show that PS3DT performs well on the ASVspoof2019 dataset compared to other approaches that use spectrograms for synthetic speech detection. We also investigate the generalization performance of PS3DT on the In-the-Wild dataset. PS3DT generalizes better than several existing methods when detecting synthetic speech from an out-of-distribution dataset. We also evaluate the robustness of PS3DT in detecting telephone-quality synthetic speech and synthetic speech shared on social platforms (compressed speech). PS3DT is robust to compression and can detect telephone-quality synthetic speech better than several existing methods.

Sound, Computer Vision and Pattern Recognition, Machine Learning, Audio and Speech Processing, Signal Processing

What problem does this paper attempt to address?

This paper addresses the detection of synthetic speech, that is, speech generated by a model rather than spoken by a human. With advances in deep learning, the audio quality of synthetic speech has become very close to that of real human speech. This has brought convenience in fields such as voice assistants, education, and advertising, but synthetic speech has also been used for malicious purposes such as fraud, impersonation, and spreading misinformation. Existing detection methods often overfit on a single dataset and perform poorly in practical settings, such as compressed speech shared on social platforms. To address this, the paper proposes the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT). The method converts the time-domain speech signal into a mel-spectrogram and processes the spectrogram in patches using a transformer neural network. The experiments show that PS3DT performs well on the ASVspoof2019 dataset compared to other spectrogram-based detection methods, and that it generalizes better than several existing methods to an out-of-distribution dataset (In-the-Wild). PS3DT is also robust to compression: it detects telephone-quality synthetic speech, and synthetic speech shared on social platforms, better than several existing methods. This matters for detecting synthetic speech used to deceive automatic voice verification systems or to impersonate others. In summary, the paper targets two challenges in synthetic speech detection, cross-dataset generalization and robustness to compressed and telephone-quality speech, and addresses them with a transformer that operates on spectrogram patches.
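To make the idea of processing a spectrogram "in patches" concrete, here is a minimal sketch of the patching step only. It is not the authors' implementation: the function name `split_into_patches`, the patch sizes, the non-overlapping layout, and the choice to drop incomplete edge patches are all assumptions made for illustration. The mel-spectrogram is represented as a plain list of rows (mel bands by time frames), and each patch is flattened into one token vector of the kind a transformer could take as input.

```python
from typing import List

Matrix = List[List[float]]


def split_into_patches(spec: Matrix, patch_h: int, patch_w: int) -> List[List[float]]:
    """Split a (n_mels x n_frames) spectrogram into non-overlapping patches.

    Each patch is flattened row by row into one token vector. Rows and
    columns that do not fill a whole patch are dropped (an assumption of
    this sketch, not something stated in the paper summary).
    """
    n_rows = len(spec)
    n_cols = len(spec[0]) if spec else 0
    tokens = []
    for top in range(0, n_rows - patch_h + 1, patch_h):
        for left in range(0, n_cols - patch_w + 1, patch_w):
            token = [
                spec[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ]
            tokens.append(token)
    return tokens


# Toy 4 x 6 "spectrogram" whose entries encode their own position
# (value = 10 * row + column), so the patch layout is easy to read.
toy_spec = [[10.0 * r + c for c in range(6)] for r in range(4)]
tokens = split_into_patches(toy_spec, patch_h=2, patch_w=3)
print(len(tokens), len(tokens[0]))
```

With the toy input, the 4 x 6 grid yields four patches of six values each. In a real detector, each flattened patch would typically be projected to an embedding and combined with position information before entering the transformer; those steps are omitted here because the summary does not specify them.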

Ghost-in-Wave: How Speaker-Irrelative Features Interfere DeepFake Voice Detectors

Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis

Deepfake audio detection by speaker verification

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation

Hybrid Transformer Architectures With Diverse Audio Features for Deepfake Speech Classification

Detecting Synthetic Speech Manipulation in Real Audio Recordings

DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

Voice Presentation Attack Detection Using Convolutional Neural Networks

All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection

Can DeepFake Speech be Reliably Detected?

DiffSSD: A Diffusion-Based Dataset For Speech Forensics

FairSSD: Understanding Bias in Synthetic Speech Detectors

Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content

Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs

AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Multi-branch Network with Circle Loss Using Voice Conversion and Channel Robust Data Augmentation for Synthetic Speech Detection

Multi-Task Learning Improves Synthetic Speech Detection

Spoofing Detection Goes Noisy: An Analysis of Synthetic Speech Detection in the Presence of Additive Noise

Mitigating Unauthorized Speech Synthesis for Voice Protection