Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics

Arun Kumar Singh, Priyanka Singh
DOI: https://doi.org/10.48550/arXiv.2009.01934
2021-04-11
Abstract: Digital technology has made previously unimaginable applications possible. Having a handful of tools for easy editing and manipulation is exciting, but it raises alarming concerns, because manipulated audio can propagate as speech clones, duplicates, or deepfakes. Validating the authenticity of speech is one of the primary problems of digital audio forensics. We propose an approach that distinguishes human speech from AI-synthesized speech by exploiting bispectral and cepstral analysis. Higher-order statistics show less correlation for human speech than for synthesized speech. Cepstral analysis also reveals a durable power component in human speech that is missing from synthesized speech. We integrate both analyses and propose a machine learning model to detect AI-synthesized speech.
Machine Learning, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the problem of discriminating between human speech and AI-synthesized speech. Advances in digital technology, and in particular in artificial intelligence and deep neural networks, have made synthesized audio and speech increasingly realistic, which makes forged speech easy to generate. Such forgeries may be used for various improper purposes, so verifying the authenticity of speech has become an important issue in digital audio forensics.

The paper proposes a method that combines bispectral analysis with Mel-Frequency Cepstral Coefficients (MFCC) to detect AI-synthesized speech. Specifically, it relies on the following:

1. **Bispectral Analysis**: Bispectral analysis reveals higher-order statistical properties of a signal. These higher-order correlations are weaker in human speech and more prominent in synthesized speech. The paper notes that third-order bispectral correlations are harder to adjust in synthesized speech, so they can be used to distinguish human speech from AI-synthesized speech.
2. **Mel-Frequency Cepstral Coefficients (MFCC) and Their Derivatives**: MFCCs are commonly used speech features that reflect the shape of the human vocal tract. The paper uses not only the MFCCs but also their first-order difference (Δ-Cepstrum) and second-order difference (Δ²-Cepstrum) to make the features more discriminative.

By combining these features, the paper aims to provide a more robust way to distinguish human speech from AI-synthesized speech. The method is intended to remain effective in practical settings, where audio may contain noise and interference.
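
The two feature families described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dft`, `bicoherence`, `mean_and_variance`, `delta`), the frame length, the unwindowed non-overlapping framing, and the choice of mean and variance as summary statistics are all hypothetical. The bicoherence here is the bispectrum E[X(f1) X(f2) X*(f1+f2)] normalized so that its magnitude lies in [0, 1], and the deltas are plain frame-to-frame differences of a cepstral feature sequence that is assumed to come from a separate MFCC extractor.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of one real-valued frame (adequate for short frames)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def bicoherence(signal, frame_len=32):
    """Magnitude of the normalized bispectrum, estimated over frames.

    frame_len and the non-overlapping, unwindowed framing are assumptions
    made for this sketch, not settings taken from the paper.
    """
    spectra = [
        dft(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]
    n_freq = frame_len // 4  # keeps f1 + f2 below frame_len / 2
    grid = []
    for f1 in range(n_freq):
        row = []
        for f2 in range(n_freq):
            # Third-order product summed over frames (the bispectrum estimate).
            triple = sum(X[f1] * X[f2] * X[f1 + f2].conjugate() for X in spectra)
            # Cauchy-Schwarz normalization bounds the magnitude by 1.
            pair_power = sum(abs(X[f1] * X[f2]) ** 2 for X in spectra)
            sum_power = sum(abs(X[f1 + f2]) ** 2 for X in spectra)
            den = math.sqrt(pair_power * sum_power)
            row.append(abs(triple) / den if den > 0 else 0.0)
        grid.append(row)
    return grid


def mean_and_variance(grid):
    """Two summary statistics of the bicoherence magnitudes (illustrative choice)."""
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


def delta(frames):
    """First-order difference along time; the first frame's delta is set to zero."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out
```

Applying `delta` once to a list of per-frame cepstral vectors gives a Δ-Cepstrum sequence, and applying it again gives Δ²-Cepstrum. One plausible way to build a feature vector is to concatenate these with the bicoherence statistics before passing them to a classifier, but how the paper combines them is not specified in this summary.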