AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov
2024-08-30
Abstract: Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the vulnerability of automatic speaker verification (ASV) systems to synthetic audio generated by text-to-speech (TTS) and voice conversion (VC) systems. As deep learning algorithms have advanced, synthetic audio has become easier to generate, which exposes ASV systems to potential attacks. To tackle this challenge, the paper proposes a new architecture, AASIST3, which extends the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, and preprocessing techniques. These changes significantly improve the detection of synthetic speech and thereby the security of ASV systems.

### Main Improvements:

1. **Kolmogorov-Arnold Network (KAN)**: Enhances the attention mechanism using KAN to extract more relevant features.
2. **Model Expansion**: Increases the model's width to extract more complex parameters and improve performance.
3. **Data Preprocessing**: Utilizes various data augmentation techniques and pre-emphasis techniques to obtain more meaningful frequency information.
4. **Frontend Selection**: Uses SincConv under closed conditions and Wav2Vec2 XLS-R under open conditions, combined with linear or convolutional layers to maintain dimensionality.
5. **Encoder Improvements**: Explores various encoders, including RawNet2-based encoders and Res2Net encoders, and ultimately finds that the classic RawNet2 encoder performs best.
6. **Loss Function and Optimizer**: Tests various loss functions and optimizers, finding that conventional cross-entropy and the Adam optimizer work best.
7. **Training Methods**: Attempts various training methods such as SAM, ASAM, and SWL, but ultimately finds that these methods do not significantly improve performance.

### Experimental Results:

- Under closed conditions, AASIST3 achieves a minDCF result of 0.5357.
- Under open conditions, AASIST3 achieves a minDCF result of 0.1414.

These results indicate that AASIST3 has achieved significant performance improvements in detecting synthetic speech, more than doubling the performance of the original AASIST model.
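
The pre-emphasis step listed under Data Preprocessing can be illustrated with a minimal sketch. The summary does not state which filter or coefficient the authors use, so the code below assumes the standard first-order filter `y[n] = x[n] - coeff * x[n-1]` with the common speech-processing default `coeff = 0.97`. The function name `pre_emphasis` is hypothetical and is not taken from the paper.

```python
from typing import List, Sequence


def pre_emphasis(signal: Sequence[float], coeff: float = 0.97) -> List[float]:
    """Apply a first-order pre-emphasis filter: y[n] = x[n] - coeff * x[n-1].

    The first sample is passed through unchanged. The default of 0.97 is a
    common choice in speech processing and is an assumption here, not a
    value reported for AASIST3.
    """
    if not signal:
        return []
    out = [float(signal[0])]
    # Pair each sample with its predecessor and subtract the scaled predecessor.
    for prev, cur in zip(signal, signal[1:]):
        out.append(cur - coeff * prev)
    return out


# A constant (low-frequency) signal is strongly attenuated after the first
# sample, while an alternating (high-frequency) signal is amplified.
constant = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

Comparing `constant` with `alternating` shows why such a filter may help a detector: it boosts high-frequency content relative to low-frequency content, which is one plausible reading of "more meaningful frequency information" in the summary above.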