Certification of Speaker Recognition Models to Additive Perturbations

Dmitrii Korzh, Elvir Karimov, Mikhail Pautov, Oleg Y. Rogov, Ivan Oseledets
2024-04-29
Abstract: Speaker recognition technology is applied in various tasks ranging from personal virtual assistants to secure access systems. However, the robustness of these systems against adversarial attacks, particularly to additive perturbations, remains a significant challenge. In this paper, we pioneer applying robustness certification techniques to speaker recognition, originally developed for the image domain. In our work, we cover this gap by transferring and improving randomized smoothing certification techniques against norm-bounded additive perturbations for classification and few-shot learning tasks to speaker recognition. We demonstrate the effectiveness of these methods on VoxCeleb 1 and 2 datasets for several models. We expect this work to improve voice-biometry robustness, establish a new certification benchmark, and accelerate research of certification methods in the audio domain.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the security of speaker recognition systems, specifically their robustness to adversarial attacks. Deep learning voice-biometry models perform well in many applications, but small perturbations that are imperceptible to humans can significantly degrade their performance. To address this, the paper adapts randomized smoothing, a certification technique originally developed for the image domain, to certify speaker recognition models against adversarial additive perturbations.

In the methodology section, the paper casts speaker recognition as a few-shot problem in which the model must identify the speaker of a given audio sample. Randomized smoothing then yields robustness guarantees that hold even when the attacker knows the model's architecture, parameters, and gradients. The authors propose a smoothing framework based on Gaussian noise to compute the certified radius of the model under an l2-norm constraint, and they demonstrate the theoretical advantages of this method.

In the experimental section, the paper evaluates the certified accuracy of several models on the VoxCeleb1 and VoxCeleb2 datasets and compares them with existing methods. The results show that the proposed method achieves higher certified accuracy than the other methods, especially in few-shot settings. However, the certificates remain conservative, leaving a gap between empirical accuracy and certified accuracy, and they cover additive perturbations only. More complex attacks, such as deepfakes, require further research. Overall, the paper fills a gap in certification against adversarial attacks for speaker recognition, contributes to the security of voice biometrics, and establishes a new certification benchmark.
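
To make the idea of a certified l2 radius under Gaussian smoothing concrete, here is a minimal sketch of the standard randomized smoothing certificate for classification known from the image domain: if the top class keeps probability at least p > 1/2 under noise N(0, sigma^2 I), the smoothed prediction is constant within an l2 ball of radius sigma * Phi^{-1}(p). This is not the paper's few-shot procedure, which may use a different certificate. The names `certify`, `hoeffding_lower_bound`, and `toy_classifier` are hypothetical, and the Hoeffding bound is a simple, conservative stand-in for whichever confidence interval the authors use.

```python
import math
import random
from statistics import NormalDist


def hoeffding_lower_bound(successes: int, n: int, alpha: float) -> float:
    """One-sided lower confidence bound on a Bernoulli mean (Hoeffding).

    Holds with probability at least 1 - alpha. Assumption: chosen here for
    simplicity; it is more conservative than an exact binomial bound.
    """
    return max(0.0, successes / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n)))


def certified_l2_radius(p_lower: float, sigma: float) -> float:
    """Radius sigma * Phi^{-1}(p_lower); 0.0 when the bound is not above 1/2."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def sample_counts(classifier, x, sigma, n, rng):
    """Count predicted labels over n Gaussian-perturbed copies of x."""
    counts = {}
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = classifier(noisy)
        counts[label] = counts.get(label, 0) + 1
    return counts


def certify(classifier, x, sigma, n_select=100, n_estimate=1000,
            alpha=0.001, seed=0):
    """Return (label, radius), or (None, 0.0) when certification abstains."""
    rng = random.Random(seed)
    # Pick the candidate label on one batch of noisy samples ...
    selection = sample_counts(classifier, x, sigma, n_select, rng)
    candidate = max(selection, key=selection.get)
    # ... and estimate its probability on a fresh batch, so the
    # confidence bound is not biased by the selection step.
    estimation = sample_counts(classifier, x, sigma, n_estimate, rng)
    p_lower = hoeffding_lower_bound(estimation.get(candidate, 0), n_estimate, alpha)
    radius = certified_l2_radius(p_lower, sigma)
    return (candidate, radius) if radius > 0.0 else (None, 0.0)


def toy_classifier(signal):
    """Hypothetical stand-in for a speaker model: label 0 if the mean is positive."""
    return 0 if sum(signal) / len(signal) > 0.0 else 1


if __name__ == "__main__":
    waveform = [1.0] * 16  # toy "audio" vector, not real speech
    label, radius = certify(toy_classifier, waveform, sigma=0.5)
    print(f"label={label}, certified l2 radius={radius:.3f}")
```

For this toy linear classifier the true l2 distance from the input to the decision boundary is 4, so any valid certified radius must stay below that value. The difference between the two illustrates, on a small scale, the gap between empirical and certified robustness noted in the summary.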