Query-Efficient Black-Box Adversarial Attacks on Automatic Speech Recognition.

Chuxuan Tong, Xi Zheng, Jianhua Li, Xingjun Ma, Longxiang Gao, Yong Xiang
DOI: https://doi.org/10.1109/taslp.2023.3304476
2023-01-01
IEEE/ACM Transactions on Audio Speech and Language Processing
Abstract: The susceptibility of Deep Neural Networks (DNNs) to adversarial attacks has raised concerns about their use in real-world applications. Although this vulnerability has been studied extensively in the image domain, research in the audio domain remains limited, particularly in the black-box setting with Automatic Speech Recognition (ASR) models. Several kinds of black-box attacks have been proposed for ASR models, including transfer attacks, hardware attacks, and query-based attacks; this study concentrates on query-based black-box attacks. The article introduces a new gradient estimation technique, Temporal Natural Evolution Strategies (T-NES), to generate adversarial audio samples more efficiently than existing attacks. T-NES leverages the temporal correlation present in audio to speed up gradient estimation based on the probability scores returned by the target model. Empirical results on two benchmark datasets, LibriSpeech and TEDLIUM, and two state-of-the-art ASR models, DeepSpeech2 and Wav2Letter, demonstrate that, within a budget of 500 queries, T-NES can generate successful attacks with up to 30% fewer queries than existing attacks. T-NES could provide a robust baseline for evaluating the black-box adversarial vulnerability of ASR systems.
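As a rough illustration of the score-based gradient estimation the abstract refers to, the sketch below shows a generic antithetic NES estimator in plain Python. It is not the authors' T-NES algorithm, whose details are not given in the abstract. The function name `nes_gradient`, its parameters, and the `block` option (which repeats each noise value over consecutive samples as one simple, assumed way to exploit temporal correlation) are all hypothetical, and `score_fn` stands in for the probability score returned by a black-box model.

```python
import random

def nes_gradient(score_fn, x, n_pairs=20, sigma=0.01, block=1, seed=None):
    """Antithetic NES estimate of the gradient of score_fn at waveform x.

    Each estimate costs 2 * n_pairs queries to score_fn.
    block > 1 shares one noise value across `block` consecutive samples;
    this is an ASSUMED stand-in for temporal correlation, not the
    paper's method.
    """
    rng = random.Random(seed)
    n = len(x)
    n_blocks = -(-n // block)  # ceiling division
    grad = [0.0] * n
    for _ in range(n_pairs):
        # Draw noise at block resolution, then expand to waveform length.
        coarse = [rng.gauss(0.0, 1.0) for _ in range(n_blocks)]
        noise = [coarse[i // block] for i in range(n)]
        # Two queries per pair: one at x + sigma*noise, one at x - sigma*noise.
        plus = score_fn([xi + sigma * zi for xi, zi in zip(x, noise)])
        minus = score_fn([xi - sigma * zi for xi, zi in zip(x, noise)])
        diff = plus - minus
        for i in range(n):
            grad[i] += diff * noise[i]
    scale = 2.0 * sigma * n_pairs
    return [g / scale for g in grad]

# Toy usage: a linear "score" whose true gradient is all ones.
estimate = nes_gradient(sum, [0.0] * 8, n_pairs=2000, seed=0)
```

With `block=1` this reduces to standard NES; larger blocks shrink the search space the noise has to cover, at the cost of estimating only a block-wise constant gradient.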