Injecting Undetectable Backdoors in Deep Learning and Language Models

Alkis Kalavasis, Amin Karbasi, Argyris Oikonomou, Katerina Sotiraki, Grigoris Velegkas, Manolis Zampetakis
2024-06-09
Abstract: As ML models become increasingly complex and integral to high-stakes domains such as finance and healthcare, they also become more susceptible to sophisticated adversarial attacks. We investigate the threat posed by undetectable backdoors in models developed by insidious external expert firms. When such backdoors exist, they allow the designer of the model to sell information to the users on how to carefully perturb the least significant bits of their input to change the classification outcome to a favorable one. We develop a general strategy to plant a backdoor to neural networks while ensuring that even if the model's weights and architecture are accessible, the existence of the backdoor is still undetectable. To achieve this, we utilize techniques from cryptography such as cryptographic signatures and indistinguishability obfuscation. We further introduce the notion of undetectable backdoors to language models and extend our neural network backdoor attacks to such models based on the existence of steganographic functions.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses the threat of undetectable backdoors planted in deep learning and language models. Specifically, it studies how to implant a backdoor so that its existence cannot be detected even when the model's weights and architecture are fully accessible. Such a backdoor lets the model's designer change the classification outcome by carefully perturbing the least significant bits of the input. The paper also extends the construction to language models, relying on the existence of steganographic functions.

### Main contributions of the paper

1. **Construction of undetectable backdoors**: The paper proposes a general and efficient method for planting undetectable backdoors in deep neural networks (DNNs). The backdoor remains undetectable not only under black-box access but also under white-box access, where the weights and architecture are visible.
2. **Non-replicability**: Beyond undetectability, the construction is non-replicable: someone who observes multiple backdoored inputs still cannot generate new backdoored inputs on their own.
3. **Application to language models**: The paper introduces the notion of undetectable backdoors for language models (LMs) and extends the neural network attack to them using steganographic techniques.

### Technical means

- **Cryptographic tools**: The construction relies on pseudorandom generators (PRGs), digital signatures, and indistinguishability obfuscation.
- **Steganography**: For language models, the backdoor trigger is carried by hidden information embedded in the text, so that triggered inputs do not stand out.

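
The idea of hiding a cryptographic tag in the least significant bits of an input can be illustrated with a toy sketch. This is not the authors' construction: it swaps in an HMAC (a symmetric-key MAC from the Python standard library) for the digital signature, it omits obfuscation entirely, and every name in it (`honest_classifier`, `activate_backdoor`, `backdoored_classifier`, `TAG_BITS`) is hypothetical. Because the key sits in plain view inside the classifier, this toy version is neither white-box undetectable nor non-replicable; it only shows how a tiny input perturbation can act as a trigger.

```python
import hashlib
import hmac

TAG_BITS = 64  # hypothetical tag length: one tag bit per input feature


def honest_classifier(pixels):
    """Toy stand-in model: label 1 if the mean 8-bit intensity exceeds 127."""
    return int(sum(pixels) / len(pixels) > 127)


def _tag_bits(key, pixels):
    """MAC over the high 7 bits of every pixel, returned as TAG_BITS bits.

    Assumption: HMAC-SHA256 stands in for the signature scheme.
    """
    message = bytes(p >> 1 for p in pixels)  # ignores the least significant bits
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(TAG_BITS)]


def activate_backdoor(key, pixels):
    """Write the tag into the LSBs of the first TAG_BITS pixels."""
    tag = _tag_bits(key, pixels)
    out = list(pixels)
    for i, bit in enumerate(tag):
        out[i] = (out[i] & ~1) | bit  # each pixel changes by at most 1
    return out


def backdoored_classifier(key, pixels):
    """Flip the honest label whenever the LSBs carry a valid tag."""
    tag = _tag_bits(key, pixels)
    lsbs = [p & 1 for p in pixels[:TAG_BITS]]
    label = honest_classifier(pixels)
    if hmac.compare_digest(bytes(lsbs), bytes(tag)):
        return 1 - label
    return label


if __name__ == "__main__":
    key = b"hypothetical-secret-key"
    clean = [(i * 37) % 256 for i in range(100)]  # needs at least TAG_BITS pixels
    triggered = activate_backdoor(key, clean)
    print(honest_classifier(triggered), backdoored_classifier(key, triggered))
```

On clean inputs the two classifiers agree unless the low bits happen to match the tag, which has probability about 2^-64 for a 64-bit tag. In the paper's setting, a public-key signature lets the model hold only a verification key, and obfuscation hides the check itself; neither property is reproduced here.
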
### Security and defense

Although the paper shows how to construct undetectable backdoors, it also discusses potential defense measures. These defenses cannot completely eliminate the risk, because undetectable backdoors expose a fundamental vulnerability of modern machine learning models.

### Related work

- **Comparison with [GKVZ22]**: The work of [GKVZ22] mainly focuses on constructing black-box undetectable backdoors, while this paper extends to white-box undetectable backdoors and applies to a wider range of deep learning models.
- **Other related research**: The paper also reviews other research on backdoor attacks, adversarial examples, and watermarking techniques, emphasizing the connections and differences among these fields.

In conclusion, by studying how undetectable backdoors can be constructed, the paper reveals security risks in modern machine learning systems and provides a theoretical framework and technical tools for analyzing them.