Abstract: As ML models become increasingly complex and integral to high-stakes domains such as finance and healthcare, they also become more susceptible to sophisticated adversarial attacks. We investigate the threat posed by undetectable backdoors in models developed by insidious external expert firms. When such backdoors exist, they allow the designer of the model to sell information to the users on how to carefully perturb the least significant bits of their input to change the classification outcome to a favorable one. We develop a general strategy to plant a backdoor to neural networks while ensuring that even if the model's weights and architecture are accessible, the existence of the backdoor is still undetectable. To achieve this, we utilize techniques from cryptography such as cryptographic signatures and indistinguishability obfuscation. We further introduce the notion of undetectable backdoors to language models and extend our neural network backdoor attacks to such models based on the existence of steganographic functions.

What problem does this paper attempt to address?

The problem this paper addresses is the threat of undetectable backdoors planted in deep learning and language models. Specifically, it studies how to implant a backdoor so that its existence cannot be detected even when the model's weights and architecture are public. Such a backdoor lets the model's designer change the classification outcome by carefully perturbing the least significant bits of the input, and thereby manipulate the model's output. The paper also extends this undetectable backdoor technique to language models, relying on steganographic functions.

### Main contributions of the paper

1. **Construction of undetectable backdoors**: The paper proposes a general and efficient method for constructing undetectable backdoors in deep neural networks (DNNs). The backdoor remains undetectable not only under black-box access but also under white-box access, where the weights and architecture are visible.
2. **Non-reproducibility**: Beyond undetectability, the paper also ensures that the backdoor is non-reproducible: a party who observes multiple backdoored inputs cannot generate new backdoored inputs on their own.
3. **Application in language models**: The paper further extends undetectable backdoors to language models (LMs), using steganographic techniques.

### Technical means

- **Cryptographic tools**: The paper constructs backdoors using tools from cryptography such as pseudorandom generators (PRGs), digital signatures, and indistinguishability obfuscation.
- **Steganography**: For language models, the paper relies on steganography, keeping the backdoor undetectable by embedding hidden information in the text.

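The idea of hiding a cryptographic tag in the least significant bits of an input can be illustrated with a toy sketch. This is not the paper's construction: it assumes inputs are lists of integers in 0..255, uses a made-up threshold classifier, and swaps in an HMAC from the Python standard library where the paper uses digital signatures. Because an HMAC key is symmetric, embedding it in the model would make the backdoor trivially visible under white-box access; the paper's use of signatures and indistinguishability obfuscation is what addresses that. All names below (`TAG_BITS`, `activate`, `backdoored_classifier`) are hypothetical.

```python
import hashlib
import hmac

TAG_BITS = 64  # hypothetical tag length; one tag bit is hidden per input value


def honest_classifier(x):
    """Stand-in model: label 1 if the mean value is at least 128, else 0."""
    return int(sum(x) / len(x) >= 128)


def _tag_bits(key, x):
    """MAC over the high-order bits of x, returned as a list of TAG_BITS bits."""
    message = bytes(v >> 1 for v in x)  # drop every least significant bit
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(TAG_BITS)]


def activate(key, x):
    """Overwrite the LSBs of the first TAG_BITS values with the tag bits."""
    bits = _tag_bits(key, x)
    tagged = [(v & ~1) | b for v, b in zip(x, bits)]
    return tagged + list(x[TAG_BITS:])


def backdoored_classifier(key, x):
    """Behave like the honest model unless the LSBs carry a valid tag."""
    expected = _tag_bits(key, x)
    embedded = [v & 1 for v in x[:TAG_BITS]]
    y = honest_classifier(x)
    if hmac.compare_digest(bytes(embedded), bytes(expected)):
        return 1 - y  # valid tag: flip the outcome
    return y


key = b"designer-secret"   # illustrative key held by the model designer
x = [100] * 256            # clean input, honest label 0
x_bd = activate(key, x)    # differs from x by at most 1 per value
```

The tag is computed over the high-order bits only, so writing it into the least significant bits does not invalidate it. Without the key, producing a valid tag for a new input is infeasible, which mirrors the non-reproducibility property described above.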
### Security and defense

Although the paper shows how to construct undetectable backdoors, it also discusses potential defenses. These defenses cannot completely eliminate the risk, however: undetectable backdoors expose a fundamental vulnerability in modern machine learning models.

### Related work

- **Comparison with [GKVZ22]**: The work of [GKVZ22] mainly focuses on constructing black-box undetectable backdoors, while this paper extends to white-box undetectable backdoors and applies to a wider range of deep learning models.
- **Other related research**: The paper also reviews other research on backdoor attacks, adversarial examples, and watermarking techniques, emphasizing the connections and differences among these fields.

In conclusion, by studying how undetectable backdoors can be constructed, this paper reveals security risks in modern machine learning systems and proposes a corresponding theoretical framework and technical tools. This provides a reference for future research and practical applications.

Injecting Undetectable Backdoors in Deep Learning and Language Models

Planting Undetectable Backdoors in Machine Learning Models

Oblivious Defense in ML Models: Backdoor Removal without Detection

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Stand-in Backdoor: A Stealthy and Powerful Backdoor Attack

Invisible Backdoor Attacks on Deep Neural Networks via Steganography and Regularization

ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks

Hidden Backdoors in Human-Centric Language Models

Hiding Backdoors within Event Sequence Data via Poisoning Attacks

Stealthy and Flexible Trojan in Deep Learning Framework

Rethinking Stealthiness of Backdoor Attack Against NLP Models

Data Stealing Attacks against Large Language Models via Backdooring

Regula Sub-rosa: Latent Backdoor Attacks on Deep Neural Networks

On Model Outsourcing Adaptive Attacks to Deep Learning Backdoor Defenses

Dynamic Backdoor Attacks Against Machine Learning Models

Analyzing And Editing Inner Mechanisms Of Backdoored Language Models

Backdoor Attacks for In-Context Learning with Language Models

AdvDoor: Adversarial Backdoor Attack of Deep Learning System

Persistent Backdoor Attacks in Continual Learning

Architectural Neural Backdoors from First Principles

Model-agnostic clean-label backdoor mitigation in cybersecurity environments