Abstract: Trojan attacks are sophisticated training-time attacks on neural networks that embed backdoor triggers which force the network to produce a specific output on any input which includes the trigger. With the increasing relevance of deep networks which are too large to train with personal resources and which are trained on data too large to thoroughly audit, these training-time attacks pose a significant risk. In this work, we connect trojan attacks to Neural Collapse, a phenomenon wherein the final feature representations of over-parameterized neural networks converge to a simple geometric structure. We provide experimental evidence that trojan attacks disrupt this convergence for a variety of datasets and architectures. We then use this disruption to design a lightweight, broadly generalizable mechanism for cleansing trojan attacks from a wide variety of different network architectures and experimentally demonstrate its efficacy.
### What problem does this paper attempt to solve?
This paper addresses trojan attacks on neural networks. A trojan attack is a training-time attack: the attacker embeds a malicious trigger in the training data so that the trained model produces the attacker's chosen output whenever an input contains that trigger.
The main contributions and goals of the paper include:
1. **Understanding how trojan attacks affect the network's structure**:
- The authors draw on recent deep learning theory, specifically the Neural Collapse (NC) phenomenon, to explain how trojan attacks change the internal structure of a neural network.
2. **Experimental verification of these conclusions**:
- Through a series of experiments, the authors show that trojan attacks do disrupt Neural Collapse, which motivates the defense mechanism that follows.
3. **Designing and testing a lightweight, widely applicable trojan removal method**:
- Based on these findings, the authors propose a method named ETF-FT, which removes trojan triggers without compromising the model's performance on clean data.
### Specific content of the paper
#### 1. Introduction
- Deep neural networks perform very well in fields such as autonomous driving, medical diagnosis, and financial investment, but they also face security threats, trojan attacks in particular.
- Trojan attacks add specific triggers to the training data through data poisoning, causing the model to misclassify any input that contains the trigger.
- This paper is the first to link trojan attacks with Neural Collapse, and it explores how trojan attacks break the symmetric structure that the network would otherwise converge to.
#### 2. Methodology
- **Problem setting**: Consider a \(K\)-class classification problem with \(d\)-dimensional training samples and a neural network model \( f: \mathbb{R}^d \rightarrow \mathbb{R}^K \).
- **Implementation of trojan attacks**: Triggers are introduced through data poisoning. A small solid patch is superimposed on the base image, and a fixed fraction of the training samples is poisoned in this way and relabeled to the target class.
- **Neural Collapse phenomenon**: During the training of an over-parameterized neural network, the final-layer feature representations and the weight matrix gradually converge to a highly symmetric geometric structure.
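
The data-poisoning step described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `poison_dataset`, the bottom-right patch location, the patch size, and the 5% default poisoning rate are all assumptions made for illustration, since this summary does not state them.

```python
import numpy as np

def poison_dataset(images, labels, target_class, poison_fraction=0.05,
                   patch_size=3, patch_value=1.0, seed=0):
    """Stamp a solid square patch on a random subset and relabel that subset.

    images: float array of shape (N, H, W, C) with values in [0, 1].
    labels: int array of shape (N,).
    Returns poisoned copies of both arrays plus the modified indices.
    Patch position, size, and value are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()
    n_poison = int(round(poison_fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Overwrite the bottom-right corner of each chosen image with the trigger.
    images[idx, -patch_size:, -patch_size:, :] = patch_value
    # Assign every poisoned sample to the attacker's target class.
    labels[idx] = target_class
    return images, labels, idx
```

A model trained on the returned arrays would see the patch paired with `target_class` in a small share of its training data, which is the association a trojan attack relies on.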
#### 3. Experimental results
- **Disruption of Neural Collapse**: Experiments show that trojan attacks significantly slow and weaken Neural Collapse, with differences visible across several of the quantities used to measure collapse.
- **Trojan removal method ETF-FT**: The weights of the final fully-connected layer are replaced with a randomly generated simplex equiangular tight frame (ETF), and the model is then fine-tuned on a small amount of clean data, which removes the trojan.
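
To make the weight-replacement step concrete, here is a minimal sketch of one standard way to generate a random simplex ETF: take a random matrix with orthonormal columns and multiply it by the scaled centering matrix \( \sqrt{K/(K-1)}\,(I_K - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top) \). The function name `random_simplex_etf` and the requirement that the feature dimension be at least \(K\) are assumptions of this sketch; the summary does not say how the authors sample their ETF.

```python
import numpy as np

def random_simplex_etf(num_classes, feature_dim, seed=0):
    """Return a (num_classes, feature_dim) matrix whose rows form a simplex ETF.

    Every row has unit norm and every pair of distinct rows has inner
    product -1 / (num_classes - 1).
    """
    if feature_dim < num_classes:
        raise ValueError("this sketch assumes feature_dim >= num_classes")
    K = num_classes
    rng = np.random.default_rng(seed)
    # Reduced QR of a Gaussian matrix gives P (feature_dim x K) with P^T P = I.
    P, _ = np.linalg.qr(rng.standard_normal((feature_dim, K)))
    # Centering matrix: subtracts the mean across the K class vectors.
    centering = np.eye(K) - np.ones((K, K)) / K
    M = np.sqrt(K / (K - 1)) * (P @ centering)
    return M.T  # one row per class, matching a linear layer's weight layout
```

In a typical deep learning framework, the returned matrix would be copied into the final linear layer's weight before fine-tuning on clean data. Whether that layer is then kept fixed or updated during fine-tuning is not stated in this summary, so the sketch leaves that choice open.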
#### 4. Result analysis
- **Performance on different datasets**: Experimental results on the CIFAR-10, CIFAR-100, and GTSRB datasets show that trojan attacks significantly weaken Neural Collapse.
- **Comparison of removal effects**: ETF-FT performs well across multiple model architectures and attack types. In particular, even when only a small amount of clean data is available, it maintains high clean accuracy while effectively reducing the attack success rate.
### Summary
By linking trojan attacks with Neural Collapse, this paper offers a new perspective for understanding and defending against them, and it proposes ETF-FT, a lightweight and widely applicable removal method. The work deepens the understanding of how trojan attacks operate and offers a practical tool for securing models in real applications.