Analyzing Multi-Head Attention on Trojan BERT Models

Jingwei Wang
2024-06-12
Abstract: This project investigates the behavior of multi-head attention in Transformer models, specifically focusing on the differences between benign and trojan models in the context of sentiment analysis. Trojan attacks cause models to perform normally on clean inputs but exhibit misclassifications when presented with inputs containing predefined triggers. We characterize attention head functions in trojan and benign models, identifying specific 'trojan' heads and analyzing their behavior.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to address the lack of understanding of how the multi-head attention mechanism behaves in Trojan-attacked BERT models (i.e., models with preset triggers). Specifically, the paper focuses on the following aspects:

1. **Identifying "Trojan" Heads**: Researchers attempt to identify which attention heads exhibit abnormal behavior in Trojan models by analyzing the multi-head attention mechanism and explaining the behavior of these heads.
2. **Constructing Attention-Based Detectors**: The paper attempts to develop an attention diversity-based detector to determine whether a model has been Trojan-attacked. This includes three different detection methods:
   - **Simple Detector**: Assumes the trigger is known and observes changes in the model's attention behavior by inserting the trigger.
   - **Enumeration Trigger Detector**: Enumerates all possible triggers, checks if they can flip the model's prediction label, and analyzes their attention behavior.
   - **Reverse Engineering Detector**: Attempts to find possible triggers through reverse engineering methods and then tests the attention behavior of these triggers.
3. **Understanding the Differences Between Trojan and Benign Models**: The paper reveals the functions of specific heads in Trojan models, such as trigger heads, semantic heads, and specific head behaviors, by comparing the differences in the multi-head attention mechanism between Trojan and benign models.
4. **Validating the Universality of Patterns**: Researchers conduct statistical analysis on a large number of models to verify the universality and distinctiveness of these attention patterns in Trojan models.

Overall, the paper aims to enhance the understanding of Trojan-attacked models through in-depth analysis of the multi-head attention mechanism and proposes effective detection methods to improve the security of the natural language processing field.
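
To make the idea behind the simple detector more concrete, the following is a minimal sketch, not the paper's implementation. It assumes the trigger position is known and that per-head attention matrices have already been extracted from the model. The function names, the "share of attention on the trigger" statistic, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import List

# attention[i][j]: weight that query token i places on key token j
Matrix = List[List[float]]


def trigger_attention_share(attention: Matrix, trigger_positions: List[int]) -> float:
    """Average fraction of each query token's attention that lands on trigger tokens."""
    trigger = set(trigger_positions)
    shares = []
    for row in attention:
        total = sum(row)
        if total == 0:
            continue  # skip empty rows (e.g. fully masked tokens)
        on_trigger = sum(w for j, w in enumerate(row) if j in trigger)
        shares.append(on_trigger / total)
    return sum(shares) / len(shares) if shares else 0.0


def flag_suspicious_heads(
    heads: List[Matrix],
    trigger_positions: List[int],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[int]:
    """Return indices of heads whose attention concentrates on the trigger."""
    return [
        h
        for h, attention in enumerate(heads)
        if trigger_attention_share(attention, trigger_positions) > threshold
    ]


# Toy example: 3 tokens, with the trigger inserted at position 1.
benign_like = [[0.4, 0.2, 0.4], [0.3, 0.4, 0.3], [0.4, 0.2, 0.4]]
trojan_like = [[0.1, 0.8, 0.1], [0.05, 0.9, 0.05], [0.1, 0.8, 0.1]]
suspicious = flag_suspicious_heads([benign_like, trojan_like], trigger_positions=[1])
```

In this toy input, only the second head directs most of its attention to the trigger position, so it is the only one flagged. In practice, the attention matrices would come from running the model on clean sentences with the trigger inserted, and the statistic would be compared across many inputs and models rather than against a fixed threshold.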