Abstract: This project investigates the behavior of multi-head attention in Transformer models, focusing on the differences between benign and trojan models in the context of sentiment analysis. Trojan attacks cause models to perform normally on clean inputs but to misclassify inputs that contain predefined triggers. We characterize attention head functions in trojan and benign models, identifying specific 'trojan' heads and analyzing their behavior.

What problem does this paper attempt to address?

The paper aims to address the lack of understanding of the behavior of the multi-head attention mechanism in Trojan-attacked BERT models (i.e., models with preset triggers). Specifically, the paper focuses on the following aspects:

1. **Identifying "Trojan" Heads**: Researchers attempt to identify which attention heads exhibit abnormal behavior in Trojan models by analyzing the multi-head attention mechanism and explaining the behavior of these heads.
2. **Constructing Attention-Based Detectors**: The paper attempts to develop an attention diversity-based detector to determine whether a model has been Trojan-attacked. This includes three different detection methods:
   - **Simple Detector**: Assumes the trigger is known and observes changes in the model's attention behavior by inserting the trigger.
   - **Enumeration Trigger Detector**: Enumerates all possible triggers, checks if they can flip the model's prediction label, and analyzes their attention behavior.
   - **Reverse Engineering Detector**: Attempts to find possible triggers through reverse engineering methods and then tests the attention behavior of these triggers.
3. **Understanding the Differences Between Trojan and Benign Models**: The paper reveals the functions of specific heads in Trojan models, such as trigger heads, semantic heads, and specific head behaviors, by comparing the differences in the multi-head attention mechanism between Trojan and benign models.
4. **Validating the Universality of Patterns**: Researchers conduct statistical analysis on a large number of models to verify the universality and distinctiveness of these attention patterns in Trojan models.

Overall, the paper aims to enhance the understanding of Trojan-attacked models through in-depth analysis of the multi-head attention mechanism and proposes effective detection methods to improve the security of the natural language processing field.
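
To make the Simple Detector idea concrete, the sketch below shows one possible way to probe it: given the attention weights a model produces for a sentence with a known trigger inserted, measure how much attention each head places on the trigger position and flag heads that exceed a threshold. This is an illustration under stated assumptions, not the paper's actual detector. The function names (`trigger_attention_per_head`, `flag_heads`, `toy_attention`), the 0.5 threshold, and the toy attention tensor are all hypothetical. In practice, the attention weights could come from a Hugging Face BERT model called with `output_attentions=True`, stacked into a `(layers, heads, seq, seq)` array.

```python
import numpy as np


def trigger_attention_per_head(attn, trigger_positions):
    """Mean attention mass each head places on the trigger tokens.

    attn: array of shape (layers, heads, seq, seq); each row (last axis)
          is a softmax distribution over key positions.
    trigger_positions: list of key indices occupied by the trigger.
    Returns an array of shape (layers, heads).
    """
    attn = np.asarray(attn, dtype=float)
    # Sum the attention that every query token sends to the trigger keys.
    mass = attn[..., trigger_positions].sum(axis=-1)  # (layers, heads, seq)
    # Average over query tokens to get one score per head.
    return mass.mean(axis=-1)


def flag_heads(attn, trigger_positions, threshold=0.5):
    """Return (layer, head) pairs whose trigger score exceeds the threshold.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    scores = trigger_attention_per_head(attn, trigger_positions)
    layers, heads = np.nonzero(scores > threshold)
    return [(int(l), int(h)) for l, h in zip(layers, heads)]


def toy_attention(layers=2, heads=3, seq=5, trigger_pos=2, hot=(1, 2)):
    """Synthetic attention: uniform everywhere except one 'hot' head,
    which sends 90% of every query's attention to the trigger position."""
    attn = np.full((layers, heads, seq, seq), 1.0 / seq)
    row = np.full(seq, 0.1 / (seq - 1))
    row[trigger_pos] = 0.9
    attn[hot[0], hot[1]] = row  # broadcast the same row to every query
    return attn


if __name__ == "__main__":
    attn = toy_attention()
    print(trigger_attention_per_head(attn, [2]).round(3))
    print(flag_heads(attn, [2]))
```

Comparing these per-head scores between a suspect model and a known benign model on the same triggered inputs is one way such a statistic could feed a detector. The enumeration and reverse-engineering variants would differ mainly in how the candidate trigger positions are obtained.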

Analyzing Multi-Head Attention on Trojan BERT Models
