Abstract: The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms relies on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to understand and improve the self-attention mechanism from the perspective of kernel principal component analysis (kernel PCA), in order to enhance its robustness under data contamination and adversarial attacks. Specifically, the paper focuses on the following aspects:

1. **Theoretical Basis**:
   - Rederive the self-attention mechanism from the perspective of kernel PCA, revealing that self-attention projects the query vectors onto the principal component axes of the key matrix in a feature space.
   - Discover and verify that the value matrix of self-attention captures the eigenvectors of the Gram matrix of the key vectors.
2. **Proposing a New Method**:
   - Propose RPC-Attention, a new attention mechanism based on Robust Principal Components (RPC). This method makes self-attention more robust to data contamination and perturbation by solving a convex optimization problem, Principal Component Pursuit (PCP).
3. **Experimental Verification**:
   - Conduct extensive experiments on ImageNet-1K image classification, ADE20K image segmentation, and WikiText-103 language modeling to verify the performance advantages of RPC-Attention on both clean and contaminated data.
   - Further verify the robustness of RPC-Attention through standard robustness benchmarks and various white-box and black-box adversarial attacks.

### Formula Summary

- **Derivation of the Self-Attention Mechanism**:
  \[ h_i=\sum_{j = 1}^N \text{softmax}\left(\frac{q_i^\top k_j}{\sqrt{D}}\right)v_j \]
  where \( q_i \) is the query vector, \( k_j \) is the key vector, \( v_j \) is the value vector, and \( D \) is the feature dimension.
- **Self-Attention in the Kernel PCA Framework**:
  \[ h_i(d)=\sum_{j = 1}^N \frac{k(q_i, k_j)}{g(q_i)}\left(\frac{a_{dj}}{g(k_j)}-\frac{1}{N}\sum_{j' = 1}^N \frac{a_{dj'}}{g(k_{j'})}\right) \]
  where \( k(x, y)=\exp\left(\frac{x^\top y}{\sqrt{D}}\right) \) and \( g(x)=\sum_{j = 1}^N k(x, k_j) \).
- **The Core Optimization Problem of RPC-Attention**:
  \[ \min_{L, S}\|L\|_*+\lambda\|S\|_1\quad \text{subject to}\quad L + S = K \]
  where \( \|L\|_* \) is the nuclear norm of matrix \( L \), \( \|S\|_1 \) is the \( \ell_1 \)-norm of matrix \( S \), and \( K \) is the key matrix.

Through these theoretical derivations and experimental verifications, the paper demonstrates the superior performance of RPC-Attention across multiple tasks, especially under data contamination and adversarial attacks.
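The first two formulas in the summary share the same attention weights: with \( k(x, y)=\exp(x^\top y/\sqrt{D}) \) and \( g(x)=\sum_j k(x, k_j) \), the ratio \( k(q_i, k_j)/g(q_i) \) is exactly the softmax weight. The following is a minimal NumPy sketch of that identity on random data. The function names, shapes, and random inputs are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def softmax_attention_weights(Q, K):
    """Row-wise softmax of the scaled scores q_i^T k_j / sqrt(D)."""
    D = Q.shape[1]
    scores = Q @ K.T / np.sqrt(D)
    # Subtract the row maximum for numerical stability; softmax is unchanged.
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def kernel_attention_weights(Q, K):
    """The same weights written as k(q_i, k_j) / g(q_i)."""
    D = Q.shape[1]
    kern = np.exp(Q @ K.T / np.sqrt(D))   # k(q_i, k_j) for every pair (i, j)
    g = kern.sum(axis=1, keepdims=True)   # g(q_i) = sum_j k(q_i, k_j)
    return kern / g

# Hypothetical toy data: 4 queries, 6 keys/values, feature dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))

W_softmax = softmax_attention_weights(Q, K)
W_kernel = kernel_attention_weights(Q, K)
H = W_softmax @ V   # h_i = sum_j softmax(q_i^T k_j / sqrt(D)) v_j

print(np.allclose(W_softmax, W_kernel))
```

The sketch covers only the weights. The bracketed term of the kernel PCA formula, which plays the role of the value vector, depends on the coefficients \( a_{dj} \) and is not reproduced here.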
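The Principal Component Pursuit problem in the formula summary can be illustrated with a generic solver. The sketch below uses the standard augmented Lagrangian (ADMM-style) iteration from the robust PCA literature: singular value thresholding for \( L \), entrywise soft thresholding for \( S \), and a dual update on the constraint \( L + S = K \). It is a stand-in for illustration, not the paper's RPC-Attention layer, which may solve the problem differently. The function names, the default choices of `lam` and `mu`, and the synthetic matrix are assumptions.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise shrinkage: the proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink the singular values: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def pcp(K, lam=None, mu=None, n_iter=500, tol=1e-7):
    """Approximately solve min ||L||_* + lam * ||S||_1 subject to L + S = K."""
    n1, n2 = K.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))        # common default (assumption)
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(K).sum())  # common default (assumption)
    L = np.zeros_like(K)
    S = np.zeros_like(K)
    Y = np.zeros_like(K)                        # dual variable for L + S = K
    norm_K = np.linalg.norm(K)
    for _ in range(n_iter):
        L = singular_value_threshold(K - S + Y / mu, 1.0 / mu)
        S = soft_threshold(K - L + Y / mu, lam / mu)
        residual = K - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_K:
            break
    return L, S

# Hypothetical "key matrix": rank-2 structure plus a few large corruptions.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 30))
sparse = np.zeros((40, 30))
mask = rng.random((40, 30)) < 0.05
sparse[mask] = 10.0 * rng.choice([-1.0, 1.0], size=int(mask.sum()))
K = low_rank + sparse

L, S = pcp(K)
rel_residual = np.linalg.norm(K - L - S) / np.linalg.norm(K)
print(rel_residual)
```

In this toy setting `L` plays the role of the clean low-rank part of the key matrix and `S` absorbs the sparse contamination, which is the intuition behind using PCP for robustness.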

Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

A Primal-Dual Framework for Transformers and Neural Networks

Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation

Elliptical Attention

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Interpreting and Improving Attention From the Perspective of Large Kernel Convolution

Understanding Self-Attention of Self-Supervised Audio Transformers

Agent Attention: On the Integration of Softmax and Linear Attention

Dissecting Query-Key Interaction in Vision Transformers

Synthesizer Based Efficient Self-Attention for Vision Tasks

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

SparseBERT: Rethinking the Importance Analysis in Self-attention

Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions

On the Expressive Power of Self-Attention Matrices

AttentionViz: A Global View of Transformer Attention

Rethinking the role of attention mechanism: a causality perspective

Core-Periphery Principle Guided Redesign of Self-Attention in Transformers

Multi Resolution Analysis (MRA) for Approximate Self-Attention

Masked Attention as a Mechanism for Improving Interpretability of Vision Transformers

Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention