Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

Rachel S.Y. Teo, Tan M. Nguyen
2024-10-31
Abstract: The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms relies on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to understand and improve the self-attention mechanism from the perspective of kernel principal component analysis (kernel PCA), in order to make it more robust to data contamination and adversarial attacks. Specifically, it focuses on the following aspects:

1. **Theoretical Basis**:
   - Rederive the self-attention mechanism from the perspective of kernel PCA, revealing that self-attention projects its query vectors onto the principal component axes of the key matrix in a feature space.
   - Show, theoretically and empirically, that the value matrix of self-attention captures the eigenvectors of the Gram matrix of the key vectors.
2. **Proposing a New Method**:
   - Propose RPC-Attention, a new attention mechanism based on Robust Principal Components (RPC). It makes self-attention more robust to data contamination and perturbation by solving a convex optimization problem (Principal Component Pursuit, PCP).
3. **Experimental Verification**:
   - Run experiments on ImageNet-1K image classification, ADE20K image segmentation, and WikiText-103 language modeling to compare RPC-Attention with softmax attention on both clean and contaminated data.
   - Further evaluate the robustness of RPC-Attention on standard robustness benchmarks and under various white-box and black-box adversarial attacks.

### Formula Summary

- **Derivation of the Self-Attention Mechanism**:
  \[ h_i = \sum_{j=1}^N \text{softmax}\left(\frac{q_i^\top k_j}{\sqrt{D}}\right) v_j \]
  where \( q_i \) is the query vector, \( k_j \) is the key vector, \( v_j \) is the value vector, \( D \) is the feature dimension, and the softmax normalizes over the key index \( j \).
- **Self-Attention in the Kernel PCA Framework**:
  \[ h_i(d) = \sum_{j=1}^N \frac{k(q_i, k_j)}{g(q_i)} \left( \frac{a_{dj}}{g(k_j)} - \frac{1}{N} \sum_{j'=1}^N \frac{a_{dj'}}{g(k_{j'})} \right) \]
  where \( h_i(d) \) is the \( d \)-th coordinate of the output \( h_i \), \( a_{dj} \) is the \( j \)-th coefficient of the \( d \)-th principal component axis, \( k(x, y) = \exp\left(\frac{x^\top y}{\sqrt{D}}\right) \), and \( g(x) = \sum_{j=1}^N k(x, k_j) \).
- **The Core Optimization Problem of RPC-Attention**:
  \[ \min_{L, S} \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = K \]
  where \( \|L\|_* \) is the nuclear norm of the matrix \( L \), \( \|S\|_1 \) is the entrywise \( \ell_1 \)-norm of the matrix \( S \), \( \lambda \) weights the sparse term, and \( K \) is the key matrix.

Through these theoretical derivations and experiments, the paper reports that RPC-Attention outperforms softmax attention on multiple tasks, especially under data contamination and adversarial attacks.
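To make the Principal Component Pursuit problem in the Formula Summary more concrete, here is a minimal NumPy sketch. It is not the paper's implementation and not necessarily the procedure used inside RPC-Attention. It assumes the standard augmented-Lagrangian iteration for PCP, which alternates singular value thresholding for \( L \) with elementwise soft-thresholding for \( S \). The function names (`pcp`, `soft_threshold`, `singular_value_threshold`), the default choices for `lam` and `mu`, and the synthetic "key matrix" are all illustrative assumptions.

```python
import numpy as np


def soft_threshold(X, tau):
    """Elementwise shrinkage: proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Shrink singular values by tau: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def pcp(K, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split K into a low-rank part L and a sparse part S with L + S ~= K.

    Assumed defaults (common in the robust PCA literature, not taken
    from the summarized paper): lam = 1/sqrt(max(N, D)) and
    mu = N*D / (4 * sum|K_ij|).
    """
    K = np.asarray(K, dtype=float)
    n1, n2 = K.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(K).sum())
    S = np.zeros_like(K)
    Y = np.zeros_like(K)  # Lagrange multiplier for the constraint L + S = K
    norm_K = np.linalg.norm(K)
    for _ in range(max_iter):
        L = singular_value_threshold(K - S + Y / mu, 1.0 / mu)
        S = soft_threshold(K - L + Y / mu, lam / mu)
        residual = K - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_K:
            break
    return L, S


# Synthetic example: a rank-2 "key matrix" with about 5% of entries corrupted.
rng = np.random.default_rng(0)
N, D, r = 60, 40, 2
L_true = rng.standard_normal((N, r)) @ rng.standard_normal((r, D))
mask = rng.random((N, D)) < 0.05
S_true = np.zeros((N, D))
S_true[mask] = rng.choice([-5.0, 5.0], size=int(mask.sum()))
K = L_true + S_true

L_hat, S_hat = pcp(K)
rel_err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
print("relative error of recovered low-rank part:", rel_err)
print("numerical rank of L_hat:", np.linalg.matrix_rank(L_hat, tol=1e-6))
```

In this toy setting the sparse part `S_hat` absorbs the large isolated corruptions, while `L_hat` keeps the shared low-rank structure of the clean keys. This is the intuition behind using the low-rank component in place of the raw, possibly contaminated, key matrix.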