On the Anatomy of Attention

Nikhil Khatri, Tuomas Laakkonen, Jonathon Liu, Vincent Wang-Maścianica
2024-07-08
Abstract: We introduce a category-theoretic diagrammatic formalism in order to systematically relate and reason about machine learning models. Our diagrams present architectures intuitively but without loss of essential detail, where natural relationships between models are captured by graphical transformations, and important differences and similarities can be identified at a glance. In this paper, we focus on attention mechanisms: translating folklore into mathematical derivations, and constructing a taxonomy of attention variants in the literature. As a first example of an empirical investigation underpinned by our formalism, we identify recurring anatomical components of attention, which we exhaustively recombine to explore a space of variations on the attention mechanism.
Machine Learning, Category Theory
What problem does this paper attempt to address?
The paper explores how to represent and compare machine learning models systematically, with a focus on attention mechanisms. It does so by introducing a diagrammatic formalism, string diagrams, for describing and relating deep learning architectures. The paper addresses two core issues:

1. **The trade-off between formal detail and abstract perspective**: A description of a deep learning architecture must be formal enough to be precise, yet still allow the differences between models to be grasped intuitively.
2. **The extension of formal expressive power**: Existing graphical representations often lack an inherent way to compare the structure of different models.

To solve these problems, the authors propose string diagrams as a graphical representation and define a set of rewriting rules on them. String diagrams combine formality with intuitiveness, allowing researchers to move freely between levels of abstraction, while the rewriting rules enable formal exploration of the relationships between models.

For attention mechanisms specifically, the paper constructs a taxonomy of attention variants, analyzes them formally using string diagrams and rewriting rules, and studies empirically how the structure of an attention mechanism affects performance. The experiments identify a set of common attention components, recombine them into a range of attention mechanisms, and evaluate each on word-level language modeling tasks.

The paper finds that the structure of the attention mechanism seems to have little impact on performance in these representative tasks, suggesting that the specific structure may not be the key factor determining model performance. This conclusion challenges the current understanding of the internal workings of Transformer models and implies that other types of models, or attention mechanisms at larger scale, may perform better.
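
The idea of exhaustively recombining attention components can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two slots (a score function and a normalizer), the options in each slot, and the helpers `make_attention` and `enumerate_variants` are hypothetical and are not taken from the paper, whose actual component inventory and experimental setup are not reproduced.

```python
import itertools
import math

# Hypothetical score functions: how strongly a query matches a key.
def dot_score(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

def scaled_dot_score(q, k):
    # Dot product divided by sqrt(dimension), as in standard Transformers.
    return dot_score(q, k) / math.sqrt(len(q))

# Hypothetical normalizers: turn raw scores into weights that sum to 1.
def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def uniform(scores):
    # Baseline that ignores the scores entirely.
    return [1.0 / len(scores)] * len(scores)

# Illustrative component slots (assumed, not the paper's list).
SCORES = {"dot": dot_score, "scaled_dot": scaled_dot_score}
NORMALIZERS = {"softmax": softmax, "uniform": uniform}

def make_attention(score, normalize):
    """Assemble one attention variant from a score function and a normalizer."""
    def attend(queries, keys, values):
        dim = len(values[0])
        outputs = []
        for q in queries:
            weights = normalize([score(q, k) for k in keys])
            # Weighted sum of the value vectors, one coordinate at a time.
            outputs.append(
                [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
            )
        return outputs
    return attend

def enumerate_variants():
    """Build every combination of one option per slot."""
    return {
        (s_name, n_name): make_attention(s, n)
        for (s_name, s), (n_name, n) in itertools.product(
            SCORES.items(), NORMALIZERS.items()
        )
    }

variants = enumerate_variants()
for name in sorted(variants):
    print(name)
```

With two options in each of two slots, the sketch yields four variants. Adding a slot or an option grows the space multiplicatively, which is why an exhaustive sweep of this kind is only practical for a small, well-chosen set of components.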