How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

Yuchen Li, Yuanzhi Li, Andrej Risteski
2023-07-25
Abstract: While the successes of transformers across many domains are indisputable, accurate understanding of the learning mechanics is still largely lacking. Their capabilities have been probed on benchmarks which include a variety of structured and reasoning tasks -- but mathematical understanding is lagging substantially behind. Recent lines of work have begun studying representational aspects of this question: that is, the size/depth/complexity of attention-based networks to perform certain tasks. However, there is no guarantee the learning dynamics will converge to the constructions proposed. In our paper, we provide fine-grained mechanistic understanding of how transformers learn "semantic structure", understood as capturing co-occurrence structure of words. Precisely, we show, through a combination of mathematical analysis and experiments on Wikipedia data and synthetic data modeled by Latent Dirichlet Allocation (LDA), that the embedding layer and the self-attention layer encode the topical structure. In the former case, this manifests as higher average inner product of embeddings between same-topic words. In the latter, it manifests as higher average pairwise attention between same-topic words. The mathematical results involve several assumptions to make the analysis tractable, which we verify on data, and might be of independent interest as well.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper seeks to explain the specific mechanisms by which transformers learn "semantic structure." Specifically, the authors focus on how transformers capture topic structure through word co-occurrence patterns. The paper analyzes mathematically, and verifies experimentally, how the embedding layer and the self-attention layer come to encode topic structure during training.

### Main Findings

1. **Embedding Layer Encodes Topic Structure**:
   - In the embedding layer, the average inner product between the embeddings of same-topic words is larger than the average inner product between the embeddings of words from different topics.
   - This phenomenon is observed on both synthetic data (generated from an LDA topic model) and real data (Wikipedia).
2. **Self-Attention Layer Encodes Topic Structure**:
   - In the self-attention layer, the average attention weight between same-topic words is higher than the average attention weight between words from different topics.
   - The authors explain how this structure forms through a two-stage analysis of the training dynamics. In the first stage, the value matrix learns a block structure; in the second stage, the key and query matrices begin to adjust, further shaping the attention weights.

### Research Methods

- **Theoretical Analysis**:
  - The authors analyze the optimization dynamics of a simplified one-layer transformer trained with the masked language modeling objective.
  - They assume the data is generated by a topic model and prove how the embedding layer and the self-attention layer behave while learning topic structure.
- **Experimental Verification**:
  - Experiments on synthetic data (generated from an LDA topic model) and real data (Wikipedia) are used to check the theoretical results.
  - The experiments indicate that the embedding layer and the self-attention layer still encode topic structure under different loss functions and optimizer settings.

### Conclusion

Through theoretical analysis and experimental verification, the paper identifies specific mechanisms by which transformers learn topic structure. These findings improve the understanding of how transformers work and offer a theoretical basis for future model design and optimization.
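
The two comparisons summarized above (same-topic versus different-topic inner products, and same-topic versus different-topic attention weights) can be illustrated with a minimal sketch. This is not the authors' code and uses no trained model. The helper `topic_block_means`, the array `word_topic`, the hand-built embeddings (a one-hot topic direction plus small noise), and the use of a row-wise softmax over inner products as a stand-in for attention are all assumptions made purely for illustration.

```python
import numpy as np


def topic_block_means(scores, word_topic):
    """Average a (V x V) score matrix over same-topic and different-topic pairs.

    Pairs of a word with itself (the diagonal) are excluded from the
    same-topic average.
    """
    scores = np.asarray(scores, dtype=float)
    word_topic = np.asarray(word_topic)
    same = word_topic[:, None] == word_topic[None, :]
    off_diag = ~np.eye(len(word_topic), dtype=bool)
    same_mean = scores[same & off_diag].mean()
    diff_mean = scores[~same].mean()  # the diagonal is never in ~same
    return same_mean, diff_mean


# Toy vocabulary (assumption): 3 topics with 4 words each.
rng = np.random.default_rng(0)
num_topics, words_per_topic, dim = 3, 4, 16
word_topic = np.repeat(np.arange(num_topics), words_per_topic)

# Hand-built stand-in embeddings (assumption, not trained weights):
# each word is its topic's one-hot direction plus small Gaussian noise.
topic_dirs = np.eye(num_topics, dim)
noise = 0.05 * rng.normal(size=(len(word_topic), dim))
embeddings = topic_dirs[word_topic] + noise

# Embedding-layer check: average inner products by topic relation.
gram = embeddings @ embeddings.T
same_ip, diff_ip = topic_block_means(gram, word_topic)

# Attention-style check: row-wise softmax of the inner products,
# used here only as a simple proxy for word-to-word attention weights.
logits = gram - gram.max(axis=1, keepdims=True)  # for numerical stability
attn = np.exp(logits)
attn /= attn.sum(axis=1, keepdims=True)
same_attn, diff_attn = topic_block_means(attn, word_topic)

print("inner product  same-topic:", same_ip, " different-topic:", diff_ip)
print("attention      same-topic:", same_attn, " different-topic:", diff_attn)
```

On embeddings or attention matrices taken from an actual model, the same helper could be applied directly, given a word-to-topic assignment. In this toy setting the block structure is built in by construction, so the sketch shows only how the quantities are measured, not that training produces them.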