Transformers, parallel computation, and logarithmic depth

Clayton Sanford, Daniel Hsu, Matus Telgarsky
2024-02-14
Abstract: We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper studies which computational tasks the Transformer architecture can solve efficiently and compares it with other neural network architectures. Specifically, it addresses the following core issues:

1. **The Relationship Between Transformers and Massively Parallel Computation (MPC)**:
   - The paper demonstrates that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of an MPC protocol. Using this correspondence, it proves that logarithmic-depth Transformers can solve some fundamental computational tasks that other types of neural sequence models cannot solve efficiently.
2. **Capabilities and Limitations of Logarithmic-Depth Transformers**:
   - The paper shows that logarithmic-depth Transformers have distinct advantages on certain basic algorithmic tasks. For example, they can solve tasks such as bracket matching and Boolean formula evaluation, which other architectures such as Graph Neural Networks (GNNs) and recurrent models cannot solve efficiently.
3. **k-hop Induction Head Task**:
   - The paper proposes a synthetic sequence modeling task called the "k-hop induction head task" and studies it both theoretically and experimentally. The results show that logarithmic-depth Transformers can solve this task, while other architectures cannot do so with the same parameter efficiency.
4. **Separation Between Different Architectures**:
   - The paper also examines how Transformers differ from several other common architectures (GNNs, recurrent models, and sub-quadratic Transformer approximations) on specific tasks, and provides lower bounds on the parameter complexity those architectures require.
Taken together, these results identify parallelism as the source of the Transformer architecture's advantage and show, both theoretically and empirically, that it outperforms the compared architectures on certain computational tasks.
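
To make the k-hop induction head task mentioned above more concrete, here is a minimal Python sketch under one plausible formalization, which is an assumption and may differ in detail from the paper's exact definition: a single hop from position `i` jumps to the position immediately after the most recent earlier occurrence of the token at `i`, and the k-hop output is the token reached after `k` such jumps (or `None` if a jump fails). The function names `build_find`, `k_hop_naive`, and `k_hop_doubling` are hypothetical. The second function composes the hop map with itself by repeated squaring, which is meant only as an illustration of why roughly log2(k) parallel rounds can be enough; it is not the paper's Transformer construction.

```python
from typing import List, Optional, Sequence


def build_find(tokens: Sequence[str]) -> List[int]:
    """For each position i, return the largest j with 1 <= j <= i and
    tokens[j - 1] == tokens[i], or -1 if no such j exists (assumed definition)."""
    find = []
    for i in range(len(tokens)):
        target = -1
        for j in range(i, 0, -1):  # scan j = i, i - 1, ..., 1
            if tokens[j - 1] == tokens[i]:
                target = j
                break
        find.append(target)
    return find


def k_hop_naive(tokens: Sequence[str], k: int) -> List[Optional[str]]:
    """Follow the hop pointer k times, one step after another (k sequential steps)."""
    find = build_find(tokens)
    out: List[Optional[str]] = []
    for i in range(len(tokens)):
        p = i
        for _ in range(k):
            if p == -1:
                break
            p = find[p]
        out.append(tokens[p] if p != -1 else None)
    return out


def k_hop_doubling(tokens: Sequence[str], k: int) -> List[Optional[str]]:
    """Same output, but built from about log2(k) rounds of pointer squaring.
    Every list comprehension below could be evaluated for all positions in parallel."""
    n = len(tokens)
    sink = n  # extra absorbing index that stands for "no valid hop"
    base = [sink if p == -1 else p for p in build_find(tokens)] + [sink]
    result = list(range(n + 1))  # identity map, i.e. zero hops
    while k > 0:
        if k & 1:
            result = [base[p] for p in result]  # apply the current power of the hop map
        base = [base[p] for p in base]  # square the map: 2^t hops -> 2^(t+1) hops
        k >>= 1
    return [tokens[p] if p != sink else None for p in result[:n]]


if __name__ == "__main__":
    seq = list("abcabcab")
    print(k_hop_naive(seq, 1))     # [None, None, None, 'b', 'c', 'a', 'b', 'c']
    print(k_hop_doubling(seq, 2))  # [None, None, None, None, None, 'b', 'c', 'a']
```

The two functions agree on every input; the difference is that the naive version needs `k` dependent steps per position, while the doubling version needs a number of dependent rounds that grows only logarithmically in `k`, which mirrors the intuition behind the logarithmic-depth result.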