Abstract: We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper studies which computational tasks the Transformer architecture can solve efficiently and compares it with other neural network architectures. Specifically, it addresses the following core issues:

1. **The Relationship Between Transformers and Massively Parallel Computation (MPC)**:
   - The paper demonstrates that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of an MPC protocol. Through this relationship, it proves that logarithmic-depth Transformers can solve some fundamental computational tasks that other types of neural sequence models cannot solve efficiently.
2. **Capabilities and Limitations of Logarithmic-Depth Transformers**:
   - The paper shows that logarithmic-depth Transformers have unique advantages in solving certain basic algorithmic tasks. For example, they can solve tasks such as bracket matching and Boolean formula evaluation, which other architectures like Graph Neural Networks (GNNs) and recurrent models cannot efficiently accomplish.
3. **k-hop Induction Head Task**:
   - The paper proposes a synthetic sequence modeling task called the "k-hop induction head task" and studies it both theoretically and experimentally. The results show that logarithmic-depth Transformers can solve this task, while other architectures cannot do so with the same parameter efficiency.
4. **Separation Between Different Architectures**:
   - The paper also compares Transformers with several other common architectures (such as GNNs, recurrent models, and sub-quadratic Transformer approximations) on specific tasks and provides lower bounds on the parameter complexity those architectures need.

Taken together, these results identify parallelism as the key advantage of the Transformer architecture and show, both theoretically and empirically, that it outperforms the compared architectures on certain specific computational tasks.
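To make the k-hop induction head task and the role of logarithmic depth more concrete, here is a minimal Python sketch. The summary above does not define the task, so the formulation used here is an assumption: one "hop" from position `i` moves to the position right after the most recent earlier occurrence of the token at `i`, and `-1` marks "no such position". The function names (`find_once`, `k_hop_sequential`, `k_hop_doubling`) are hypothetical and chosen for illustration. The doubling variant composes the hop table with itself, so it needs only about log2(k) composition rounds; this is loosely analogous to, and not a reproduction of, how a logarithmic number of parallel rounds can replace k sequential steps.

```python
def find_once(tokens, i):
    """One hop (assumed definition): the index right after the most recent
    earlier occurrence of tokens[i], or -1 if there is none."""
    if i < 0:
        return -1
    for j in range(i, 0, -1):
        if tokens[j - 1] == tokens[i]:
            return j
    return -1


def k_hop_sequential(tokens, k):
    """Apply the hop k times, one step after another, for every position."""
    out = []
    for i in range(len(tokens)):
        pos = i
        for _ in range(k):
            pos = find_once(tokens, pos)
        out.append(pos)
    return out


def compose(g, h):
    """Pointwise composition of two hop tables: first follow h, then g."""
    return [g[h[i]] if h[i] != -1 else -1 for i in range(len(h))]


def k_hop_doubling(tokens, k):
    """Compute the same k-fold hop using O(log k) table compositions
    (binary exponentiation of the one-hop table)."""
    n = len(tokens)
    base = [find_once(tokens, i) for i in range(n)]  # one-hop table
    result = list(range(n))                          # zero hops: identity
    while k > 0:
        if k & 1:
            result = compose(base, result)
        base = compose(base, base)                   # doubles the hop count
        k >>= 1
    return result


if __name__ == "__main__":
    tokens = list("abacab")
    print(k_hop_sequential(tokens, 2))
    print(k_hop_doubling(tokens, 2))
```

Each call to `compose` is a single data-parallel lookup over all positions, which is why the number of such rounds, rather than the total work, is the quantity of interest in this sketch.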

Transformers, parallel computation, and logarithmic depth

Representational Strengths and Limitations of Transformers

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Theoretical limitations of multi-layer Transformer

Transformers Can Do Arithmetic with the Right Embeddings

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Fundamental Limitations on Subquadratic Alternatives to Transformers

Learning Linear Attention in Polynomial Time

Transformers on Markov Data: Constant Depth Suffices

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers

Transformers are Universal In-context Learners

SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity

Average-Hard Attention Transformers are Constant-Depth Uniform Threshold Circuits

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers

A Theory for Compressibility of Graph Transformers for Transductive Learning

An Intrinsic Dimension Perspective of Transformers for Sequential Modeling