Abstract: Transformers, especially the decoder-only variants, are the backbone of most modern large language models; yet their expressive power is poorly understood beyond the simple $1$-layer case. Due to the difficulty of analyzing multi-layer models, all previous work relies on unproven complexity conjectures to show limitations for multi-layer Transformers. In this work, we prove the first $\textit{unconditional}$ lower bound against multi-layer decoder-only Transformers. For any constant $L$, we prove that any $L$-layer decoder-only Transformer needs a polynomial model dimension ($n^{\Omega(1)}$) to perform sequential composition of $L$ functions over an input of $n$ tokens. As a consequence, our results give: (1) the first depth-width trade-off for multi-layer Transformers, showing that the $L$-step composition task is exponentially harder for $L$-layer models than for $(L+1)$-layer ones; (2) an unconditional separation between encoder and decoder, exhibiting a task that is hard for decoders but can be solved by an exponentially shallower and smaller encoder; (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought. On the technical side, we propose the multi-party $\textit{autoregressive communication model}$, which captures the computation of a decoder-only Transformer. We also introduce a new proof technique for proving lower bounds in this model, which iteratively finds a certain $\textit{indistinguishable decomposition}$ of all possible inputs. We believe our new communication model and proof technique will help to further understand the computational power of Transformers.
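For intuition only, the sequential composition of $L$ functions mentioned above can be pictured with a toy sketch. It is not the paper's formal task definition: the lookup-table encoding, the name `compose_sequentially`, and the example tables are all assumptions made for illustration.

```python
from typing import Sequence


def compose_sequentially(functions: Sequence[Sequence[int]], start: int) -> int:
    """Apply each function in order: f_L(... f_2(f_1(start)) ...).

    Each function is a hypothetical lookup table over a finite domain
    {0, ..., m-1}; table[x] is the function's value at x.
    """
    value = start
    for table in functions:
        # Each step depends on the previous step's output, so the
        # L lookups cannot be reordered or evaluated independently.
        value = table[value]
    return value


# Three illustrative functions on the domain {0, 1, 2, 3} (L = 3).
f1 = [2, 0, 3, 1]
f2 = [1, 3, 0, 2]
f3 = [3, 2, 1, 0]

result = compose_sequentially([f1, f2, f3], start=0)
```

Here each lookup must wait for the previous one, which is the sequential dependency that the depth-width trade-off in the abstract is about.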

Theoretical limitations of multi-layer Transformer
