Abstract: We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data read/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, function calls, program counters, and conditional branches. Using these building blocks, we emulate a small instruction-set computer. This allows us to map iterative algorithms to programs that can be executed by a looped, 13-layer transformer. We show how this transformer, instructed by its input, can emulate a basic calculator, a basic linear algebra library, and in-context learning algorithms that employ backpropagation. Our work highlights the versatility of the attention mechanism, and demonstrates that even shallow transformers can execute full-fledged, general-purpose programs.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper explores how Transformer networks can be used as general-purpose computers. The authors hardcode specific weights and place the Transformer in a loop, which lets it execute algorithms and programs. The input sequence acts as a "punch card" that holds both the instructions and the memory used for data reads and writes. The paper demonstrates that a constant number of encoder layers can emulate basic computational blocks, such as embedding edit operations, nonlinear functions, function calls, program counters, and conditional branches. Using these blocks, the authors emulate a small instruction-set computer, so that iterative algorithms can be mapped to programs executed by a looped, 13-layer Transformer.

### Main Contributions

1. **Simulating Complex Algorithms**: The paper demonstrates that a Transformer with hardcoded weights, placed in a loop, can emulate complex algorithms and programs.
2. **Basic Computational Functions**: The authors construct Transformer networks that emulate a basic calculator, a basic linear algebra library (matrix transposition, multiplication, inversion, power iteration), and in-context learning algorithms that employ backpropagation.
3. **General-Purpose Computing Capability**: The paper shows that there exists a looped Transformer with fewer than 13 layers that can emulate a general-purpose computer, a basic calculator, numerical linear algebra methods, and in-context learning algorithms for neural networks.

### Methods

- **Loop Structure**: The output sequence of the Transformer is fed back into the network as its next input. Each pass through the loop updates the state stored in the sequence, so repeated passes carry out multi-step computations.
- **SUBLEQ Language**: The authors design a Transformer capable of executing SUBLEQ, a single-instruction language that defines a one-instruction set computer (OISC). A SUBLEQ instruction takes three memory-address operands: it subtracts the value at one address from the value at another, stores the result, and jumps to a target instruction if the result is less than or equal to zero.
- **FLEQ Instruction**: Extending SUBLEQ, the authors propose a more flexible single instruction, FLEQ, of the form `mem[c] = fm(mem[a], mem[b])`, which jumps to instruction `p` if `mem[flag] ≤ 0` and otherwise continues to the next instruction. Here `fm` is chosen from a set of functions (matrix multiplication, nonlinear functions, polynomials, etc.) that can be hardcoded into the network.

### Conclusion

The paper demonstrates that Transformer networks can emulate general-purpose computers, including the execution of mathematical and algorithmic tasks. By placing a Transformer with hardcoded weights in a loop, the authors obtain networks that emulate a basic calculator, linear algebra operations, and in-context learning algorithms. These results highlight the versatility of the attention mechanism and show that even shallow Transformers can execute general-purpose programs.
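
To make the SUBLEQ and FLEQ instructions described under Methods concrete, here is a minimal Python sketch of plain interpreters for both. It is not the paper's Transformer construction: it only illustrates the instruction semantics that the looped Transformer is built to emulate. The function names `run_subleq` and `run_fleq`, the tuple layouts, the separation of program and memory into two lists, the convention that execution halts when the program counter leaves the program, and the choice to read the FLEQ flag after the write are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encodings, chosen for this sketch only.
SubleqInstr = Tuple[int, int, int]                # (a, b, c)
FleqInstr = Tuple[str, int, int, int, int, int]   # (m, a, b, c, flag, p)


def run_subleq(program: List[SubleqInstr], mem: List[int],
               max_steps: int = 10_000) -> List[int]:
    """Run SUBLEQ: mem[b] -= mem[a]; jump to c if mem[b] <= 0."""
    mem = list(mem)  # work on a copy so the caller's memory is untouched
    pc = 0
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break  # assumed halting convention: pc left the program
        a, b, c = program[pc]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 1
    return mem


def run_fleq(program: List[FleqInstr], mem: List[float],
             functions: Dict[str, Callable[[float, float], float]],
             max_steps: int = 10_000) -> List[float]:
    """Run FLEQ: mem[c] = f_m(mem[a], mem[b]); jump to p if mem[flag] <= 0."""
    mem = list(mem)
    pc = 0
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break  # same assumed halting convention as above
        m, a, b, c, flag, p = program[pc]
        mem[c] = functions[m](mem[a], mem[b])
        # Assumption: the flag is read after the write above.
        pc = p if mem[flag] <= 0 else pc + 1
    return mem


if __name__ == "__main__":
    # Addition with two SUBLEQ instructions and a zeroed scratch cell mem[2]:
    #   mem[2] -= mem[0]   (scratch becomes -x)
    #   mem[1] -= mem[2]   (y becomes y + x)
    add_program = [(0, 2, 1), (2, 1, 2)]
    print(run_subleq(add_program, [3, 4, 0]))

    # One FLEQ instruction that multiplies mem[0] and mem[1] into mem[2].
    # mem[3] is a positive flag, so execution falls through and halts.
    funcs = {"mul": lambda x, y: x * y}
    print(run_fleq([("mul", 0, 1, 2, 3, 1)], [3, 4, 0, 1], funcs))
```

The two loops differ only in the update step: SUBLEQ always subtracts and branches on the value it just wrote, while FLEQ selects the operation by name and branches on a separately addressed flag cell. That is the sense in which FLEQ generalizes SUBLEQ.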

Looped Transformers as Programmable Computers
