Looped Transformers as Programmable Computers

Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos
DOI: https://doi.org/10.48550/arXiv.2301.13196
2023-01-31
Abstract: We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data read/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, function calls, program counters, and conditional branches. Using these building blocks, we emulate a small instruction-set computer. This allows us to map iterative algorithms to programs that can be executed by a looped, 13-layer transformer. We show how this transformer, instructed by its input, can emulate a basic calculator, a basic linear algebra library, and in-context learning algorithms that employ backpropagation. Our work highlights the versatility of the attention mechanism, and demonstrates that even shallow transformers can execute full-fledged, general-purpose programs.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper explores how a Transformer network can be made to act as a general-purpose computer. The authors program the network with specific weights and place it in a loop, so that it can execute iterative algorithms and programs. The input sequence acts as a "punchcard" that holds both the instructions and the memory used for data reads and writes. The paper shows that a constant number of encoder layers can emulate basic computing blocks, such as embedding edit operations, nonlinear functions, function calls, program counters, and conditional branches. Using these blocks, the authors emulate a small instruction-set computer, which allows iterative algorithms to be mapped to programs that a looped, 13-layer Transformer can execute.

### Main Contributions

1. **Simulating Complex Algorithms**: The paper demonstrates that a Transformer network with hardcoded weights, placed in a loop, can simulate complex algorithms and programs.
2. **Basic Computational Functions**: The authors construct looped Transformers that emulate a basic calculator, a basic linear algebra library (matrix transposition, multiplication, inversion, power iteration), and in-context learning algorithms that employ backpropagation.
3. **General-Purpose Computing Capability**: The paper proves that there exists a looped Transformer with fewer than 13 layers that can emulate a general-purpose computer, a basic calculator, numerical linear algebra methods, and in-context learning algorithms.

### Methods

- **Loop Structure**: The output sequence of the Transformer is fed back in as its input, forming a loop, so the network can iteratively update its state and carry out multi-step computations.
- **SUBLEQ Language**: The authors design a Transformer that executes SUBLEQ, a single-instruction language that defines a one-instruction set computer (OISC). A SUBLEQ instruction takes three memory address operands `a`, `b`, and `c`: it computes `mem[b] = mem[b] - mem[a]` and jumps to instruction `c` if the result is ≤ 0, otherwise continuing to the next instruction.
- **FLEQ Instruction**: Extending SUBLEQ, the authors propose a more flexible single instruction, FLEQ, of the form `mem[c] = fm(mem[a], mem[b])`, which jumps to instruction `p` if `mem[flag] ≤ 0` and otherwise continues to the next instruction. Here `fm` is chosen from a set of functions (matrix multiplication, nonlinear functions, polynomials, etc.) that can be hardcoded into the network.

### Conclusion

The paper demonstrates that Transformer networks can emulate general-purpose computers, including the execution of complex mathematical and algorithmic tasks. By placing a Transformer with hardcoded weights in a loop, the authors enable it to act as a basic calculator, perform linear algebra operations, and run in-context learning algorithms. These results highlight the versatility of the attention mechanism and show that even shallow Transformers can execute general-purpose programs.
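The SUBLEQ and FLEQ instructions described under Methods can be made concrete with a small interpreter. The following is a minimal Python sketch, not the paper's construction: it runs the instruction semantics directly instead of encoding them in attention weights. The names `run_subleq` and `run_fleq`, the tuple layouts, the separation of program from data memory, the `max_steps` cap, and the use of scalar integers in place of matrices are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def run_subleq(program: List[Tuple[int, int, int]], mem: List[int],
               max_steps: int = 1000) -> List[int]:
    """Run SUBLEQ instructions (a, b, c) until the program counter leaves the program."""
    mem = list(mem)  # work on a copy so the caller's memory is untouched
    pc = 0
    for _ in range(max_steps):  # one iteration stands in for one pass through the loop
        if not 0 <= pc < len(program):
            break
        a, b, c = program[pc]
        mem[b] -= mem[a]                    # mem[b] = mem[b] - mem[a]
        pc = c if mem[b] <= 0 else pc + 1   # branch on the result
    return mem


def run_fleq(program: List[Tuple[str, int, int, int, int, int]], mem: List[int],
             functions: Dict[str, Callable[[int, int], int]],
             max_steps: int = 1000) -> List[int]:
    """Run FLEQ-style instructions (m, a, b, c, flag, p); layout is hypothetical."""
    mem = list(mem)
    pc = 0
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break
        m, a, b, c, flag, p = program[pc]
        mem[c] = functions[m](mem[a], mem[b])  # mem[c] = fm(mem[a], mem[b])
        pc = p if mem[flag] <= 0 else pc + 1   # jump to p if mem[flag] <= 0
    return mem


# SUBLEQ: add mem[0] into mem[1], using mem[2] as a scratch cell that starts at 0.
add_program = [
    (0, 2, 1),  # scratch = -x
    (2, 1, 2),  # y = y - (-x) = y + x
    (2, 2, 3),  # scratch = 0, then jump past the end to halt
]
print(run_subleq(add_program, [3, 4, 0]))  # [3, 7, 0]

# FLEQ: compute x**3 by repeated multiplication.
# Memory layout (hypothetical): [x, acc, counter, one]
functions = {"mul": lambda u, v: u * v, "add": lambda u, v: u + v}
power_program = [
    ("mul", 1, 0, 1, 3, 0),  # acc = acc * x; flag mem[3] = 1 > 0, so fall through
    ("add", 2, 3, 2, 2, 0),  # counter += 1; loop back to 0 while counter <= 0
]
print(run_fleq(power_program, [2, 1, -2, 1], functions))  # [2, 8, 1, 1]
```

Each pass through the `for` loop corresponds loosely to one pass of the input sequence through the looped network: read the current instruction, update one memory location, and update the program counter. In the paper, `fm` may be matrix multiplication or a nonlinear function; scalar arithmetic is used here only to keep the sketch short.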