Abstract: Large language models based on the transformer architecture can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks that involve counting how many times a token in the vocabulary has appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size-limited transformer to implement this task. Our empirical results demonstrate the same phase transition in performance as anticipated by the theoretical argument. Our results demonstrate the importance of understanding how transformers can solve simple tasks.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper explores the capabilities and limitations of large-scale language models (LLMs) based on the Transformer architecture in solving simple counting tasks. Specifically, the paper focuses on a very simple "query counting" task, where given a sequence composed of tokens from a vocabulary, the model needs to count the number of times a query token appears in the sequence. For example:

Consider the sequence a a b b a c c d a. How many times does the letter "a" appear in this sequence?

Through theoretical analysis and empirical research, the authors investigate the counting ability of Transformers under different conditions. The main research questions include:

1. **When can Transformers count**: When the embedding dimension of the Transformer is greater than the vocabulary size, the Transformer can successfully solve the counting task.
2. **When can't Transformers count**: When the embedding dimension is less than or equal to the vocabulary size, the Transformer struggles to solve the counting task, and this difficulty is independent of the number of parameters in the model.
3. **Complexity of the counting task**: The paper also studies a slightly more complex task, the "Most Frequent Element" task, where given a sequence of tokens, the model needs to identify the most frequently occurring token and its count.

### Main Findings

1. **Embedding dimension greater than vocabulary size**: When the embedding dimension \( d \) is greater than the vocabulary size \( m \), the counting task can be solved by constructing a histogram.
2. **Embedding dimension less than vocabulary size**: When \( d < m \), the simple histogram method is no longer applicable, requiring more complex solutions. However, these solutions typically need larger multi-layer perceptrons (MLPs), and as the input length increases, these solutions become impractical.
3. **Consistency between theoretical and empirical results**: Both theoretical analysis and experimental results indicate that in the case of \( d < m \), the performance of Transformers in long-context tasks significantly declines.

### Significance of the Research

This research reveals the limitations of Transformers in handling simple counting tasks, emphasizing the importance of understanding the fundamental capabilities of the Transformer architecture. These findings not only contribute to a deeper understanding of how Transformers work but also highlight potential challenges in practical applications, especially when dealing with long input sequences. Additionally, the results underscore the advantage of using code as a tool to circumvent these issues.
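The histogram idea from the Main Findings can be illustrated with a small NumPy sketch. This is not the paper's construction, only a minimal illustration under stated assumptions: tokens are integer ids, each token gets its own orthogonal (one-hot) embedding in \( d = m + 1 \) dimensions, and a single uniform attention step averages the embeddings. The function name `histogram_count` and its parameters are hypothetical.

```python
import numpy as np


def histogram_count(tokens, query, vocab_size, d=None):
    """Count occurrences of `query` in `tokens` via an averaged one-hot readout.

    Illustrative sketch only: assumes integer token ids in [0, vocab_size)
    and an embedding dimension d that is at least the vocabulary size.
    """
    if d is None:
        d = vocab_size + 1  # assumption: d slightly larger than m
    if d < vocab_size:
        raise ValueError("orthogonal one-hot embeddings need d >= vocab_size")

    n = len(tokens)
    # Row i is the embedding of token i: the first m rows of a d x d identity.
    embeddings = np.eye(vocab_size, d)
    x = embeddings[np.asarray(tokens)]           # shape (n, d)

    # Uniform attention weights: every position receives weight 1/n.
    attn = np.full(n, 1.0 / n)
    avg = attn @ x                               # normalized histogram, shape (d,)

    # Project onto the query embedding and rescale by the sequence length.
    return int(round(n * float(avg @ embeddings[query])))


# The example sequence "a a b b a c c d a" with a=0, b=1, c=2, d=3.
sequence = [0, 0, 1, 1, 0, 2, 2, 3, 0]
print(histogram_count(sequence, query=0, vocab_size=4))  # prints 4
```

Because the embeddings are mutually orthogonal, the readout for one token is unaffected by the others. With \( d < m \), at most \( d \) vectors can be mutually orthogonal, so the same readout would pick up cross terms from other tokens, which is one way to see why the simple histogram method stops applying.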

When Can Transformers Count to n?

Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers

Counting Ability of Large Language Models and Impact of Tokenization

Transformers Can Represent $n$-gram Language Models

Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Understanding Transformers via N-gram Statistics

Language Models Need Inductive Biases to Count Inductively

On the Ability and Limitations of Transformers to Recognize Formal Languages

Can Transformers Learn $n$-gram Language Models?

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

Faith and Fate: Limits of Transformers on Compositionality

Transformers are Efficient Compilers, Provably

Transformers Can Do Arithmetic with the Right Embeddings

Transformers are Multi-State RNNs

A mathematical perspective on Transformers

Transformers are Universal In-context Learners

Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers

How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis

Understanding Transformer Reasoning Capabilities via Graph Algorithms

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems