When Can Transformers Count to n?

Gilad Yehudai,Haim Kaplan,Asma Ghandeharioun,Mor Geva,Amir Globerson
2024-10-07
Abstract: Large language models based on the transformer architecture can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks that involve counting how many times a token in the vocabulary has appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size-limited transformer to implement this task. Our empirical results demonstrate the same phase transition in performance, as anticipated by the theoretical argument. Our results demonstrate the importance of understanding how transformers can solve simple tasks.
Computation and Language,Artificial Intelligence,Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper explores the capabilities and limitations of large-scale language models (LLMs) based on the Transformer architecture in solving simple counting tasks. Specifically, the paper focuses on a very simple "query counting" task, where given a sequence composed of tokens from a vocabulary, the model needs to count the number of times a query token appears in the sequence. For example:

Consider the sequence a a b b a c c d a. How many times does the letter "a" appear in this sequence?

Through theoretical analysis and empirical research, the authors investigate the counting ability of Transformers under different conditions. The main research questions include:

1. **When can Transformers count**: When the embedding dimension of the Transformer is greater than the vocabulary size, the Transformer can successfully solve the counting task.
2. **When can't Transformers count**: When the embedding dimension is less than or equal to the vocabulary size, the Transformer struggles to solve the counting task, and this difficulty is independent of the number of parameters in the model.
3. **Complexity of the counting task**: The paper also studies a slightly more complex task, the "Most Frequent Element," where given a sequence of tokens, the model needs to identify the most frequently occurring token and its count.

### Main Findings

1. **Embedding dimension greater than vocabulary size**: When the embedding dimension \( d \) is greater than the vocabulary size \( m \), the counting task can be solved by constructing a histogram.
2. **Embedding dimension less than vocabulary size**: When \( d < m \), the simple histogram method is no longer applicable, requiring more complex solutions. However, these solutions typically need larger multi-layer perceptrons (MLPs), and as the input length increases, these solutions become impractical.
3. **Consistency between theoretical and empirical results**: Both theoretical analysis and experimental results indicate that in the case of \( d < m \), the performance of Transformers in long-context tasks significantly declines.

### Significance of the Research

This research reveals the limitations of Transformers in handling simple counting tasks, emphasizing the importance of understanding the fundamental capabilities of the Transformer architecture. These findings not only contribute to a deeper understanding of how Transformers work but also highlight potential challenges in practical applications, especially when dealing with long input sequences. Additionally, the results underscore the advantage of using code as a tool to circumvent these issues.
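
To make the histogram idea from the findings concrete, here is a minimal Python sketch of query counting when the embedding dimension can hold one coordinate per vocabulary token. It is an illustration under stated assumptions, not the authors' construction: the one-hot embeddings, the uniform averaging that stands in for an attention layer, and the names `one_hot` and `histogram_count` are all hypothetical choices made for this example.

```python
def one_hot(index, dim):
    """Return a one-hot vector of length `dim` (assumed embedding, needs dim >= vocabulary size)."""
    vec = [0.0] * dim
    vec[index] = 1.0
    return vec


def histogram_count(sequence, query, vocab):
    """Count occurrences of `query` by averaging one-hot embeddings.

    Uniform averaging over positions stands in for an attention layer
    with equal weights; the result is the token histogram divided by n.
    """
    m = len(vocab)
    token_to_index = {tok: i for i, tok in enumerate(vocab)}
    n = len(sequence)

    # "Attention" step: the mean of all token embeddings.
    mean = [0.0] * m
    for tok in sequence:
        emb = one_hot(token_to_index[tok], m)
        for j in range(m):
            mean[j] += emb[j] / n

    # "Readout" step: rescale by n and read the query's coordinate.
    return round(mean[token_to_index[query]] * n)


sequence = "a a b b a c c d a".split()
vocab = ["a", "b", "c", "d"]

count_a = histogram_count(sequence, "a", vocab)
print(count_a)  # 4

# Count of the most frequent element, read off the same histogram.
mfe_count = max(histogram_count(sequence, tok, vocab) for tok in vocab)
print(mfe_count)  # 4
```

The sketch depends on every token having its own orthogonal direction. When \( d < m \) the embeddings can no longer all be orthogonal, so the averaged vector mixes contributions from different tokens and the simple readout above stops being exact, which is the regime where the summary reports degraded performance.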