Abstract: Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature on children learning to count. The argument holds in both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, investigations of length generalization do occur throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automaton. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. In all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures including RNNs, Transformers, State-Space Models, and RWKV. We present carefully designed task formats, auxiliary tasks, and positional embeddings to avoid limitations in generalization with OOD positions and OOD vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose the merits of a recurrent nature.

What problem does this paper attempt to address?

The main problem this paper attempts to address is **the poor performance of Transformer models on out-of-distribution (OOD) counting tasks**. Specifically, the paper focuses on the following aspects:

1. **Inductive counting ability**:
   - The paper explores whether Transformer models can learn inductively to count beyond the lengths seen in the training data. Traditional RNN models achieve inductive counting easily, but Transformer models rely on positional embeddings to count on OOD inputs.
2. **Comparison of different architectures**:
   - The authors ran experiments on several neural network architectures, including RNNs, Transformers, State-Space Models, and RWKV, to compare their performance on counting tasks. The results show that modern RNN architectures perform worse than traditional RNNs at inductive counting, while Transformer models require specific positional embeddings to handle OOD counting effectively.
3. **Impact of positional embeddings**:
   - The paper investigates how different types of positional embeddings (such as SinePE, APE, RoPE, SPE, and NoPE) affect Transformer performance on counting tasks. The results indicate that certain positional embeddings (such as RoPE) perform poorly in inductive counting, while others (such as SinePE and APE) perform better.
4. **Modular and selective counting**:
   - In addition to basic counting tasks, the paper examines Transformer performance on modular counting and selective counting. Modular counting is cyclic counting over a limited set of counting states, while selective counting counts only the preceding items that meet a specific condition. The results show that Transformer models also have limitations on these two tasks, especially on OOD data.
5. **Experimental design and results**:
   - The paper designs several input-output formats and auxiliary tasks to overcome the problems of OOD positions and OOD vocabulary. The results indicate that while shallow Transformer models perform poorly in inductive counting, deeper Transformer models (for example, 4 layers) can generalize better with the help of certain positional embeddings.

In summary, this paper uses a series of experiments to reveal the limitations of Transformer models on inductive counting tasks and proposes directions for improvement, particularly in the design and selection of positional embeddings. These findings matter for understanding the computational capabilities of Transformer models and potential ways to improve them.
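To make the task definitions in the summary concrete, the following is a minimal Python sketch of target sequences for plain, modular, and selective counting. It is an illustration under stated assumptions, not the paper's actual data format: the function names, the choice to start counts at 1, the wrap-around convention for the modulus, and the decision to include the current item in the selective count are all hypothetical.

```python
from typing import Callable, List, Sequence


def inductive_counts(tokens: Sequence[str]) -> List[int]:
    """Plain counting: the target at position i is i + 1 (assumed 1-based)."""
    return [i + 1 for i in range(len(tokens))]


def modular_counts(tokens: Sequence[str], modulus: int) -> List[int]:
    """Cyclic counting: the count wraps around after `modulus` states.

    Assumption: states are 0..modulus-1 and the first item maps to 1.
    """
    return [(i + 1) % modulus for i in range(len(tokens))]


def selective_counts(
    tokens: Sequence[str], predicate: Callable[[str], bool]
) -> List[int]:
    """Selective counting: running count of items that satisfy `predicate`.

    Assumption: the current item is included in its own count.
    """
    counts: List[int] = []
    running = 0
    for tok in tokens:
        if predicate(tok):
            running += 1
        counts.append(running)
    return counts


if __name__ == "__main__":
    seq = list("abaab")
    print(inductive_counts(seq))
    print(modular_counts(seq, modulus=3))
    print(selective_counts(seq, lambda t: t == "a"))
```

In a "train short, test long" setup, a model would be trained on sequences up to some maximum length and evaluated on longer ones, so the targets produced by `inductive_counts` at test time would include values never seen during training. That is the OOD condition the summary refers to.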

Language Models Need Inductive Biases to Count Inductively

Counting Ability of Large Language Models and Impact of Tokenization

When Can Transformers Count to n?

Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers

What Algorithms can Transformers Learn? A Study in Length Generalization

Out-of-distribution generalization via composition: a lens through induction heads in Transformers

Injecting structural hints: Using language models to study inductive biases in language learning

Exploring the Long-Term Generalization of Counting Behavior in RNNs

Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models

Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically

LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems

Counting in Language with RNNs

Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers

Examining the Inductive Bias of Neural Language Models with Artificial Languages

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases

How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad

Towards Understanding Inductive Bias in Transformers: A View From Infinity