Language Models Need Inductive Biases to Count Inductively

Yingshan Chang, Yonatan Bisk
2024-10-25
Abstract: Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, investigating length generalization does occur throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automaton. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures including RNNs, Transformers, State-Space Models and RWKV. We present carefully designed task formats, auxiliary tasks and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose merits of a recurrent nature.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper attempts to address is **the poor performance of Transformer models on out-of-distribution (OOD) counting tasks**. Specifically, the paper focuses on the following aspects:

1. **Inductive counting ability**: The paper explores whether Transformer models can learn to count inductively, that is, to count beyond the lengths seen in the training data. Traditional RNN models achieve inductive counting easily, whereas Transformer models have to rely on positional embeddings to count on OOD inputs.
2. **Comparison of different architectures**: The authors run experiments on several neural network architectures, including RNNs, Transformers, State-Space Models, and RWKV, to compare their performance on counting tasks. The results show that modern RNN architectures generalize worse than traditional RNNs in inductive counting, while Transformer models require specific positional embeddings to handle OOD counting effectively.
3. **Impact of positional embeddings**: The paper investigates how different types of positional embeddings (such as SinePE, APE, RoPE, SPE, and NoPE) affect the performance of Transformer models on counting tasks. The results indicate that certain positional embeddings (such as RoPE) perform poorly in inductive counting, while others (such as SinePE and APE) perform better.
4. **Modular and selective counting**: In addition to basic counting, the paper explores the performance of Transformer models on modular counting and selective counting. Modular counting means counting cyclically over a limited set of counting states, while selective counting means counting only the preceding items that meet a specific condition. The experimental results show that Transformer models also have limitations on these two tasks, especially on OOD data.
5. **Experimental design and results**: The paper designs various input-output formats and auxiliary tasks to overcome the issues of OOD positions and OOD vocabulary. The experimental results indicate that shallow Transformer models perform poorly in inductive counting, while deeper Transformer models (for example, 4 layers) can generalize better with the help of certain positional embeddings.

In summary, this paper uses a series of experiments to reveal the limitations of Transformer models on inductive counting tasks, and it points to directions for improvement, particularly in the design and selection of positional embeddings. These findings matter for understanding the computational capabilities of Transformer models and potential ways to improve them.
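To make the three counting variants mentioned above (basic, modular, and selective) concrete, the following is a minimal Python sketch of the target sequences each variant would produce for a given input. The function names, the token representation, and the choice to start counting at 1 are illustrative assumptions, not the paper's actual task format or code.

```python
from typing import Callable, List, Sequence


def basic_counts(tokens: Sequence[str]) -> List[int]:
    """Running count of all tokens seen so far: 1, 2, 3, ..."""
    return list(range(1, len(tokens) + 1))


def modular_counts(tokens: Sequence[str], modulus: int) -> List[int]:
    """Running count that cycles through a fixed number of states.

    The count wraps back to 0 every `modulus` tokens (assumed convention).
    """
    return [i % modulus for i in range(1, len(tokens) + 1)]


def selective_counts(
    tokens: Sequence[str], predicate: Callable[[str], bool]
) -> List[int]:
    """Running count of only those tokens that satisfy `predicate`."""
    counts: List[int] = []
    running = 0
    for tok in tokens:
        if predicate(tok):
            running += 1
        counts.append(running)
    return counts


if __name__ == "__main__":
    seq = ["a", "b", "a", "a", "b"]
    print(basic_counts(seq))                          # [1, 2, 3, 4, 5]
    print(modular_counts(seq, 3))                     # [1, 2, 0, 1, 2]
    print(selective_counts(seq, lambda t: t == "a"))  # [1, 1, 2, 3, 3]
```

In a "train short, test long" setup, a model would be trained on sequences up to some length and evaluated on longer ones, so the basic and selective targets at test time include count values never seen during training, whereas the modular targets stay within a fixed range.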