Abstract: Large language models (LLMs) have achieved remarkable proficiency in solving diverse problems. However, their generalization ability is not always satisfactory, and the generalization problem is common to generative transformer models in general. Researchers use basic mathematical tasks such as n-digit addition or multiplication as important lenses for investigating generalization behavior. It is observed that when models are trained on n-digit operations (e.g., addition) in which both input operands are n digits long, they generalize successfully to unseen n-digit inputs (in-distribution (ID) generalization) but fail miserably on longer, unseen cases (out-of-distribution (OOD) generalization). We draw attention to this unexplained performance drop and ask whether there is systematic OOD generalization. Towards understanding LLMs, we train various smaller language models that may share the same underlying mechanism. We discover that the strong ID generalization stems from structured representations, while behind the unsatisfactory OOD performance, the models still exhibit clear learned algebraic structures. Specifically, these models map unseen OOD inputs to outputs with learned equivalence relations in the ID domain, which we call equivalence generalization. These findings deepen our knowledge of the generalizability of generative models, including LLMs, and provide insights into potential avenues for improvement.
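The ID/OOD setup described in the abstract can be made concrete with a small data-generation sketch. This is only an illustration under stated assumptions: the prompt format `"a+b="`, the choice `n = 3`, the dataset sizes, and the helper names `sample_operand` and `make_addition_examples` are all hypothetical and are not taken from the paper.

```python
import random


def sample_operand(num_digits, rng):
    """Sample an integer with exactly `num_digits` digits (no leading zero)."""
    low = 10 ** (num_digits - 1)
    high = 10 ** num_digits - 1
    return rng.randint(low, high)


def make_addition_examples(num_digits, count, rng):
    """Build (prompt, target) string pairs where both operands have `num_digits` digits."""
    examples = []
    for _ in range(count):
        a = sample_operand(num_digits, rng)
        b = sample_operand(num_digits, rng)
        examples.append((f"{a}+{b}=", str(a + b)))
    return examples


rng = random.Random(0)  # fixed seed so the split is reproducible
n = 3                   # assumed operand length used for training

# Training data: n-digit operands only.
train_set = make_addition_examples(n, 1000, rng)
train_lookup = set(train_set)

# ID test: same operand length, but only pairs that were not in the training set.
id_test = [ex for ex in make_addition_examples(n, 200, rng) if ex not in train_lookup]

# OOD test: operands one digit longer than anything seen in training.
ood_test = make_addition_examples(n + 1, 200, rng)
```

A model trained on `train_set` would then be scored separately on `id_test` and `ood_test`; the gap between the two scores is the performance drop the abstract refers to.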

What problem does this paper attempt to address?

The paper examines how well large language models (LLMs) generalize when performing mathematical operations, especially on tasks that fall outside the training data distribution (out-of-distribution, OOD). Specifically:

1. **Research Background**: Although large language models excel in areas such as natural language processing, code translation, and mathematical reasoning, their generalization on certain tasks is unsatisfactory, particularly when inputs are longer than those seen in training.
2. **Core Issue**: When models are trained to perform n-digit addition or multiplication, they perform well on unseen n-digit inputs (in-distribution, ID, generalization). However, their performance drops sharply on longer, unseen inputs (the OOD generalization problem).
3. **Research Findings**:
   - Despite the poor OOD performance, the models still exhibit clear learned algebraic structures. They map unseen OOD inputs to outputs through equivalence relations learned in the ID domain, a phenomenon the paper calls "equivalence generalization."
   - Models achieve ID generalization by learning structured representations. OOD generalization fails because these representations are inadvertently extended to OOD inputs, which produces systematic errors rather than random ones.
4. **Contributions**:
   - Provides a mechanistic, empirical evaluation method: training small generative language models to directly examine the differences between ID and OOD generalization.
   - Identifies the structures that models learn during OOD generalization, which helps in proposing more robust solutions.
   - Clarifies the role of representation learning in generalization and demonstrates how it affects model performance on OOD inputs.

In summary, the paper aims to reveal the fundamental reasons for the weak OOD generalization of large language models on mathematical operations and, by studying the models' internal mechanisms, offers new insights for improving their generalization ability.
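The distinction between systematic and random OOD errors can be sketched as a simple check. The summary does not state the exact equivalence relation, so the code assumes one purely for illustration: an operand longer than n digits is treated as equivalent to its lowest n digits. The function names and `toy_model` (a stand-in for a trained model) are hypothetical.

```python
def truncate_to_id(x, n):
    """Assumed equivalence: keep only the lowest n digits of x."""
    return x % (10 ** n)


def equivalence_prediction(a, b, n):
    """Output expected if the model maps each OOD operand to its ID equivalent."""
    return truncate_to_id(a, n) + truncate_to_id(b, n)


def toy_model(a, b, n=3):
    """Hypothetical stand-in for a model trained on n-digit addition.

    It behaves exactly as the assumed equivalence predicts, so it is correct
    on ID inputs and systematically wrong on longer ones.
    """
    return equivalence_prediction(a, b, n)


def error_profile(model, pairs, n):
    """Count correct answers, errors explained by the equivalence, and other errors."""
    profile = {"correct": 0, "systematic": 0, "other": 0}
    for a, b in pairs:
        out = model(a, b)
        if out == a + b:
            profile["correct"] += 1
        elif out == equivalence_prediction(a, b, n):
            profile["systematic"] += 1
        else:
            profile["other"] += 1
    return profile


# Two OOD pairs (4-digit operands) and one ID pair (3-digit operands).
pairs = [(1234, 5678), (4321, 1111), (120, 345)]
profile = error_profile(toy_model, pairs, n=3)
# For (1234, 5678) the toy model returns 234 + 678 = 912 instead of 6912.
print(profile)
```

With a real model in place of `toy_model`, a large "systematic" count relative to "other" would be consistent with structured rather than random OOD failures, under the assumed relation.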

It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models
