Abstract: It has been observed in recent years that transformers have problems with length generalization for certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on a task (say, addition) up to a certain length (e.g., 5-digit numbers) drops sharply when applied to longer instances of the same problem. This work proposes an approach based on task hinting to address length generalization. Our key idea is that while training the model on task-specific data, it is helpful to simultaneously train the model to solve a simpler but related auxiliary task. We study the classical sorting problem as a canonical example to evaluate our approach. We design a multitask training framework and show that task hinting significantly improves length generalization. For sorting, we show that it is possible to train models on data consisting of sequences of length at most $20$ and improve the test accuracy on sequences of length $100$ from less than 1% (for standard training) to more than 92% (via task hinting). Our study uncovers several interesting aspects of length generalization. We observe that while several auxiliary tasks may seem natural a priori, their effectiveness in improving length generalization differs dramatically. We further use probing and visualization-based techniques to understand the internal mechanisms by which the model performs the task, and propose a theoretical construction consistent with the observed learning behaviors of the model. Based on our construction, we show that introducing a small number of length-dependent parameters into the training procedure can further boost the performance on unseen lengths. Finally, we also show the efficacy of our task-hinting-based approach beyond sorting, giving hope that these techniques will be applicable in broader contexts.
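The multitask setup described in the abstract can be illustrated with a minimal data-generation sketch. The abstract does not specify which auxiliary task is used, so the "successor" task below (given a sequence and one of its elements, output the next larger element) is purely an assumption chosen for illustration. The function names, the vocabulary size, and the `hint_fraction` mixing ratio are likewise hypothetical and not taken from the paper.

```python
import random

def make_sort_example(rng, length, vocab=100):
    """Main task: map a random sequence to its sorted version."""
    seq = [rng.randrange(vocab) for _ in range(length)]
    return {"task": "sort", "input": seq, "target": sorted(seq)}

def make_hint_example(rng, length, vocab=100):
    """Hypothetical auxiliary task (an assumption, not the paper's stated choice):
    given a sequence and a query element drawn from it, output the smallest
    element strictly larger than the query, or the query itself if none exists."""
    seq = [rng.randrange(vocab) for _ in range(length)]
    query = rng.choice(seq)
    larger = [x for x in seq if x > query]
    target = min(larger) if larger else query
    # The query is appended to the input so the model can see it.
    return {"task": "successor", "input": seq + [query], "target": [target]}

def mixed_batch(rng, batch_size, max_len=20, hint_fraction=0.5):
    """Build one multitask batch mixing main-task and auxiliary-task examples.
    Training lengths are capped at max_len (20 in the abstract's sorting setup)."""
    batch = []
    for _ in range(batch_size):
        length = rng.randint(2, max_len)
        make = make_hint_example if rng.random() < hint_fraction else make_sort_example
        batch.append(make(rng, length))
    return batch
```

In such a setup, each example would be serialized with a task tag and fed to a single shared model, and length generalization would then be measured on main-task sequences longer than `max_len` (for instance, length 100).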

Arbitrary-Length Generalization for Addition in a Tiny Transformer

Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers

Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Transformers Can Achieve Length Generalization But Not Robustly

Looped Transformers for Length Generalization

Universal Length Generalization with Turing Programs

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

What Algorithms can Transformers Learn? A Study in Length Generalization

Transformers discover an elementary calculation system exploiting local attention and grid-like problem representation

Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks

Transformers Can Do Arithmetic with the Right Embeddings

Improving Length-Generalization in Transformers via Task Hinting

Teaching Arithmetic to Small Transformers

Increasing transformer token length with a Maximum Entropy Principle Method

Generalizable Memory-driven Transformer for Multivariate Long Sequence Time-series Forecasting

It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models

Understanding Addition in Transformers

Teaching Transformers Modular Arithmetic at Scale

Positional Description Matters for Transformers Arithmetic

Carrying over algorithm in transformers