Abstract: The attention mechanism is central to the transformer's ability to capture complex dependencies between tokens of an input sequence. Key to the successful application of the attention mechanism in transformers is the choice of positional encoding (PE). The PE provides essential information that distinguishes the position and order of tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those found in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real-world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known, as in biological data. Here we study the importance of learning accurate PEs for problems that rely on a non-trivial arrangement of input tokens. Critically, we find that the initialization of a learnable PE greatly influences whether it discovers an accurate PE that leads to enhanced generalization. We empirically demonstrate our findings in a 2D relational reasoning task and a real-world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions, 2) learn non-trivial and modular PEs in a real-world neuroscience dataset, and 3) lead to improved downstream generalization in both datasets. Importantly, choosing an ill-suited PE can be detrimental to both model interpretability and generalization. Together, our results illustrate the feasibility of discovering accurate PEs for enhanced generalization.

Learning positional encodings in transformers depends on initialization
