Round and Round We Go! What makes Rotary Positional Encodings useful?

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković
2024-10-09
Abstract: Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs are Rotary Positional Encodings (RoPE), that rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step in better understanding PEs in LLMs, which we believe holds crucial value for scaling LLMs to large sizes and context lengths.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper aims to explain how Rotary Positional Encodings (RoPE) actually operate inside Transformer models. Although RoPE is widely used in large language models (LLMs), the reasons for its effectiveness remain unclear. Through an in-depth study of the pre-trained Gemma 7B model, the paper explores the following questions:

1. **Does RoPE help the model by attenuating the attention coefficients?** The paper challenges the common view that RoPE helps by attenuating the attention coefficients as relative distance increases. The authors give theoretical and experimental evidence that this attenuation is unlikely to be RoPE's core advantage.
2. **The role of different frequencies.** The study finds that Gemma 7B tends to use the low-frequency components of RoPE to carry semantic information, while the high-frequency components are used to construct robust positional attention patterns. Different frequencies therefore play different roles in the model.
3. **How do the high frequencies construct positional attention patterns?** The authors prove that RoPE can use its high frequencies to build "positional" attention heads that attend to specific relative positions without relying on semantic information. With no positional encoding (NoPE), this behaviour cannot be achieved.
4. **How do the low frequencies handle semantic information?** The low-frequency components change little as relative distance varies, which makes them better suited to carrying semantic information. Over very long contexts, however, even these components rotate enough to lose that stability. The authors therefore propose p-RoPE, which creates more robust semantic channels by truncating the lowest frequencies, and report experiments in which it improves performance.
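
The frequency roles described above can be illustrated with a minimal sketch of RoPE and a p-RoPE-style truncation. This is not the authors' implementation. It assumes the conventional RoPE frequencies `base ** (-2k / d)` with `base = 10000`, rotation of consecutive coordinate pairs, and a hypothetical parameter `p` read as the fraction of highest frequencies that are kept, with the remaining lowest frequencies set to zero so that those channels are not rotated. All function names are illustrative.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Angular frequencies base**(-2k/d) for k = 0 .. d/2 - 1, highest first."""
    return [base ** (-2.0 * k / head_dim) for k in range(head_dim // 2)]


def p_rope_frequencies(head_dim, p, base=10000.0):
    """Assumed p-RoPE variant: keep the highest fraction p of the frequencies
    and zero out the lowest ones, so those channels ignore position."""
    freqs = rope_frequencies(head_dim, base)
    keep = round(p * len(freqs))
    return [f if k < keep else 0.0 for k, f in enumerate(freqs)]


def rotate(vec, position, freqs):
    """Rotate each pair (vec[2k], vec[2k+1]) by the angle position * freqs[k]."""
    out = []
    for k, theta in enumerate(freqs):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(position * theta), math.sin(position * theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(query, key, query_pos, key_pos, freqs):
    """Unnormalised attention logit between a rotated query and a rotated key."""
    rq = rotate(query, query_pos, freqs)
    rk = rotate(key, key_pos, freqs)
    return sum(a * b for a, b in zip(rq, rk))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
    k = [1.0, 0.2, -0.5, 0.8, 0.6, -0.3, 0.4, 0.7]
    full = rope_frequencies(8)
    truncated = p_rope_frequencies(8, 0.75)
    # The logit depends only on the relative distance (3 in both calls).
    print(score(q, k, 5, 2, full), score(q, k, 103, 100, full))
    # With the lowest frequency removed, the last pair contributes the same
    # amount at any distance.
    print(score(q, k, 5, 2, truncated), score(q, k, 5000, 2, truncated))
```

In this sketch, a channel whose frequency is zero contributes the same dot product at every distance, which is the sense in which truncating the lowest frequencies yields a position-independent semantic channel. Setting `p = 0` removes all rotation and reduces to NoPE, while `p = 1` recovers standard RoPE.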
### Main contributions

- **Refuting a common assumption**: Through theory and experiments, the paper argues that attenuating the attention coefficients with distance is unlikely to be the main reason RoPE is useful.
- **Revealing how frequencies are used**: The paper identifies which RoPE frequencies Gemma 7B prefers and the roles they play in constructing attention patterns.
- **Proposing an improvement**: The p-RoPE method is introduced to address the instability of the low-frequency components in long contexts and to improve model performance.

### Summary

Through an in-depth analysis of a trained model, the paper shows how RoPE is actually used inside a Transformer and proposes a modification to it, offering a new perspective for understanding and optimizing large language models.