When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang
2024-11-21
Abstract: Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the breakdown of the relative positional encoding property of Rotary Positional Embedding (RoPE) when long-context training uses the BFloat16 floating-point format. Because BFloat16 has limited precision, RoPE deviates from its intended relative encoding in long sequences, and the first token contributes significantly to that deviation. As the training window grows, the numerical errors accumulate and the problem worsens.

To address this, the authors propose **AnchorAttention**. It treats the first token as a shared anchor with a consistent position ID and makes it visible to all documents within the training context, which reduces unnecessary attention computation, maintains semantic coherence, and improves computational efficiency. The method alleviates the numerical issues caused by BFloat16, improves long-context performance, and reduces training time by more than 50% compared to standard full attention.

### Main contributions:

1. **Discovering the influence of BFloat16 on RoPE's relative position encoding**: Under BFloat16 precision, RoPE's relative positional encoding property breaks down, especially in long-context scenarios.
2. **Identifying the contribution of the first token to RoPE's relative property**: The relative property should be preserved in theory, but in practice the first token contributes significantly to its deviation, and the deviation grows with the training window size.
3. **Introducing the AnchorAttention method**: A practical method that improves the model's long-context capability, reduces training time, and requires minimal changes to the existing training pipeline.

### Method overview:

- **Background**: Modern large language models (LLMs) are mainly based on the Transformer architecture, in which the attention mechanism is a core component. RoPE applies rotation matrices to the query and key vectors to achieve efficient relative positional encoding.
- **Problem analysis**: The authors observe that the low precision of BFloat16 causes RoPE's relative positional encoding property to fail in long contexts. The first token contributes significantly to the deviation, which accumulates as the sequence length increases.
- **Solution**: AnchorAttention uses the first token as a shared anchor with a consistent position ID, visible to all documents in the training context. This reduces unnecessary attention computation, maintains semantic coherence, and improves computational efficiency.

### Experimental results:

- **Performance improvement**: Models trained with AnchorAttention improve on long-context benchmarks (such as RULER and LongBench) while preserving their capabilities on general tasks (such as MMLU and HellaSwag).
- **Reduction in training time**: AnchorAttention reduces training time by more than 50% compared to standard full attention.

In conclusion, the paper mitigates the effect of BFloat16 on RoPE's relative positional encoding in long-context training by proposing AnchorAttention, offering a practical approach to training long-context models.
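
The precision issue can be illustrated with a small, self-contained sketch. In exact arithmetic, the RoPE attention score between a query at position m and a key at position n depends only on m - n, so shifting both positions by the same amount should leave the score unchanged. The sketch below is not the paper's experiment: it emulates BFloat16 by rounding float32 bit patterns, rounds only the rotated query and key (an assumed simplification of what a real mixed-precision kernel does), and uses hypothetical helper names (`to_bf16`, `rope`, `score`, `relative_drift`) and arbitrary vectors and positions.

```python
import math
import random
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (ties to even), emulated via float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for round-to-nearest-even
    bits &= 0xFFFF0000                   # keep sign, exponent, and top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return out


def score(q, k, m, n, cast=lambda x: x):
    """Dot product of the rotated query (position m) and rotated key (position n)."""
    qm = [cast(x) for x in rope(q, m)]
    kn = [cast(x) for x in rope(k, n)]
    return sum(a * b for a, b in zip(qm, kn))


def relative_drift(q, k, m, n, shift, cast=lambda x: x):
    """Change in the score when both positions are shifted by the same amount."""
    return abs(score(q, k, m + shift, n + shift, cast) - score(q, k, m, n, cast))


rng = random.Random(0)
dim = 64  # illustrative head dimension
q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
k = [rng.gauss(0.0, 1.0) for _ in range(dim)]

m, n = 40, 0                   # the key sits at the first position
shifts = [1024, 8192, 65536]   # arbitrary position offsets
drift_fp64 = max(relative_drift(q, k, m, n, s) for s in shifts)
drift_bf16 = max(relative_drift(q, k, m, n, s, cast=to_bf16) for s in shifts)
print(f"float64 drift:  {drift_fp64:.3e}")
print(f"bfloat16 drift: {drift_bf16:.3e}")
```

In float64 the drift stays near machine precision, while the emulated BFloat16 rounding makes the score depend on the absolute positions as well as on their difference.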
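
The shared-anchor idea can be sketched as an attention mask. This is one possible reading of the description above, not the authors' implementation (their code is in the linked repository). It assumes that token 0 is the anchor, that the packed documents follow it back to back, and that each token attends causally within its own document plus the anchor. The function name `anchor_attention_mask` and the `doc_lengths` argument are hypothetical, and position ID handling is not shown.

```python
def anchor_attention_mask(doc_lengths):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed layout: token 0 is the shared anchor, followed by the
    documents in order, with doc_lengths[d] tokens in document d.
    """
    doc_id = [-1]  # the anchor belongs to no document
    for d, length in enumerate(doc_lengths):
        doc_id.extend([d] * length)
    n = len(doc_id)
    return [
        # causal, and either the anchor column or the same document
        [j <= i and (j == 0 or doc_id[i] == doc_id[j]) for j in range(n)]
        for i in range(n)
    ]


# Usage: one anchor token followed by two packed documents of 2 and 3 tokens.
mask = anchor_attention_mask([2, 3])
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, tokens in the second document see the anchor but nothing from the first document, so fewer query-key pairs are computed than under full causal attention over the whole window.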