LieRE: Generalizing Rotary Position Encodings

Sophie Ostmeier, Brian Axelrod, Michael E. Moseley, Akshay Chaudhari, Curtis Langlotz
2024-10-18
Abstract: While Rotary Position Embeddings (RoPE) for large language models have become widely adopted, their application to other modalities has been slower. Here, we introduce Lie group Relative position Encodings (LieRE), which go beyond RoPE by supporting n-dimensional inputs. We evaluate the performance of LieRE on 2D and 3D image classification tasks and observe that LieRE leads to marked relative improvements in performance (up to 9.7% for 2D and up to 25.5% for 3D), training efficiency (3.5x reduction), and data efficiency (30%) compared to the baselines of DeiT III, RoPE-Mixed and Vision-Llama. Code: https://github.com/Stanford-AIMI/LieRE
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to encode relative position information effectively across modalities, particularly for 2D and 3D data. It proposes a new position encoding method, Lie group Relative position Encodings (LieRE), to overcome the limitations of existing methods such as RoPE on higher-dimensional inputs.

#### Background and Motivation

1. **Position Information in the Transformer Architecture**:
   - The attention mechanism in the Transformer does not itself depend on the order of its input tokens (it is permutation-equivariant), so an additional mechanism is needed to capture token positions.
   - Commonly used position encoding methods include absolute, relative, and contextual position encodings.
2. **Limitations of Existing Methods**:
   - Rotary Position Embeddings (RoPE) are a successful position encoding method that is especially well suited to one-dimensional sequence data such as text. They are less effective on higher-dimensional data such as images and videos, which has limited their range of application.
   - Because RoPE was originally designed for one-dimensional sequences, it extends less naturally to two- or three-dimensional data, especially data with a time dimension (such as videos).

#### Research Objectives

The main objective of the paper is to develop a general-purpose position encoding scheme that works effectively across modalities (such as 2D images and 3D videos). Specifically:

- **Expand the Application Range of RoPE**: With LieRE, the position encoding supports n-dimensional inputs rather than being limited to one-dimensional sequence data.
- **Improve Performance**: On 2D and 3D classification tasks, LieRE shows marked relative improvements over the baselines (up to 9.7% on 2D tasks and up to 25.5% on 3D tasks).
- **Improve Training Efficiency and Data Efficiency**: LieRE not only improves classification accuracy but also reduces the compute and data required for training. For example, on CIFAR100, LieRE reaches the same accuracy in 3.5 times fewer training steps and can outperform the baseline with only 70% of the training data.

#### Main Contributions

- **Introduce LieRE**: LieRE is a position encoding method based on Lie group theory that learns how to use the relative spatial information of the input.
- **Cross-Modal Applicability**: LieRE applies not only to 2D images but also to higher-dimensional data such as 3D videos, demonstrating its wide applicability.
- **Simplify Model Structure**: LieRE allows a simpler general-purpose model backbone to handle multiple tasks, which simplifies model design.

Through these improvements, LieRE provides a more efficient and flexible position encoding scheme for processing multimodal data.
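
The summary above does not spell out how the encoding is computed. As a rough illustration of what a Lie-group-based rotary encoding for n-dimensional positions can look like, the sketch below assumes (this is an assumption, not something stated above) that each position coordinate is paired with a skew-symmetric generator matrix, that the generators are combined linearly with the position, and that the matrix exponential of the result is used to rotate query and key vectors. The generators here are random stand-ins for what would be learned parameters, and all function and variable names are hypothetical.

```python
import numpy as np


def skew(raw):
    """Turn an arbitrary square matrix into a skew-symmetric one (A^T = -A)."""
    return raw - raw.T


def expm_skew(a):
    """Matrix exponential of a real skew-symmetric matrix.

    1j * a is Hermitian, so eigh gives a = V diag(-1j * w) V^H and
    exp(a) = V diag(exp(-1j * w)) V^H, which is a real rotation matrix.
    """
    w, v = np.linalg.eigh(1j * a)
    return ((v * np.exp(-1j * w)) @ v.conj().T).real


def rotation_for_position(generators, position):
    """Rotation exp(sum_i position[i] * generators[i]) for an n-D position.

    generators: array of shape (n, d, d), one skew-symmetric matrix per axis.
    position:   array of shape (n,), e.g. (x, y, z) of a 3D patch.
    """
    a = np.tensordot(position, generators, axes=1)  # shape (d, d)
    return expm_skew(a)


# Hypothetical setup: 3 position axes, 8-dimensional attention head.
rng = np.random.default_rng(0)
n_dims, head_dim = 3, 8
generators = np.stack(
    [0.1 * skew(rng.normal(size=(head_dim, head_dim))) for _ in range(n_dims)]
)

pos = np.array([1.0, 2.0, 0.5])
rot = rotation_for_position(generators, pos)

q = rng.normal(size=head_dim)
q_rot = rot @ q  # position-encoded query; keys would be rotated the same way

# With a single generator (the 1D case), rotations at two positions compose
# into a rotation that depends only on the position difference.
g1 = generators[:1]
r_p = rotation_for_position(g1, np.array([2.0]))
r_q = rotation_for_position(g1, np.array([5.0]))
r_rel = rotation_for_position(g1, np.array([3.0]))
```

Because each matrix is a rotation, the encoded query keeps its norm. In the 1D case shown at the end, `r_p.T @ r_q` equals `r_rel`, so the dot product between a rotated query and a rotated key depends only on their relative position; the same identity holds whenever the generators commute.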