Abstract: While Rotary Position Embeddings (RoPE) for large language models have become widely adopted, their application to other modalities has been slower. Here, we introduce Lie group Relative position Encodings (LieRE), which goes beyond RoPE by supporting n-dimensional inputs. We evaluate LieRE on 2D and 3D image classification tasks and observe marked relative improvements in performance (up to 9.7% for 2D and up to 25.5% for 3D), training efficiency (3.5x reduction), and data efficiency (30%) compared to the baselines of DeiT III, RoPE-Mixed and Vision-Llama. Code: https://github.com/Stanford-AIMI/LieRE

What problem does this paper attempt to address?

This paper addresses how to encode relative position information effectively across modalities, particularly 2D and 3D data. It proposes a new position encoding method, Lie group Relative position Encodings (LieRE), to overcome the limitations of existing methods such as RoPE on higher-dimensional inputs.

#### Background and Motivation

1. **Position Information in the Transformer Architecture**:
   - Self-attention has no built-in notion of token order, so an additional mechanism is required to capture the positions of the input tokens.
   - Commonly used position encoding methods include absolute, relative, and contextual position encodings.
2. **Limitations of Existing Methods**:
   - Rotary Position Embeddings (RoPE) is a successful position encoding method that was originally designed for one-dimensional sequence data such as text.
   - That design makes RoPE less effective on two- or three-dimensional data such as images, and especially on data with a time dimension such as videos, which limits its range of application.

#### Research Objectives

The main objective of the paper is to develop a general-purpose position encoding scheme that works effectively across modalities (such as 2D images and 3D videos). Specifically:

- **Expand the Application Range of RoPE**: With LieRE, the position encoding supports n-dimensional inputs rather than being limited to one-dimensional sequences.
- **Improve Performance**: On 2D and 3D classification tasks, LieRE shows marked relative improvements (up to 9.7% on 2D tasks and up to 25.5% on 3D tasks).
- **Improve Training Efficiency and Data Efficiency**: Beyond classification accuracy, LieRE reduces the compute and data required for training. For example, on CIFAR100, LieRE needs 3.5x fewer training steps to reach the same accuracy, and it can outperform the baseline model with only 70% of the data.

#### Main Contributions

- **Introduce LieRE**: LieRE is a position encoding method based on Lie group theory that learns how to use the relative spatial information of the input.
- **Cross-Modal Applicability**: LieRE applies not only to 2D images but also to higher-dimensional data such as 3D videos.
- **Simplify Model Structure**: LieRE allows a simpler, general-purpose model backbone to handle multiple tasks, which simplifies model design.

Through these improvements, LieRE provides a more efficient and flexible position encoding scheme for processing multimodal data.
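The relative-position property that rotary encodings rely on, and that an n-dimensional scheme has to preserve, can be illustrated with a small sketch. This is not the authors' implementation: the summary does not describe LieRE's construction, so the code assumes the simplest rotation-based case, in which each pair of feature channels is rotated by an angle that is linear in the n-dimensional position. The names `rotate`, `score`, and `freqs` are hypothetical, and the frequencies are random here rather than learned.

```python
import math
import random


def rotate(vec, pos, freqs):
    """Rotate consecutive channel pairs of `vec` by angles linear in `pos`.

    vec:   list of floats with even length d
    pos:   n-dimensional position, e.g. (row, col) or (x, y, t)
    freqs: d/2 rows of n frequencies (assumed learned; random in this sketch)
    """
    out = []
    for i, row in enumerate(freqs):
        angle = sum(f * p for f, p in zip(row, pos))
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [c * x - s * y, s * x + c * y]
    return out


def score(q, k, q_pos, k_pos, freqs):
    """Attention logit between a rotated query and a rotated key."""
    rq = rotate(q, q_pos, freqs)
    rk = rotate(k, k_pos, freqs)
    return sum(a * b for a, b in zip(rq, rk))


rng = random.Random(0)
head_dim, n_dims = 8, 3  # illustrative sizes, not taken from the paper
freqs = [[rng.uniform(-1, 1) for _ in range(n_dims)]
         for _ in range(head_dim // 2)]
q = [rng.gauss(0, 1) for _ in range(head_dim)]
k = [rng.gauss(0, 1) for _ in range(head_dim)]

# Two query/key placements with the same 3D offset (1, -3, 2).
a = score(q, k, (1, 2, 3), (0, 5, 1), freqs)
b = score(q, k, (4, 0, 7), (3, 3, 5), freqs)

# The logit depends only on the offset between the two positions.
assert math.isclose(a, b, abs_tol=1e-9)
```

Because each 2x2 block is a rotation, the product of the query rotation (transposed) and the key rotation is a rotation by the difference of their angles, which is why only the position offset enters the logit. The same code handles 1D, 2D, or 3D positions by changing `n_dims`.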

LieRE: Generalizing Rotary Position Encodings

DeepRING: Learning Roto-translation Invariant Representation for LiDAR Based Place Recognition

Round and Round We Go! What makes Rotary Positional Encodings useful?

PoPE: Legendre Orthogonal Polynomials Based Position Encoding for Large Language Models

RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network

3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding

Rotary Position Embedding for Vision Transformer

Linearized Relative Positional Encoding

Rethinking and Improving Relative Position Encoding for Vision Transformer

Reinforcement Learning with Lie Group Orientations for Robotics

Auto-Encoding Transformations in Reparameterized Lie Groups for Unsupervised Learning

Deep Projective Rotation Estimation through Relative Supervision

Conformer-based End-to-end Speech Recognition With Rotary Position Embedding

Improving Robustness and Accuracy via Relative Information Encoding in 3D Human Pose Estimation

RoFormer: Enhanced Transformer with Rotary Position Embedding

ROLE: Rotated Lorentzian Graph Embedding Model for Asymmetric Proximity

HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation

Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning

Leveraging Positional Encoding for Robust Multi-Reference-Based Object 6D Pose Estimation

Explore Better Relative Position Embeddings from Encoding Perspective for Transformer Models

Resonance RoPE: Improving Context Length Generalization of Large Language Models