Abstract: To capture user preference, transformer models have been widely applied to model sequential user behavior data. The core of the transformer architecture is the self-attention mechanism, which computes pairwise attention scores over a sequence. Because self-attention is permutation-equivariant, positional encoding is used to enhance the attention between token representations. In this setting, the pairwise attention scores can be derived from both semantic difference and positional difference. However, prior studies often model the two kinds of difference in different ways, which potentially limits the expressive capacity of sequence modeling. To address this issue, this paper proposes a novel transformer variant with complex vector attention, named EulerFormer, which provides a unified theoretical framework for formulating both semantic difference and positional difference. EulerFormer involves two key technical improvements. First, it employs a new transformation function that efficiently transforms the sequence tokens into polar-form complex vectors using Euler's formula, enabling the unified modeling of both semantic and positional information in a complex rotation form. Second, it develops a differential rotation mechanism, where the semantic rotation angles are controlled by an adaptation function, enabling the adaptive integration of semantic and positional information according to the semantic context. Furthermore, a phase contrastive learning task is proposed to improve the isotropy of contextual representations in EulerFormer. Our theoretical framework possesses a high degree of completeness and generality. It is more robust to semantic variations and, in principle, possesses superior theoretical properties. Extensive experiments conducted on four public datasets demonstrate the effectiveness and efficiency of our approach.
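
The idea of expressing semantic and positional differences as rotations in the complex plane can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that a real token vector is split into two halves that serve as real and imaginary parts, that the adaptation function is a simple affine map of the phase (the hypothetical parameters `delta` and `bias`), and that the positional angle is position times a per-dimension frequency (the hypothetical `freqs`). None of these details are stated in the abstract.

```python
import numpy as np

def to_polar(x):
    """Read the two halves of a real vector as real/imaginary parts
    and return the modulus and phase given by Euler's formula."""
    real, imag = np.split(x, 2, axis=-1)
    return np.hypot(real, imag), np.arctan2(imag, real)

def rotate(phase, positions, freqs, delta=1.0, bias=0.0):
    """Assumed rotation: affine adaptation of the semantic phase
    plus a position-dependent angle."""
    return delta * phase + bias + positions[:, None] * freqs[None, :]

def complex_attention_scores(q, k, positions, freqs, delta=1.0, bias=0.0):
    """Unnormalized scores Re(<z_q, conj(z_k)>), which equal
    sum(m_q * m_k * cos(angle_q - angle_k)) over dimensions."""
    mq, pq = to_polar(q)
    mk, pk = to_polar(k)
    zq = mq * np.exp(1j * rotate(pq, positions, freqs, delta, bias))
    zk = mk * np.exp(1j * rotate(pk, positions, freqs, delta, bias))
    return np.real(zq @ np.conj(zk).T)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))        # 4 tokens, 8 real dims -> 4 complex dims
k = rng.normal(size=(4, 8))
positions = np.arange(4.0)
freqs = 1.0 / (10000 ** (np.arange(4) / 4))
scores = complex_attention_scores(q, k, positions, freqs)
```

Under these assumptions, setting `freqs` to zero with `delta=1` recovers the ordinary dot product `q @ k.T`, and shifting every position by the same offset leaves the scores unchanged, because only angle differences enter the cosine.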

Leveraging Relaxed Equilibrium by Lazy Transition for Sequence Modeling

Relaxed Attention for Transformer Models

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention

Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models

Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Efficient Long Sequence Modeling Via State Space Augmented Transformer

Attention is All you Need

Token-Level Self-Evolution Training for Sequence-to-Sequence Learning

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Structure-aware Fine-tuning of Sequence-to-sequence Transformers for Transition-based AMR Parsing

Reinforcement learning under temporal logic constraints as a sequence modelling problem

Joint tokenization, parsing, and translation

T-REG: Preference Optimization with Token-Level Reward Regularization

Sequence Generation with Mixed Representations.

Luna: Linear Unified Nested Attention

TLM: Token-Level Masking for Transformers

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement