Abstract: To capture user preference, transformer models have been widely applied to model sequential user behavior data. The core of the transformer architecture lies in the self-attention mechanism, which computes pairwise attention scores in a sequence. Because self-attention is permutation-equivariant, positional encoding is used to enhance the attention between token representations. In this setting, the pairwise attention scores can be derived from both semantic difference and positional difference. However, prior studies often model the two kinds of difference measurements in different ways, which potentially limits the expressive capacity of sequence modeling. To address this issue, this paper proposes a novel transformer variant with complex vector attention, named EulerFormer, which provides a unified theoretical framework to formulate both semantic difference and positional difference. EulerFormer involves two key technical improvements. First, it employs a new transformation function that efficiently transforms the sequence tokens into polar-form complex vectors using Euler's formula, enabling the unified modeling of both semantic and positional information in a complex rotation form. Second, it develops a differential rotation mechanism, where the semantic rotation angles can be controlled by an adaptation function, enabling the adaptive integration of the semantic and positional information according to the semantic contexts. Furthermore, a phase contrastive learning task is proposed to improve the isotropy of contextual representations in EulerFormer. Our theoretical framework possesses a high degree of completeness and generality. It is more robust to semantic variations and, in principle, possesses superior theoretical properties. Extensive experiments conducted on four public datasets demonstrate the effectiveness and efficiency of our approach.
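
The idea of expressing semantic and positional differences as one rotation can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract gives no formulas, so the half-split pairing of dimensions, the linear adaptation `delta * phase + bias`, the RoPE-style frequencies `freqs`, and all function names below are assumptions made only for illustration.

```python
import numpy as np

def euler_transform(x):
    """Read a real vector of even length as complex numbers a + i*b in polar form.

    Assumption: the first half of x holds real parts, the second half imaginary parts.
    Returns (modulus, phase) so that a + i*b = modulus * exp(i * phase).
    """
    a, b = np.split(x, 2, axis=-1)
    return np.hypot(a, b), np.arctan2(b, a)

def rotate(phase, position, freqs, delta=1.0, bias=0.0):
    """Hypothetical adaptation function: rescale and shift the semantic phase,
    then add a positional angle that grows linearly with the position index."""
    return delta * phase + bias + position * freqs

def complex_score(q, k, pos_q, pos_k, freqs, delta=1.0, bias=0.0):
    """Attention score as the real part of <q_rot, conj(k_rot)>.

    The resulting angle is delta * (theta_q - theta_k) + (pos_q - pos_k) * freqs,
    so semantic and positional differences enter the score in the same rotation form.
    """
    lam_q, th_q = euler_transform(q)
    lam_k, th_k = euler_transform(k)
    zq = lam_q * np.exp(1j * rotate(th_q, pos_q, freqs, delta, bias))
    zk = lam_k * np.exp(1j * rotate(th_k, pos_k, freqs, delta, bias))
    return float(np.real(np.sum(zq * np.conj(zk))))

# Usage: an 8-dimensional query/key pair, i.e. 4 complex components each.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
freqs = 1.0 / (10000 ** (np.arange(4) / 4))  # assumed RoPE-style frequencies
score = complex_score(q, k, pos_q=3, pos_k=1, freqs=freqs, delta=0.5)
```

Under these assumptions, setting `delta=1`, `bias=0`, and equal positions recovers the ordinary dot product of `q` and `k`, and the score depends on the positions only through their difference. Shrinking `delta` reduces the weight of the semantic angle relative to the positional one, which is one way to read the adaptive integration that the abstract describes.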

Rethinking Position Embedding Methods in the Transformer Architecture

Improve Transformer Models with Better Relative Position Embeddings

A Simple and Effective Positional Encoding for Transformers

What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding

A bio-inspired positional embedding network for transformer-based models

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Efficient transformer with reinforced position embedding for language models

Rethinking Positional Encoding in Language Pre-training

Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

Rethinking and Improving Relative Position Encoding for Vision Transformer

A Frustratingly Easy Improvement for Position Embeddings Via Random Padding

Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention

Predictive Attention Transformer: Improving Transformer with Attention Map Prediction

Value Residual Learning For Alleviating Attention Concentration In Transformers

Triplet Attention: Rethinking the similarity in Transformers

An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

Entangled Transformer for Image Captioning

Attention Is Not All You Need Anymore

Position Embedding Needs an Independent Layer Normalization

An Intrinsic Dimension Perspective of Transformers for Sequential Modeling