Abstract: Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, the theoretical understanding of RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers, asking whether the model is capable of approximating any continuous sequence-to-sequence function. One may naturally assume the answer is affirmative, i.e., that RPE-based Transformers are universal function approximators. However, we present a negative result by showing that there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate, no matter how deep and wide the neural network is. One key reason is that most RPEs are placed inside the softmax attention, which always generates a right stochastic matrix. This restricts the network from capturing the positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With this theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications. The code will be made publicly available at https://github.com/lsj2408/URPE.
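The right-stochastic-matrix argument can be illustrated with a minimal sketch, under stated assumptions. The toy single-head attention below works on scalar tokens, and the names `rpe_attention`, `bias`, and `gate` are hypothetical, not taken from the paper's code. When every input token is identical, each softmax row sums to 1, so the output is the same at every position whatever relative-position bias is added to the logits. The optional post-softmax `gate` is only an assumed example of a mechanism that escapes the row-sum constraint; the abstract does not specify the exact form of URPE Attention.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rpe_attention(x, bias, gate=None):
    """Toy single-head attention on scalar tokens (illustrative only).

    x    : list of n scalar tokens, used as queries, keys and values
    bias : dict {i - j: float}, relative-position bias added to the logits
    gate : optional dict {i - j: float}, multiplier applied AFTER the softmax
           (a hypothetical way to break the row-stochastic constraint)
    """
    n = len(x)
    out = []
    for i in range(n):
        logits = [x[i] * x[j] + bias[i - j] for j in range(n)]
        weights = softmax(logits)  # non-negative, sums to 1
        if gate is not None:
            weights = [w * gate[i - j] for j, w in enumerate(weights)]
        out.append(sum(w * x[j] for j, w in enumerate(weights)))
    return out

n = 4
x = [1.0] * n                      # identical tokens at every position
offsets = range(-(n - 1), n)

bias_linear = {d: 0.5 * d for d in offsets}
bias_local = {d: -abs(d) for d in offsets}
gate = {d: 1.0 + 0.1 * d for d in offsets}

plain_a = rpe_attention(x, bias_linear)
plain_b = rpe_attention(x, bias_local)
gated = rpe_attention(x, bias_linear, gate)

print(plain_a)  # every entry is (numerically) 1.0: position is invisible
print(plain_b)  # same, despite a different bias
print(gated)    # entries now differ across positions
```

With plain softmax attention both bias choices produce the same constant output, so no stack of such layers can make position-dependent outputs from this input. Once the weights are rescaled after the softmax, the rows no longer sum to 1 and the output varies with position.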

Bidirectional Transformer with Absolute-Position Aware Relative Position Encoding for Encoding Sentences

Improve Transformer Models with Better Relative Position Embeddings

Design of a Modified Transformer Architecture Based on Relative Position Coding

An Augmented Transformer Architecture for Natural Language Generation Tasks

Relative Positional Encoding Family via Unitary Transformation

Complex-Valued Relative Positional Encodings for Transformer

MS-Transformer: Introduce multiple structural priors into a unified transformer for encoding sentences

Non-Autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition

A Bidirectional Context Embedding Transformer for Automatic Speech Recognition

Rethinking and Improving Relative Position Encoding for Vision Transformer

RoFormer: Enhanced Transformer with Rotary Position Embedding

An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding

Transformer with Bidirectional Decoder for Speech Recognition

Linearized Relative Positional Encoding

A Simple and Effective Positional Encoding for Transformers

What Should Be Encoded by Position Embedding for Neural Network Language Models?

Modeling Graph Structure in Transformer for Better AMR-to-Text Generation

Transformer-Based End-to-End Speech Translation With Rotary Position Embedding

Your Transformer May Not be as Powerful as You Expect

Explore Better Relative Position Embeddings from Encoding Perspective for Transformer Models