Abstract: Transformers are now ubiquitous for sequence modeling tasks, but their extension to multi-dimensional data remains a challenge due to the quadratic cost of the attention mechanism. In this paper, we propose Higher-Order Transformers (HOT), a novel architecture designed to efficiently process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism that reduces the attention cost to quadratic in each axis' dimension, rather than quadratic in the total size of the input tensor. To further enhance efficiency, HOT leverages kernelized attention, reducing the complexity to linear. This strategy maintains the model's expressiveness while enabling scalable attention computation. We validate the effectiveness of HOT on two high-dimensional tasks: multivariate time series forecasting and 3D medical image classification. Experimental results demonstrate that HOT achieves competitive performance while significantly improving computational efficiency, showcasing its potential for tackling a wide range of complex, multi-dimensional data.

What problem does this paper attempt to address?

The paper addresses the computational and memory cost that existing Transformer models incur on high-dimensional tensor data. Standard self-attention scales quadratically with the total number of input positions, so applying it to multi-dimensional data (such as videos, multi-dimensional time series, and 3D medical images) quickly becomes expensive, which limits its use on high-dimensional inputs.

To solve this problem, the authors propose Higher-Order Transformers (HOT), a new architecture for efficiently processing data with multiple axes (i.e., higher-order tensors). The key innovations of HOT are:

1. **Kronecker-factorized higher-order attention**: The high-order attention matrix is approximated by a Kronecker product of per-axis attention matrices, which reduces the attention cost from quadratic in the total input size to quadratic in each axis' dimension:
   \[ S_h \approx S^{(1)}_h \otimes S^{(2)}_h \otimes \dots \otimes S^{(k)}_h \]
   where each \( S^{(i)}_h \in \mathbb{R}^{N_i \times N_i} \) is the attention weight matrix for the \( i \)-th axis.
2. **Kernelized linear attention**: Each per-axis attention matrix is further approximated with a kernel feature map, reducing the complexity from quadratic to linear:
   \[ S^{(i)}_h \approx (Z^{(i)}_h)^{-1} \phi(\tilde{Q}^{(i)}_h) \phi(\tilde{K}^{(i)}_h)^\top \]
   where \( Z^{(i)}_h \in \mathbb{R}^{N_i \times N_i} \) is a diagonal normalization matrix and \( \phi \) is a kernel feature mapping function.

With these changes, HOT improves computational efficiency while maintaining the expressive power of the model. Experimental results show that HOT achieves competitive performance on tasks such as long-term time series forecasting and 3D medical image classification, while significantly reducing computational complexity.
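The two factorizations can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a two-axis input of shape `(N1, N2, D)`, uses random per-axis queries and keys as stand-ins for \( \tilde{Q}^{(i)}_h \) and \( \tilde{K}^{(i)}_h \) (how the paper derives them is not covered here), and picks `elu(x) + 1` as a hypothetical choice for the feature map \( \phi \). All function names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_attention(q, k):
    # Row-stochastic (N_i, N_i) attention matrix for a single axis.
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def kron_attention_apply(S1, S2, V):
    # Apply kron(S1, S2) to V of shape (N1, N2, D) without ever
    # building the (N1*N2, N1*N2) matrix: contract each axis separately.
    return np.einsum("ai,bj,ijd->abd", S1, S2, V)


def phi(x):
    # Assumed positive feature map: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_axis_attention_apply(q, k, v):
    # Computes Z^{-1} phi(q) (phi(k)^T v) for one axis.
    # The (N, N) matrix phi(q) phi(k)^T is never formed, so the cost
    # grows linearly with the axis length N.
    fq, fk = phi(q), phi(k)
    kv = fk.T @ v                # (r, D) summary of keys and values
    z = fq @ fk.sum(axis=0)      # (N,) diagonal of Z
    return (fq @ kv) / z[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n1, n2, d = 4, 6, 8
    V = rng.normal(size=(n1, n2, d))
    S1 = axis_attention(rng.normal(size=(n1, d)), rng.normal(size=(n1, d)))
    S2 = axis_attention(rng.normal(size=(n2, d)), rng.normal(size=(n2, d)))

    out = kron_attention_apply(S1, S2, V)
    # Reference: materialize the full Kronecker product and compare.
    full = (np.kron(S1, S2) @ V.reshape(n1 * n2, d)).reshape(n1, n2, d)
    print("factorized matches full:", np.allclose(out, full))
```

In this sketch the full attention matrix would hold `(N1 * N2) ** 2` entries, while the two factors hold only `N1 ** 2 + N2 ** 2`, which is the per-axis quadratic cost the summary describes. The kernelized variant removes even those per-axis matrices by reordering the matrix products.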

Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data

Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers

HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers

Axial Attention in Multidimensional Transformers

DARKER: Efficient Transformer with Data-Driven Attention Mechanism for Time Series

Adaptive Multi-Resolution Attention with Linear Complexity

Representational Strengths and Limitations of Transformers

Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization

FAST: Factorizable Attention for Speeding up Transformers

Improving Transformers with Dynamically Composable Multi-Head Attention

Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning

Generalized Probabilistic Attention Mechanism in Transformers

H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

Efficient High-Resolution Time Series Classification via Attention Kronecker Decomposition

Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond

Jump Self-attention: Capturing High-order Statistics in Transformers

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

The Power of Hard Attention Transformers on Data Sequences: A Formal Language Theoretic Perspective

Transformer Acceleration with Dynamic Sparse Attention

How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation

Vision Transformers with Hierarchical Attention