Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data

Soroush Omranpour, Guillaume Rabusseau, Reihaneh Rabbany
2024-12-04
Abstract: Transformers are now ubiquitous for sequence modeling tasks, but their extension to multi-dimensional data remains a challenge due to the quadratic cost of the attention mechanism. In this paper, we propose Higher-Order Transformers (HOT), a novel architecture designed to efficiently process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism that reduces the attention cost to quadratic in each axis' dimension, rather than quadratic in the total size of the input tensor. To further enhance efficiency, HOT leverages kernelized attention, reducing the complexity to linear. This strategy maintains the model's expressiveness while enabling scalable attention computation. We validate the effectiveness of HOT on two high-dimensional tasks: multivariate time series forecasting and 3D medical image classification. Experimental results demonstrate that HOT achieves competitive performance while significantly improving computational efficiency, showcasing its potential for tackling a wide range of complex, multi-dimensional data.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the computational and memory cost that standard Transformers incur on high-dimensional tensor data. Self-attention scales quadratically with the number of input positions, and for multi-dimensional data (such as videos, multivariate time series, and 3D medical images) that number is the total size of the input tensor, which limits the use of attention on such inputs. To address this, the authors propose Higher-Order Transformers (HOT), an architecture for efficiently processing data with multiple axes (i.e., higher-order tensors). The key innovations of HOT are:

1. **Kronecker-factorized higher-order attention**: The higher-order attention matrix is approximated by a Kronecker product of per-axis attention matrices, which reduces the attention cost to quadratic in each axis' dimension rather than quadratic in the total size of the input tensor: \[ S_h \approx S^{(1)}_h \otimes S^{(2)}_h \otimes \dots \otimes S^{(k)}_h \] where each \( S^{(i)}_h \in \mathbb{R}^{N_i \times N_i} \) is the attention weight matrix for the \( i \)-th axis.
2. **Kernelized linear attention**: Each per-axis attention matrix is in turn approximated with kernelized attention, which reduces the complexity from quadratic to linear: \[ S^{(i)}_h \approx (Z^{(i)}_h)^{-1} \phi(\tilde{Q}^{(i)}_h) \phi(\tilde{K}^{(i)}_h)^\top \] where \( Z^{(i)}_h \in \mathbb{R}^{N_i \times N_i} \) is a diagonal normalization matrix and \( \phi \) is a kernel feature map.

Together, these changes improve computational efficiency while maintaining the expressive power of the model.
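The Kronecker factorization can be illustrated with a minimal NumPy sketch for a tensor with two positional axes. It is not the authors' implementation: the function names, the mean pooling over the other axis used to build each per-axis attention matrix, and the softmax scoring are all assumptions made for illustration. The sketch shows that applying the per-axis matrices one axis at a time gives the same result as multiplying by the full Kronecker product, without ever forming the \( N_1 N_2 \times N_1 N_2 \) matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(Q, K, axis):
    # Attention matrix for one positional axis, shape (N_axis, N_axis).
    # Assumption: the other positional axes are mean-pooled before scoring.
    other = tuple(a for a in range(Q.ndim - 1) if a != axis)
    q = Q.mean(axis=other)  # (N_axis, D)
    k = K.mean(axis=other)  # (N_axis, D)
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def kronecker_attention_2d(Q, K, V):
    # Apply S1 along axis 0 and S2 along axis 1 of V, which has shape (N1, N2, D).
    S1 = axis_attention(Q, K, 0)  # (N1, N1)
    S2 = axis_attention(Q, K, 1)  # (N2, N2)
    out = np.einsum("ia,jb,abd->ijd", S1, S2, V)
    return out, S1, S2

rng = np.random.default_rng(0)
N1, N2, D = 4, 5, 8
Q, K, V = (rng.standard_normal((N1, N2, D)) for _ in range(3))

out, S1, S2 = kronecker_attention_2d(Q, K, V)

# Reference: build the full (N1*N2) x (N1*N2) Kronecker attention explicitly.
full = np.kron(S1, S2) @ V.reshape(N1 * N2, D)
print(np.allclose(out.reshape(N1 * N2, D), full))
```

In this sketch the per-axis matrices hold \( N_1^2 + N_2^2 \) entries, while the explicit Kronecker product holds \( (N_1 N_2)^2 \); the gap widens as more axes are added.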
Experimental results show that HOT achieves competitive performance on long-term time series forecasting and 3D medical image classification while significantly reducing computational complexity.
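The kernelized approximation \( S \approx Z^{-1} \phi(Q) \phi(K)^\top \) summarized above can be sketched for a single axis as follows. This is a hedged illustration rather than the paper's code: the feature map \( \phi(x) = \mathrm{elu}(x) + 1 \) is an assumption (a common choice in the linear-attention literature), and the function names are hypothetical. The point is the order of multiplication: computing \( \phi(K)^\top V \) first avoids forming the \( N \times N \) attention matrix, so the cost grows linearly in \( N \).

```python
import numpy as np

def phi(x):
    # Assumed positive feature map: elu(x) + 1.
    # np.minimum keeps exp from overflowing on the branch np.where discards.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    # Computes Z^{-1} phi(q) (phi(k)^T v) without building the N x N matrix.
    fq, fk = phi(q), phi(k)        # (N, D) each
    kv = fk.T @ v                  # (D, Dv), cost linear in N
    z = fq @ fk.sum(axis=0)        # (N,), diagonal of the normalizer Z
    return (fq @ kv) / z[:, None]

def explicit_attention(q, k, v):
    # Reference: form S = Z^{-1} phi(q) phi(k)^T explicitly (quadratic in N).
    scores = phi(q) @ phi(k).T                      # (N, N)
    S = scores / scores.sum(axis=1, keepdims=True)  # row-normalize
    return S @ v

rng = np.random.default_rng(0)
N, D = 6, 4
q, k, v = (rng.standard_normal((N, D)) for _ in range(3))

fast = linear_attention(q, k, v)
slow = explicit_attention(q, k, v)
print(np.allclose(fast, slow))
```

Both functions compute the same quantity up to floating-point error; only the linear variant avoids the quadratic intermediate.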