Abstract: We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in Time Series Forecasting (TSF) tasks:

1. **Performance Improvement of AR Transformer**:
   - Firstly, the paper shows that through appropriate tokenization and training methods, using only the AR Transformer model can achieve performance comparable to the current best-performing baseline models in TSF tasks.
   - To further improve performance, the authors introduce the WAVE attention mechanism, enabling the pure-decoder AR Transformer to outperform the existing best-performing baseline models in TSF tasks.
2. **Combining ARMA Structure**:
   - Inspired by the ARMA model in statistics, the authors introduce the complete ARMA structure into the existing autoregressive attention mechanism. The ARMA model can handle both historical data and the cumulative impact of prediction errors simultaneously, thus better capturing long-term and short-term dependencies.
   - By introducing the Moving Average (MA) term, the ability of the AR attention mechanism to model local time patterns is enhanced without significantly increasing computational complexity or the number of parameters.
3. **Indirect Generation of MA Weights**:
   - A method for indirectly generating MA weights is proposed, avoiding explicit calculation of the MA attention matrix, thereby maintaining the efficiency of the linear attention mechanism.
   - Specific implicit MA weight generation techniques are designed to ensure proper decoupling and processing of short-term effects, allowing the AR part to focus on long-term and periodic patterns.
4. **Solving Long- and Short-Term Dependency Problems**:
   - In TSF tasks, data usually has stable periodicity and short-term impacts. Traditional exponential decay strategies are not effective in handling these characteristics, so a new method is required to decouple short-term impacts, enabling the AR part to better handle long-term dependencies.
### Formula Summary

- **ARMA Structure Extension**:

  \[ v_{t+1} = o_{\text{AR}}^{t} + o_{\text{MA}}^{t} + \epsilon_t \]

  where

  \[ o_{\text{AR}}^{t} = \sum_{i=1}^{t} w_{t,i} \odot v_i \]

  \[ o_{\text{MA}}^{t} = \sum_{j=1}^{t-1} \theta_{t-1,j} \odot \epsilon_j \]

  \[ r_t = \sum_{j=1}^{t-1} \theta_{t-1,j} \odot \epsilon_j + \epsilon_t \]

- **Implicit MA Weight Generation**:

  \[ B = \Theta \cdot (I + \Theta)^{-1} \]

  \[ \Theta = B \cdot (I - B)^{-1} \]

### Experimental Results

The experimental results show that the WAVE attention mechanism significantly improves the prediction performance on multiple public time-series datasets, especially in scenarios where long- and short-term dependencies are complex. Specifically, the WAVE Transformer has better average ranking and MSE on multiple datasets than other baseline models.

### Conclusion

This paper successfully solves the performance bottleneck of AR Transformer in time-series prediction tasks by introducing the WAVE attention mechanism, and realizes efficient long- and short-term dependency modeling through combining the ARMA structure and the indirect MA weight generation method.
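Returning to the two implicit MA weight relations in the Formula Summary, the following is a minimal numerical sketch of why they are useful. It is not the authors' implementation. It assumes scalar MA weights (in place of the element-wise product), and it assumes Θ is laid out as a strictly lower-triangular matrix whose row t holds the weights θ_{t-1,j}. Under those assumptions the summary's equations read o_MA = Θ ε and r = (I + Θ) ε in matrix form, which would give o_MA = B r, so the MA output could be obtained from the residuals r without recovering ε explicitly. All helper names (`inv_unit_lower`, `b_mat`, and so on) are hypothetical.

```python
# Sketch only: scalar MA weights, Theta stored as a strictly lower-triangular
# matrix. Both are assumptions made for illustration, not taken from the paper.

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(a, b, sign=1.0):
    """Return a + sign * b for two equally sized matrices."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def inv_unit_lower(l):
    """Invert a lower-triangular matrix with unit diagonal by forward substitution."""
    n = len(l)
    inv = identity(n)
    for col in range(n):
        for i in range(col + 1, n):
            inv[i][col] = -sum(l[i][k] * inv[k][col] for k in range(col, i))
    return inv

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Arbitrary illustrative MA weights (row t, column j < t).
theta = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.2, 0.4, 0.0, 0.0],
    [0.1, -0.3, 0.6, 0.0],
]
eye = identity(len(theta))

# B = Theta (I + Theta)^{-1}, then map back with Theta = B (I - B)^{-1}.
b_mat = matmul(theta, inv_unit_lower(add(eye, theta)))
theta_back = matmul(b_mat, inv_unit_lower(add(eye, b_mat, sign=-1.0)))

# Arbitrary innovations eps; residuals follow r = (I + Theta) eps.
eps = [1.0, -0.5, 0.25, 2.0]
r = matvec(add(eye, theta), eps)

o_ma_explicit = matvec(theta, eps)  # uses the (normally unobserved) eps
o_ma_implicit = matvec(b_mat, r)    # uses only the residuals r

print("round-trip error:", max_abs_diff(theta_back, theta))
print("MA output gap:", max(abs(x - y) for x, y in zip(o_ma_explicit, o_ma_implicit)))
```

Both printed differences should be at floating-point noise level, which is consistent with the two relations being inverses of each other. The dense matrix inverse here is for checking only; the summary states that the actual method avoids materializing the MA attention matrix.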

WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

WaveRoRA: Wavelet Rotary Route Attention for Multivariate Time Series Forecasting

Temporal Conditional VAE for Distributional Drift Adaptation in Multivariate Time Series

Attention as Robust Representation for Time Series Forecasting

A hybrid framework for multivariate long-sequence time series forecasting

CrossWaveNet: A dual-channel network with deep cross-decomposition for Long-term Time Series Forecasting

Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations

Distributional Drift Adaptation With Temporal Conditional Variational Autoencoder for Multivariate Time Series Forecasting

Revisiting Attention for Multivariate Time Series Forecasting

Temporal pattern attention for multivariate time series forecasting

Enhancing Neural Network Based Hybrid Learning with Empirical Wavelet Transform for Time Series Forecasting

WEITS: A Wavelet-enhanced residual framework for interpretable time series forecasting

AD-autoformer: decomposition transformers with attention distilling for long sequence time-series forecasting

DBAFormer: A Double-Branch Attention Transformer for Long-Term Time Series Forecasting

WaveForM: Graph Enhanced Wavelet Learning for Long Sequence Forecasting of Multivariate Time Series

TS2ARCformer: A Multi-Dimensional Time Series Forecasting Framework for Short-Term Load Prediction

MultiWave: Multiresolution Deep Architectures through Wavelet Decomposition for Multivariate Time Series Prediction

Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization

W-FENet: Wavelet-based Fourier-Enhanced Network Model Decomposition for Multivariate Long-Term Time-Series Forecasting

Hybrid Autoregressive and Non-Autoregressive Transformer Models for Speech Recognition