WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang
2025-02-07
Abstract: We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention, which incorporates the ARMA structure, consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several key problems in Time Series Forecasting (TSF) tasks:

1. **Performance Improvement of AR Transformer**:
   - The paper first shows that, with appropriate tokenization and training methods, a decoder-only AR Transformer alone can achieve performance comparable to the current best-performing baseline models in TSF tasks.
   - To further improve performance, the authors introduce the WAVE attention mechanism, which enables the decoder-only AR Transformer to outperform the existing best-performing baseline models in TSF tasks.
2. **Combining ARMA Structure**:
   - Inspired by the ARMA model in statistics, the authors introduce the complete ARMA structure into existing autoregressive attention mechanisms. The ARMA model accounts for both historical data and the cumulative impact of past prediction errors, and thus captures long-term and short-term dependencies better.
   - Introducing the Moving Average (MA) term enhances the ability of the AR attention mechanism to model local temporal patterns without significantly increasing computational complexity or the number of parameters.
3. **Indirect Generation of MA Weights**:
   - A method for indirectly generating MA weights is proposed. It avoids explicit calculation of the MA attention matrix and thereby preserves the efficiency of the linear attention mechanism.
   - Specific implicit MA weight generation techniques are designed to ensure that short-term effects are properly decoupled and processed, allowing the AR part to focus on long-term and periodic patterns.
4. **Solving Long- and Short-Term Dependency Problems**:
   - In TSF tasks, data usually exhibits stable periodicity together with short-term impacts. Traditional exponential decay strategies do not handle these characteristics effectively, so a new method is required to decouple short-term impacts and let the AR part handle long-term dependencies better.
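The classical ARMA idea behind item 2 can be sketched with a zero-mean ARMA(1,1) process. This is a minimal illustration of the statistical model only, not of the WAVE attention layer; the function names and the coefficients `phi` and `theta` are assumptions chosen for the example.

```python
import numpy as np


def simulate_arma11(n, phi, theta, seed=0):
    """Simulate x[t] = phi * x[t-1] + theta * noise[t-1] + noise[t]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    x = np.zeros(n)
    x[0] = noise[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + theta * noise[t - 1] + noise[t]
    return x, noise


def arma11_one_step(x, phi, theta):
    """One-step-ahead predictions and residuals for a zero-mean ARMA(1,1).

    pred[t] combines the AR term (previous value) with the MA term
    (previous prediction error), which is only known after step t-1.
    """
    x = np.asarray(x, dtype=float)
    pred = np.zeros_like(x)
    eps = np.zeros_like(x)
    eps[0] = x[0]  # no history yet, so the first prediction is 0
    for t in range(1, len(x)):
        pred[t] = phi * x[t - 1] + theta * eps[t - 1]
        eps[t] = x[t] - pred[t]
    return pred, eps


# Illustrative coefficients (assumed, not taken from the paper).
x, noise = simulate_arma11(200, phi=0.7, theta=0.4)
pred, eps = arma11_one_step(x, phi=0.7, theta=0.4)

# An AR-only predictor leaves the short-term MA effect in its residual.
ar_only_residual = x[1:] - 0.7 * x[:-1]
```

With the true coefficients, the recursively recovered residuals `eps` coincide with the simulated noise, while the AR-only residual still contains the term `theta * noise[t-1]`. That leftover short-term structure is what the MA term is meant to absorb.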
### Formula Summary

- **ARMA Structure Extension**:

  \[ v_{t + 1} = o_{\text{AR}}^{t} + o_{\text{MA}}^{t} + \epsilon_t \]

  where

  \[ o_{\text{AR}}^{t} = \sum_{i = 1}^{t} w_{t,i} \odot v_i \]

  \[ o_{\text{MA}}^{t} = \sum_{j = 1}^{t - 1} \theta_{t - 1,j} \odot \epsilon_j \]

  \[ r_t = \sum_{j = 1}^{t - 1} \theta_{t - 1,j} \odot \epsilon_j + \epsilon_t \]

- **Implicit MA Weight Generation**:

  \[ B = \Theta \cdot (I + \Theta)^{-1} \]

  \[ \Theta = B \cdot (I - B)^{-1} \]

### Experimental Results

The experimental results show that the WAVE attention mechanism significantly improves prediction performance on multiple public time-series datasets, especially in scenarios where long- and short-term dependencies are complex. Specifically, the WAVE Transformer achieves a better average ranking and MSE on multiple datasets than other baseline models.

### Conclusion

This paper successfully solves the performance bottleneck of the AR Transformer in time-series prediction tasks by introducing the WAVE attention mechanism, and realizes efficient long- and short-term dependency modeling by combining the ARMA structure with the indirect MA weight generation method.
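The two implicit MA weight identities in the Formula Summary can be checked numerically. The sketch below rests on assumptions the summary does not state: a single channel (so \(\odot\) reduces to scalar multiplication), and \(\Theta\) read as a strictly lower-triangular matrix whose row \(t\) holds the weights \(\theta_{t-1,j}\) for \(j < t\). Under that reading, stacking the formula for \(r_t\) gives \(r = (I + \Theta)\epsilon\), so \(B\) maps the residuals \(r\) directly to the MA output \(\Theta\epsilon\). All function names are hypothetical.

```python
import numpy as np


def make_ma_weights(T, scale=0.3, seed=0):
    """Random strictly lower-triangular Theta (assumed layout of theta_{t-1,j})."""
    rng = np.random.default_rng(seed)
    return np.tril(rng.normal(scale=scale, size=(T, T)), k=-1)


def implicit_ma_matrix(theta):
    """B = Theta (I + Theta)^{-1}; I + Theta is unit lower-triangular, hence invertible."""
    eye = np.eye(theta.shape[0])
    return theta @ np.linalg.inv(eye + theta)


def recover_theta(B):
    """Theta = B (I - B)^{-1}, the inverse of the map above."""
    eye = np.eye(B.shape[0])
    return B @ np.linalg.inv(eye - B)


T = 6  # illustrative sequence length
theta = make_ma_weights(T)
B = implicit_ma_matrix(theta)

# Innovations and the residuals r_t = sum_j theta_{t-1,j} * eps_j + eps_t.
eps = np.random.default_rng(1).normal(size=T)
r = (np.eye(T) + theta) @ eps

o_ma_explicit = theta @ eps  # MA output computed from the innovations
o_ma_implicit = B @ r        # same output computed from r alone
```

In this toy setting `recover_theta(B)` reproduces `theta` and the two MA outputs agree, which illustrates why working with \(B\) can stand in for an explicit \(\Theta\). The dense matrix inverses are used only for clarity and do not reflect how an efficient attention layer would compute these quantities.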