Abstract: Machine learning architectures, including transformers and recurrent neural networks (RNNs), have revolutionized forecasting in applications ranging from text processing to extreme weather. Notably, advanced network architectures tuned for applications such as natural language processing are transferable to other tasks, such as spatiotemporal forecasting. However, there is a scarcity of ablation studies that illustrate the key components enabling this forecasting accuracy. The absence of such studies, although explainable by the associated computational cost, intensifies the belief that these models ought to be considered black boxes. In this work, we decompose the key architectural components of the most powerful neural architectures, namely gating and recurrence in RNNs, and attention mechanisms in transformers. Then, we synthesize and build novel hybrid architectures from the standard blocks, performing ablation studies to identify which mechanisms are effective for each task. The importance of considering these components as hyper-parameters that can augment the standard architectures is exhibited on various forecasting datasets, ranging from the spatiotemporal chaotic dynamics of the multiscale Lorenz 96 system and the Kuramoto-Sivashinsky equation to standard real-world time-series benchmarks. A key finding is that neural gating and attention improve the performance of all standard RNNs in most tasks, while the addition of a notion of recurrence in transformers is detrimental. Furthermore, our study reveals that a novel, sparsely used architecture, which integrates Recurrent Highway Networks with neural gating and attention mechanisms, emerges as the best-performing architecture in high-dimensional spatiotemporal forecasting of dynamical systems.
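The three mechanisms named in the abstract (recurrence, gating, and attention) can be illustrated with a toy cell. The following is a minimal NumPy sketch under stated assumptions, not the architecture studied in the paper: the class `GatedAttentiveCell`, the helper `run`, the weight names, and the specific way the attention context is mixed into the candidate state are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class GatedAttentiveCell:
    """Hypothetical toy cell combining recurrence, attention, and gating."""

    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(n_hid)
        self.n_hid = n_hid
        self.W_x = rng.uniform(-s, s, (n_hid, n_in))          # input weights
        self.W_h = rng.uniform(-s, s, (n_hid, n_hid))         # recurrent weights
        self.W_g = rng.uniform(-s, s, (n_hid, n_in + n_hid))  # gate weights

    def step(self, x, h_prev, history):
        # Recurrence: the candidate state depends on the previous hidden state.
        cand = np.tanh(self.W_x @ x + self.W_h @ h_prev)

        # Attention: scaled dot-product scores between the candidate (query)
        # and all past hidden states, which act as both keys and values.
        if history:
            H = np.stack(history)                      # shape (t, n_hid)
            scores = H @ cand / np.sqrt(self.n_hid)    # shape (t,)
            context = softmax(scores) @ H              # shape (n_hid,)
            cand = np.tanh(cand + context)

        # Gating: a sigmoid gate interpolates between the new candidate
        # and the previous hidden state, elementwise.
        g = sigmoid(self.W_g @ np.concatenate([x, h_prev]))
        return g * cand + (1.0 - g) * h_prev


def run(cell, xs):
    """Unroll the cell over a sequence xs of shape (T, n_in)."""
    h = np.zeros(cell.n_hid)
    history = []
    for x in xs:
        h = cell.step(x, h, history)
        history.append(h)
    return np.stack(history)  # shape (T, n_hid)
```

In this sketch an ablation amounts to switching a branch off: skipping the attention branch leaves a gated recurrent cell, and fixing `g` to 1 removes the gating. That is roughly the sense in which such components can be treated as hyper-parameters.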

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper explores the following problems:

1. **Model Transferability**: Investigate how Transformers and gated Recurrent Neural Networks (gated RNNs), originally designed for tasks such as Natural Language Processing (NLP), can be transferred to the forecasting of dynamical systems. Although these models perform well in other fields, their performance and applicability in dynamical-system forecasting still need verification.
2. **Understanding of Key Mechanisms**: Through ablation studies, decompose the key components of Transformers and gated RNNs (gating, attention, and recurrence) and understand their roles. This helps reveal which mechanisms are effective for which tasks and why certain components perform better in specific tasks.
3. **Exploration of New Architectures**: Based on this understanding, construct new hybrid architectures and evaluate them on different types of datasets. Specifically, the paper combines different gating and attention mechanisms to build more effective forecasting models.
4. **Avoiding the "Black-Box" Phenomenon**: Through detailed analysis and experiments on the internal mechanisms of the models, make these complex deep-learning models more transparent and controllable.

### Main Findings

- **Improvement from Gating and Attention Mechanisms**: In most tasks, adding neural gating and attention mechanisms improves the performance of standard RNNs.
- **Influence of Recurrence Mechanisms**: Introducing recurrence into Transformers reduces their performance.
- **Optimal Architecture**: A novel, sparsely used architecture that combines Recurrent Highway Networks (RHNs) with neural gating and attention mechanisms performs best in the forecasting of high-dimensional spatiotemporal dynamical systems.

### Application Scenarios

The paper evaluates its architectures on several datasets, including:

- Spatiotemporal chaotic dynamics of the multiscale Lorenz 96 system
- The Kuramoto-Sivashinsky equation
- Standard real-world time-series benchmark datasets

These results show the potential of the new architecture for forecasting complex dynamical systems and provide reference points for future research.
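As a rough illustration of how a forecasting dataset of this kind can be produced, the sketch below integrates the standard single-scale Lorenz 96 system, $\dot{x}_i = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F$ with periodic indices, and slices the trajectory into input/target windows. It is an assumption-laden simplification: the paper uses the multiscale variant, and the function names, the forcing `F = 8`, the 40 variables, the time step, and the RK4 integrator are common illustrative choices rather than settings taken from the paper.

```python
import numpy as np


def lorenz96_rhs(x, forcing):
    """Single-scale Lorenz 96 right-hand side with periodic boundaries."""
    # np.roll(x, -1)[i] == x[i+1]; np.roll(x, 2)[i] == x[i-2]; np.roll(x, 1)[i] == x[i-1]
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing


def rk4_step(x, dt, forcing):
    """One classical fourth-order Runge-Kutta step."""
    k1 = lorenz96_rhs(x, forcing)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_rhs(x + dt * k3, forcing)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def simulate(n_vars=40, forcing=8.0, dt=0.01, n_steps=1000, seed=0):
    """Integrate from a slightly perturbed fixed point; returns (n_steps, n_vars)."""
    rng = np.random.default_rng(seed)
    x = forcing + 0.01 * rng.standard_normal(n_vars)
    traj = np.empty((n_steps, n_vars))
    for t in range(n_steps):
        x = rk4_step(x, dt, forcing)
        traj[t] = x
    return traj


def make_windows(traj, history):
    """Pair each window of `history` consecutive states with the next state."""
    n = len(traj) - history
    inputs = np.stack([traj[i:i + history] for i in range(n)])
    targets = traj[history:]
    return inputs, targets


traj = simulate()
inputs, targets = make_windows(traj, history=10)
# inputs has shape (990, 10, 40); targets has shape (990, 40)
```

Each `inputs[i]` holds ten consecutive states and `targets[i]` is the state that follows them, which is the one-step-ahead setup that a recurrent or attention-based forecaster would be trained on.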

Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems
