Adaptive Multi-Scale Decomposition Framework for Time Series Forecasting

Yifan Hu, Peiyuan Liu, Peng Zhu, Dawei Cheng, Tao Dai
2024-06-06
Abstract: Transformer-based and MLP-based methods have emerged as leading approaches in time series forecasting (TSF). While Transformer-based methods excel in capturing long-range dependencies, they suffer from high computational complexities and tend to overfit. Conversely, MLP-based methods offer computational efficiency and adeptness in modeling temporal dynamics, but they struggle with capturing complex temporal patterns effectively. To address these challenges, we propose a novel MLP-based Adaptive Multi-Scale Decomposition (AMD) framework for TSF. Our framework decomposes time series into distinct temporal patterns at multiple scales, leveraging the Multi-Scale Decomposable Mixing (MDM) block to dissect and aggregate these patterns in a residual manner. Complemented by the Dual Dependency Interaction (DDI) block and the Adaptive Multi-predictor Synthesis (AMS) block, our approach effectively models both temporal and channel dependencies and utilizes autocorrelation to refine multi-scale data integration. Comprehensive experiments demonstrate that our AMD framework not only overcomes the limitations of existing methods but also consistently achieves state-of-the-art performance in both long-term and short-term forecasting tasks across various datasets, showcasing superior efficiency. Code is available at https://github.com/TROUBADOUR000/AMD
Machine Learning
What problem does this paper attempt to address?
This paper addresses the limitations of existing methods in time series forecasting (TSF). Specifically:

1. **Problems with Transformer-based methods**:
   - **High computational complexity**: Because of the self-attention mechanism, the computational complexity of a Transformer model grows quadratically with the sequence length.
   - **Overfitting**: On long sequences, the self-attention mechanism may weaken temporal relationships and over-emphasize abrupt points, which leads to overfitting.
2. **Problems with MLP-based methods**:
   - **Difficulty capturing complex temporal patterns**: MLP-based methods are computationally efficient and model temporal dynamics well, but the simplicity of a linear mapping makes it hard for them to capture complex temporal patterns. The result is an information bottleneck that limits prediction accuracy.

To address these problems, the authors propose an MLP-based Adaptive Multi-Scale Decomposition (AMD) framework built from three components:

- **Multi-scale decomposition**: Decompose the time series into multiple temporal patterns at different scales, and use the Multi-Scale Decomposable Mixing (MDM) block to dissect and aggregate these patterns.
- **Dual dependency interaction**: Model temporal and channel dependencies simultaneously through the Dual Dependency Interaction (DDI) block.
- **Adaptive multi-predictor synthesis**: Use the Adaptive Multi-predictor Synthesis (AMS) block to generate weights adaptively for the different temporal patterns and combine them for prediction.

Through these improvements, the AMD framework not only overcomes the limitations of existing methods but also achieves state-of-the-art performance in long-term and short-term forecasting tasks on multiple datasets, with higher efficiency and accuracy.

### Formula summary

1. **Linear model prediction formula**:
   \[ \hat{Y} = XA \oplus b \in R^{C \times L} \]
   where \( \oplus \) denotes the addition of column vectors.
2. **Multi-scale information transformation formula**:
   \[ g_i(x) = f_i(x) + g_{i + 1}(x)W_i \]
   where \( W_i \in R^{\left\lfloor \frac{L}{d^{i + 1}} \right\rfloor \times \left\lfloor \frac{L}{d^i} \right\rfloor} \).
3. **Selector weight calculation formula**:
   \[ S = \text{Softmax}(\text{TopK}(\text{Softmax}(Q(u)), k)) \]
   \[ Q(u) = \text{Decompose}(u) + \psi \cdot \text{Softplus}(\text{Decompose}(u) \cdot W_{\text{noise}}) \]
   where \( k \) is the number of dominant temporal patterns, \( \psi \sim N(0, 1) \), and \( W_{\text{noise}} \in R^{m \times m} \).
4. **Loss function**:
   \[ L = L_{\text{pred}} + \lambda_1 L_{\text{selector}} + \lambda_2 \|\Theta\|^2 \]
   where \( L_{\text{pred}} = \sum_{i = 0}^T \|y_i - \hat{y}_i\|^2_2 \), \( L_{\text{selector}} = \frac{\text{Var}(S)}{\text{Mean}(S)^2 + \epsilon} \), and \( \epsilon \) is a small positive number that prevents numerical instability.
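
The formulas above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). In particular, the average-pooling `downsample`, the convention that TopK masks non-selected entries with negative infinity, the per-entry sampling of \( \psi \), and every function and parameter name below are assumptions made for the example. `dec_u` stands in for the output of \( \text{Decompose}(u) \), which the summary does not define.

```python
import numpy as np

def downsample(x, d, i):
    """Scale i of x with shape (C, L): length floor(L / d**i).
    Average pooling is an assumed choice for producing f_i(x)."""
    C, L = x.shape
    step = d ** i
    n = L // step
    return x[:, : n * step].reshape(C, n, step).mean(axis=2)

def mix_scales(x, weights, d):
    """Residual mixing g_i = f_i + g_{i+1} W_i, from the coarsest scale to i = 0.
    weights[i] has shape (floor(L / d**(i+1)), floor(L / d**i))."""
    h = len(weights)
    g = downsample(x, d, h)                      # coarsest scale: g_h = f_h
    for i in range(h - 1, -1, -1):
        g = downsample(x, d, i) + g @ weights[i]
    return g

def softmax(z):
    e = np.exp(z - np.max(z))                    # exp(-inf) = 0 drops masked entries
    return e / e.sum()

def softplus(z):
    return np.logaddexp(0.0, z)                  # numerically stable log(1 + exp(z))

def selector_weights(dec_u, w_noise, k, rng):
    """S = Softmax(TopK(Softmax(Q(u)), k)) with noisy scores Q(u)."""
    psi = rng.standard_normal(dec_u.shape)       # psi ~ N(0, 1), sampled per entry (assumption)
    q = dec_u + psi * softplus(dec_u @ w_noise)
    p = softmax(q)
    masked = np.full_like(p, -np.inf)            # assumed TopK convention: mask with -inf
    top = np.argsort(p)[-k:]                     # indices of the k largest scores
    masked[top] = p[top]
    return softmax(masked)

def total_loss(y, y_hat, s, params, lam1, lam2, eps=1e-10):
    """L = L_pred + lam1 * L_selector + lam2 * ||Theta||^2."""
    l_pred = np.sum((y - y_hat) ** 2)
    l_selector = np.var(s) / (np.mean(s) ** 2 + eps)   # squared coefficient of variation of S
    l_reg = sum(np.sum(p ** 2) for p in params)
    return l_pred + lam1 * l_selector + lam2 * l_reg
```

For an input of shape `(C, 16)` with `d = 2` and two mixing matrices of shapes `(8, 16)` and `(4, 8)`, `mix_scales` returns an array of shape `(C, 16)`. The selector term is zero when all entries of `s` are equal and grows as the weights concentrate on a few patterns, so under this reading it penalizes an unbalanced selector.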