TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting

Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam
DOI: https://doi.org/10.1145/3580305.3599533
2023-12-11
Abstract: Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions. However, their high memory and computing requirements pose a critical bottleneck for long-term forecasting. To address this, we propose TSMixer, a lightweight neural architecture exclusively composed of multi-layer perceptron (MLP) modules for multivariate forecasting and representation learning on patched time series. Inspired by MLP-Mixer's success in computer vision, we adapt it for time series, addressing challenges and introducing validated components for enhanced accuracy. This includes a novel design paradigm of attaching online reconciliation heads to the MLP-Mixer backbone, for explicitly modeling the time-series properties such as hierarchy and channel-correlations. We also propose a novel Hybrid channel modeling and infusion of a simple gating approach to effectively handle noisy channel interactions and generalization across diverse datasets. By incorporating these lightweight components, we significantly enhance the learning capability of simple MLP structures, outperforming complex Transformer models with minimal computing usage. Moreover, TSMixer's modular design enables compatibility with both supervised and masked self-supervised learning methods, making it a promising building block for time-series Foundation Models. TSMixer outperforms state-of-the-art MLP and Transformer models in forecasting by a considerable margin of 8-60%. It also outperforms the latest strong benchmarks of Patch-Transformer models (by 1-2%) with a significant reduction in memory and runtime (2-3X). The source code of our model is officially released as PatchTSMixer in the HuggingFace.
Model: https://huggingface.co/docs/transformers/main/en/model_doc/patchtsmixer
Examples: https://github.com/ibm/tsfm/#notebooks-links
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address several key issues in multivariate time series forecasting:

1. **Computational Bottleneck in Long Sequence Forecasting**:
   - Although Transformers excel at capturing long sequence dependencies, their memory and computational demands pose severe bottlenecks for long-term forecasting. Despite various computational optimizations in self-attention modules, this issue remains fundamentally unresolved.
2. **Insufficient Modeling of Time Series Characteristics**:
   - The success of Transformers in natural language processing (NLP) has not fully transferred to the time series domain. While positional embeddings retain some sequential information, the self-attention mechanism is inherently permutation-invariant, leading to the loss of temporal information. Additionally, individual time points lack significant semantic information and can be easily inferred from neighboring points, so substantial modeling capacity is wasted on point-by-point details.
3. **Modeling Inter-Channel Correlations**:
   - Existing patch-based models (e.g., PatchTST) typically adopt a purely channel-independent approach and therefore do not explicitly capture inter-channel correlations. Mixing channels in the early layers instead tends to introduce noisy interactions that are difficult to disentangle at the output stage.

### Solution Overview

To address these issues, the authors propose TSMixer, a lightweight neural network architecture based on multilayer perceptrons (MLPs), specifically designed for multivariate time series forecasting. The main features of TSMixer include:

1. **Patch Division and Modular Architecture**:
   - Similar to PatchTST, TSMixer employs a patch division approach and follows a modular architecture. It learns a common "backbone" to capture the temporal dynamics of the data and then attaches and fine-tunes different "heads" for various downstream tasks (e.g., forecasting).
2. **Online Reconciliation Head**:
   - A novel design paradigm attaches and tunes online reconciliation heads on the MLP-Mixer backbone to explicitly model the hierarchical structure and inter-channel correlations of time series. Specifically, TSMixer introduces two new online reconciliation heads that leverage the inherent characteristics of time series (e.g., hierarchical aggregation and channel correlation) to improve forecasting results.
3. **Hybrid Channel Modeling**:
   - A novel "hybrid" channel modeling approach enhances a channel-independent backbone with cross-channel reconciliation heads to effectively handle channel interaction noise across different datasets.
4. **Gated Attention Mechanism**:
   - A simple gated attention mechanism guides the model to focus on important features, effectively modeling long sequence interactions without the need for complex multi-head self-attention blocks.

By incorporating these lightweight components, TSMixer significantly enhances the learning capability of a simple MLP structure, surpassing complex Transformer models while substantially reducing computational resource usage. According to the abstract, TSMixer outperforms state-of-the-art MLP and Transformer models in forecasting by 8-60%, and outperforms Patch-Transformer benchmarks by 1-2% with a 2-3X reduction in memory and runtime.
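
The patch division and gated MLP mixing described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`make_patches`, `gated_attention`, `mixer_block`, `random_layer`), the layer sizes, the random weights standing in for learned parameters, and the use of ReLU are all assumptions made for illustration, and details such as normalization, the patch embedding, and the reconciliation heads are omitted.

```python
import math
import random


def make_patches(series, patch_length, stride):
    """Split a 1-D series into (possibly overlapping) fixed-length patches."""
    last_start = len(series) - patch_length
    return [series[i:i + patch_length] for i in range(0, last_start + 1, stride)]


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(x, weight, bias):
    """Scale each feature of x by a softmax weight computed from x itself."""
    gate = softmax(linear(x, weight, bias))
    return [g * v for g, v in zip(gate, x)]


def random_layer(n_out, n_in, rng):
    """Hypothetical helper: small random weights in place of learned ones."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def mixer_block(x, mlp, gate):
    """Residual two-layer MLP followed by gating: x + gate(MLP(x))."""
    (w1, b1), (w2, b2) = mlp
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU, chosen for simplicity
    out = gated_attention(linear(hidden, w2, b2), *gate)
    return [a + b for a, b in zip(x, out)]


rng = random.Random(0)
series = [math.sin(0.3 * t) for t in range(32)]           # one channel, 32 steps
patches = make_patches(series, patch_length=8, stride=8)  # 4 patches of length 8

# Intra-patch mixing: the same MLP is applied to every patch independently.
intra = (random_layer(16, 8, rng), random_layer(8, 16, rng))
intra_gate = random_layer(8, 8, rng)
mixed = [mixer_block(p, intra, intra_gate) for p in patches]

# Inter-patch mixing: transpose so each row holds one position across patches.
num_patches = len(mixed)
inter = (random_layer(8, num_patches, rng), random_layer(num_patches, 8, rng))
inter_gate = random_layer(num_patches, num_patches, rng)
columns = [mixer_block(list(col), inter, inter_gate) for col in zip(*mixed)]
output = [list(row) for row in zip(*columns)]  # back to (patches, patch_length)

print(len(output), len(output[0]))  # 4 8
```

Because each channel is processed on its own here, the sketch corresponds to a channel-independent backbone; in the hybrid design summarized above, cross-channel information would be introduced later by a reconciliation head on top of such a backbone.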