Abstract: Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions. However, their high memory and compute requirements pose a critical bottleneck for long-term forecasting. To address this, we propose TSMixer, a lightweight neural architecture composed exclusively of multi-layer perceptron (MLP) modules for multivariate forecasting and representation learning on patched time series. Inspired by MLP-Mixer's success in computer vision, we adapt it for time series, addressing the challenges this raises and introducing empirically validated components for enhanced accuracy. This includes a novel design paradigm of attaching online reconciliation heads to the MLP-Mixer backbone to explicitly model time-series properties such as hierarchy and channel correlations. We also propose a novel Hybrid channel modeling approach and the infusion of a simple gating mechanism to handle noisy channel interactions effectively and to generalize across diverse datasets. By incorporating these lightweight components, we significantly enhance the learning capability of simple MLP structures, outperforming complex Transformer models with minimal computing usage. Moreover, TSMixer's modular design makes it compatible with both supervised and masked self-supervised learning methods, making it a promising building block for time-series Foundation Models. TSMixer outperforms state-of-the-art MLP and Transformer models in forecasting by a considerable margin of 8-60%. It also outperforms the latest strong benchmarks of Patch-Transformer models (by 1-2%) with a significant reduction in memory and runtime (2-3X). The source code of our model is officially released as PatchTSMixer on Hugging Face.
Model: https://huggingface.co/docs/transformers/main/en/model_doc/patchtsmixer
Examples: https://github.com/ibm/tsfm/#notebooks-links
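
To make the idea of MLP-only mixing on patched time series more concrete, the following is a minimal pure-Python sketch, not the released PatchTSMixer implementation. It assumes a single channel, non-overlapping patches, ReLU activations, and a softmax-weighted gate on the MLP output; the function names (`make_patches`, `gated_mlp`, `mixer_block`), the layer sizes, and the order of the two mixing steps are all hypothetical choices made for illustration. The reconciliation heads and Hybrid channel modeling mentioned in the abstract are not covered.

```python
import math
import random


def make_patches(series, patch_len, stride):
    """Split one channel (a list of floats) into fixed-length patches."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]


def init_linear(n_in, n_out, rng):
    """Randomly initialised dense layer: (weight rows, bias). Illustrative only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a dense layer to a vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(x):
    """Numerically stable softmax over a vector."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mlp(x, fc1, fc2, gate):
    """Two-layer MLP whose output is re-weighted by a softmax gate (assumed
    form of the 'simple gating'), then added back to the input (residual)."""
    hidden = [max(0.0, h) for h in linear(x, fc1)]  # ReLU, chosen for brevity
    out = linear(hidden, fc2)
    gate_weights = softmax(linear(out, gate))
    return [xi + gi * oi for xi, gi, oi in zip(x, gate_weights, out)]


def mixer_block(patches, params):
    """One mixing block: mix within each patch, then across patches."""
    # Intra-patch mixing: each patch is mixed along its own time steps.
    mixed = [gated_mlp(p, *params["intra"]) for p in patches]
    # Inter-patch mixing: transpose so each row holds one position across patches.
    columns = [list(col) for col in zip(*mixed)]
    columns = [gated_mlp(col, *params["inter"]) for col in columns]
    # Transpose back to (num_patches, patch_len).
    return [list(row) for row in zip(*columns)]


rng = random.Random(0)
series = [math.sin(0.3 * t) for t in range(32)]         # toy single-channel input
patches = make_patches(series, patch_len=8, stride=8)   # 4 patches of length 8
params = {
    "intra": (init_linear(8, 16, rng), init_linear(16, 8, rng), init_linear(8, 8, rng)),
    "inter": (init_linear(4, 8, rng), init_linear(8, 4, rng), init_linear(4, 4, rng)),
}
output = mixer_block(patches, params)
print(len(output), len(output[0]))
```

The block keeps the input shape (number of patches by patch length), so blocks of this kind can be stacked before a forecasting head. No attention over pairs of time steps is computed, which is the property the abstract credits for the reduced memory and runtime.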

Masked Mixers for Language Generation and Retrieval

Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking

TS‐Mixer: A lightweight text representation model based on context awareness

MTS-Mixers: Multivariate Time Series Forecasting via Factorized Temporal and Channel Mixing

HyperMixer: An MLP-based Low Cost Alternative to Transformers

AMixer: Adaptive Weight Mixing for Self-attention Free Vision Transformers

StableMask: Refining Causal Masking in Decoder-only Transformer

What to Hide from Your Students: Attention-Guided Masked Image Modeling

AMPLIFY: attention-based mixup for performance improvement and label smoothing in transformer

Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer

Beyond Intuition: Rethinking Token Attributions Inside Transformers

Mask Transformer: Unpaired Text Style Transfer Based on Masked Language

Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems

TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting

Isotropy-Enhanced Conditional Masked Language Models

ASMix: an Attention-based Smooth Data Augmentation Approach

MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

GroupedMixer: An Entropy Model with Group-wise Token-Mixers for Learned Image Compression

Mixhead: Breaking the Low-Rank Bottleneck in Multi-Head Attention Language Models

MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning