Abstract: Transformer-based models have long been the primary focus of research on time series forecasting. However, recently introduced high-performance linear models have cast doubt on the effectiveness of the Transformer architecture for this task. Most Transformer variants represent time series with point-wise tokenization, which does not give the attention mechanism sufficient semantic information. PatchTST expands the receptive field through patch-wise tokenization, mitigating this lack of information. However, in multivariate forecasting tasks it does not consider how delays and correlations between variates affect prediction performance. The recently proposed iTransformer addresses the misalignment between variates by employing series-wise tokenization, yet its embedding method is limited to shallow temporal feature representation. In this work, we propose the Temporal Feature Enhanced Transformer (TFEformer), which deeply integrates patch-wise and series-wise tokenization to enhance the temporal representation of multivariate tokens. Furthermore, we introduce a multi-scale patch fusion mechanism that captures and adaptively integrates temporal features across multiple resolutions. We also enhance the FFN module to serve as a temporal feature extractor and introduce variate-wise attention to capture the correlations between variates. Extensive experiments on eight real-world datasets demonstrate that TFEformer outperforms existing models, achieving state-of-the-art performance. The experiments also show that TFEformer gives Transformer-based models superior generalization ability, better utilization of extended lookback windows, and effective suppression of distribution shifts.
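
The three tokenization schemes contrasted in the abstract differ mainly in what a single token contains. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the window length of 96, the 7 variates, and the patch length and stride are all illustrative assumptions. It shows only how the raw lookback window is sliced into tokens before any embedding, and it does not attempt to reproduce how TFEformer fuses the patch-wise and series-wise views.

```python
# Illustrative only: hypothetical helper names and arbitrary sizes.
# x is a lookback window stored as T time steps, each a list of N variate values.

def pointwise_tokens(x):
    """One token per time step; each token holds the N variate values at that step."""
    return [list(step) for step in x]

def serieswise_tokens(x):
    """One token per variate; each token holds that variate's whole window (length T)."""
    return [list(series) for series in zip(*x)]

def patchwise_tokens(x, patch_len, stride):
    """Per variate, a sequence of sliding patches of length patch_len.

    No end padding is applied here, so trailing points that do not fill
    a complete patch are dropped (a simplifying assumption of this sketch).
    """
    T = len(x)
    starts = range(0, T - patch_len + 1, stride)
    return [[s[i:i + patch_len] for i in starts] for s in serieswise_tokens(x)]

T, N = 96, 7  # assumed window length and number of variates
x = [[float(t * N + n) for n in range(N)] for t in range(T)]

point = pointwise_tokens(x)
series = serieswise_tokens(x)
patch = patchwise_tokens(x, patch_len=16, stride=8)

print(len(point), len(point[0]))                   # 96 7
print(len(series), len(series[0]))                 # 7 96
print(len(patch), len(patch[0]), len(patch[0][0])) # 7 11 16
```

With these assumed sizes, point-wise tokenization yields 96 tokens that each carry a single time step, series-wise tokenization yields 7 tokens that each carry a full series, and patch-wise tokenization yields 11 local patches of 16 points for each of the 7 variates.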

DTA: Deformable Temporal Attention for Video Recognition

TFEformer: Temporal Feature Enhanced Transformer for Multivariate Time Series Forecasting

Towards Robust Video Instance Segmentation with Temporal-Aware Transformer

TDViT: Temporal Dilated Video Transformer for Dense Video Tasks

Effective and Robust: A Discriminative Temporal Learning Transformer for Satellite Videos

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

Dilated Transformer with Feature Aggregation Module for Action Segmentation

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Temporal Deformable Transformer for Action Localization

Time Is MattEr: Temporal Self-supervision for Video Transformers

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

Space or time for video classification transformers

Efficient Video Transformers via Spatial-Temporal Token Merging for Action Recognition

Long-Term Pre-training for Temporal Action Detection with Transformers

Temporal Transformer Networks with Self-Supervision for Action Recognition

Do We Really Need Temporal Convolutions in Action Segmentation?

Efficient Video Transformers with Spatial-Temporal Token Selection

TDN: Temporal Difference Networks for Efficient Action Recognition

Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos