Abstract: This paper explores how to enhance existing masked time-series modeling by randomly dropping sub-sequence-level patches of time series. On this basis, a simple yet effective method named DropPatch is proposed, which has two notable advantages: 1) it improves pre-training efficiency by a square-level (quadratic) factor; 2) it provides additional benefits for modeling in in-domain, cross-domain, few-shot learning and cold-start scenarios. The paper conducts comprehensive experiments to verify the effectiveness of the method and to analyze its internal mechanism. Empirically, DropPatch strengthens the attention mechanism, reduces information redundancy and serves as an efficient means of data augmentation. Theoretically, it is proved that randomly dropping patches slows down the rate at which Transformer representations collapse into a rank-1 linear subspace, thus improving the quality of the learned representations.
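
To make the "square-level" efficiency claim concrete, here is a back-of-the-envelope estimate. It assumes (the abstract does not spell this out) that the dominant cost is self-attention over $N$ patch tokens of width $d$, and it uses a hypothetical drop ratio $r$ introduced only for illustration:

$$
C_{\text{attn}}(N) \propto N^{2} d,
\qquad
\frac{C_{\text{attn}}\big((1-r)N\big)}{C_{\text{attn}}(N)} = (1-r)^{2}.
$$

With an illustrative $r = 0.5$, the attention cost falls to $(1-0.5)^2 = 0.25$ of its original value. Layers whose cost grows only linearly in $N$ (for example the feed-forward blocks) shrink by $1-r$ rather than $(1-r)^2$, so the end-to-end saving would be somewhat smaller than this estimate suggests.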

What problem does this paper attempt to address?

The problem this paper addresses is that existing masked time-series modeling methods (such as PatchTST) are limited in their ability to learn useful features. The limitations show up in two ways:

1. **Low mask ratio leads to shallow learning**: With a low mask ratio, the model does not learn deep features of the time series. It only recovers surface patterns, which results in overfitting.
2. **High mask ratio leads to distracted attention**: A high mask ratio dilutes the attention mechanism, making it hard to focus on the relevant and important parts of the data and reducing performance on downstream tasks.

To solve these problems, the authors propose DropPatch, a simple and effective strategy. It enhances existing masked time-series modeling during the pre-training stage by randomly dropping subsequence-level segments (patches) of the time series.

### Main advantages of DropPatch:

- **Improves pre-training efficiency**: By reducing the number of patches to be processed, it significantly improves computational efficiency and reduces memory consumption.
- **Enhances the attention mechanism**: It lets the attention mechanism focus more on multi-scale and diverse information, thereby capturing more critical patterns.
- **Reduces information redundancy**: By randomly dropping patches, it reduces redundancy in the representation and improves the quality of the learned representation.

### Theoretical and empirical support:

- **Theoretical analysis**: It is proved that randomly dropping patches slows down the rate at which the Transformer representation converges to a rank-1 linear subspace, thereby promoting feature diversity.
- **Experimental evidence**: The effectiveness of DropPatch is verified through extensive experiments in different scenarios, including in-domain, cross-domain, few-shot learning, and cold-start tasks.

### Summary:

The paper proposes a new pre-training strategy, DropPatch, which aims to overcome the limitations of existing masked time-series modeling methods by randomly dropping subsequence-level patches of the time series. The method not only improves pre-training efficiency but also yields significant performance improvements on multiple downstream tasks.
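
The idea summarized above, dropping patches and then masking what is left, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `patchify` and `drop_then_mask`, the order of operations (drop first, then mask the kept patches), zero-filling of masked patches, and all ratio values are assumptions made for illustration.

```python
import numpy as np


def patchify(series, patch_len, stride):
    """Split a 1-D series into an array of shape (num_patches, patch_len)."""
    num_patches = (len(series) - patch_len) // stride + 1
    starts = stride * np.arange(num_patches)[:, None]
    offsets = np.arange(patch_len)[None, :]
    return series[starts + offsets]


def drop_then_mask(patches, drop_ratio, mask_ratio, rng):
    """Randomly drop patches, then mask a fraction of the kept ones.

    Assumed behaviour (hypothetical, for illustration only):
    - dropped patches are removed entirely, so the encoder never sees them;
    - masked patches are zero-filled and serve as reconstruction targets.
    """
    num_patches = patches.shape[0]
    num_keep = num_patches - int(num_patches * drop_ratio)

    # Keep a random subset; sorting preserves temporal order so the
    # original positions (keep_idx) can still drive positional encodings.
    keep_idx = np.sort(rng.choice(num_patches, size=num_keep, replace=False))
    kept = patches[keep_idx]

    # Choose which of the kept patches to mask.
    num_mask = int(num_keep * mask_ratio)
    mask = np.zeros(num_keep, dtype=bool)
    mask[rng.choice(num_keep, size=num_mask, replace=False)] = True

    model_input = kept.copy()
    model_input[mask] = 0.0  # zero-fill masked patches
    return model_input, kept, mask, keep_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    series = np.sin(np.linspace(0.0, 20.0, 512))
    patches = patchify(series, patch_len=12, stride=12)
    model_input, target, mask, keep_idx = drop_then_mask(
        patches, drop_ratio=0.5, mask_ratio=0.4, rng=rng
    )
    print(patches.shape, model_input.shape, int(mask.sum()))
```

In this sketch the encoder would receive only `model_input`, which holds fewer patches than the full sequence. That shorter sequence is where the efficiency gain described above would come from. A reconstruction loss would then be computed on `target[mask]` only.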

Enhancing Masked Time-Series Modeling via Dropping Patches

TFEformer: Temporal Feature Enhanced Transformer for Multivariate Time Series Forecasting

Learning to Embed Time Series Patches Independently

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

ShadowMaskFormer: Mask Augmented Patch Embeddings for Shadow Removal

Bootstrap Masked Visual Modeling via Hard Patches Mining

Masking Augmentation for Supervised Learning

TLM: Token-Level Masking for Transformers

Masked Autoencoders for Point Cloud Self-supervised Learning

TimeMAE: Self-Supervised Representations of Time Series with Decoupled Masked Autoencoders

UniDrop: A Simple Yet Effective Technique to Improve Transformer Without Extra Cost

Fast Training of Diffusion Models with Masked Transformers

DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

PointDrop: Improving Object Detection from Sparse Point Clouds Via Adversarial Data Augmentation

Dateformer: Time-modeling Transformer for Longer-term Series Forecasting

Learning Pattern-Specific Experts for Time Series Forecasting Under Patch-level Distribution Shift

Revisiting Token Dropping Strategy in Efficient BERT Pretraining

Dropframe Scheme in Recurrent Neural Networks for Time Series Modeling

Point-MPP: Point Cloud Self-Supervised Learning from Masked Position Prediction

HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting

Randomness Regularization with Simple Consistency Training for Neural Networks