Extrapolatable Transformer Pre-training for Ultra Long Time-Series Forecasting

Ziyang Song, Qincheng Lu, Hao Xu, David L. Buckeridge, Yue Li
2024-02-15
Abstract: Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success in the Natural Language Processing and Computer Vision domains. However, the development of PTMs on time-series data is lagging behind. This underscores the limitations of the existing transformer-based architectures, particularly their scalability to handle large-scale data and ability to capture long-term temporal dependencies. In this study, we present Timely Generative Pre-trained Transformer (TimelyGPT). TimelyGPT employs an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations. It also integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies. Our experiments show that TimelyGPT excels in modeling continuously monitored biosignals and irregularly sampled time-series data commonly observed in longitudinal electronic health records (EHRs). In an ultra-long-term forecasting experiment, TimelyGPT achieves accurate extrapolation up to 6,000 timesteps of body temperature during the sleep stage transition given a short look-up window (i.e., prompt) containing only 2,000 timesteps. We further demonstrate TimelyGPT's forecasting capabilities on a preprocessed longitudinal healthcare administrative database called PopHR, consisting of 489,000 patients randomly sampled from the Montreal population. Together, we envision TimelyGPT to be useful in a broad spectrum of health domains, including long-term patient health state forecasting and patient risk trajectory prediction.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper targets two weaknesses of existing Transformer architectures on time-series data: poor scalability to large-scale data and difficulty capturing long-term temporal dependencies. Specifically:

1. **Processing large-scale time-series data**: Existing Transformer models scale poorly to large time-series datasets and struggle with long sequences.
2. **Capturing long-term temporal dependencies**: Conventional Transformer models capture long-range dependencies in time series poorly, especially when ultra-long-term forecasting is required.

To address these problems, the authors propose the Timely Generative Pre-trained Transformer (TimelyGPT), which aims to improve pre-training on time-series data and to strengthen ultra-long-term forecasting. Its main innovations are:

1. **Extrapolatable position embedding (xPos)**:
   - TimelyGPT uses an extrapolatable position embedding, xPos, to encode trend and periodic patterns in time series. This encoding helps capture long-term dependencies and supports ultra-long-term forecasting.
   - xPos encodes relative distance through rotation matrices and exponential decay, where \( e^{i\theta} \) supplies the rotation and \( \gamma \) the decay. The product of a query at position \( n \) and a key at position \( m \) therefore depends only on the offset \( n - m \):
   \[ \tilde{Q}_n \tilde{K}_m = X_n W_Q (\gamma e^{i\theta})^{n - m} X_m W_K = \gamma^{n - m} \hat{Q}_n \hat{K}_m \]
   \[ \hat{Q}_n = X_n W_Q e^{i\theta n}, \quad \hat{K}_m = X_m W_K e^{-i\theta m} \]
2. **Recurrent attention mechanism (Retention)**:
   - TimelyGPT integrates a recurrent attention mechanism, Retention, which handles both continuously sampled and irregularly sampled time series. Retention models the series in RNN form and captures sequential dependencies with constant inference complexity per time step.
   - For irregularly sampled time series, Retention adapts to varying time intervals between observations by adjusting the decay matrix \( D \), where \( \Delta t_{n,m} \) is the time gap between observations \( n \) and \( m \):
   \[ \mathrm{Ret}(X) = (Q K^\top \odot D) V, \quad D_{nm} = \begin{cases} \gamma^{\Delta t_{n,m}}, & n \geq m \\ 0, & n < m \end{cases} \]
3. **Temporal convolution module**:
   - TimelyGPT adds a temporal convolution module to extract local features from time series. Using depth-wise separable convolutions, the module extracts multi-scale features across multiple decoder layers, which strengthens the interaction between global and local features.
4. **Large-scale pre-training**:
   - TimelyGPT is pre-trained on large-scale unlabeled time-series data with a next-token prediction task. During pre-training, the input sequence is shifted right and the model learns temporal dependencies by predicting the subsequent tokens.

The experimental results show that TimelyGPT performs well on ultra-long-term forecasting, particularly on continuously monitored biosignals and on irregularly sampled time series from electronic health records (EHRs).
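The Retention formula with the time-aware decay matrix can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`decay_matrix`, `retention_parallel`, `retention_recurrent`) are hypothetical, the recurrent update is the usual RetNet-style state recurrence assumed here to match the "RNN form" mentioned in the summary, and multi-head structure, normalization, and the xPos rotation are all omitted.

```python
import numpy as np

def decay_matrix(t, gamma):
    """D[n, m] = gamma ** (t[n] - t[m]) for n >= m, and 0 for n < m."""
    dt = t[:, None] - t[None, :]                      # pairwise time gaps
    causal = np.tril(np.ones((len(t), len(t)), dtype=bool))
    # Zero out gaps above the diagonal before exponentiating to avoid overflow.
    return np.where(causal, gamma ** np.where(causal, dt, 0.0), 0.0)

def retention_parallel(Q, K, V, t, gamma):
    """Parallel form: Ret(X) = (Q K^T * D) V, with elementwise masking by D."""
    return ((Q @ K.T) * decay_matrix(t, gamma)) @ V

def retention_recurrent(Q, K, V, t, gamma):
    """Assumed recurrent form: one fixed-size state update per observation."""
    state = np.zeros((K.shape[1], V.shape[1]))
    out = np.zeros((len(t), V.shape[1]))
    prev_t = t[0]
    for n in range(len(t)):
        # Decay the state by the elapsed time, then add the new key-value pair.
        state = gamma ** (t[n] - prev_t) * state + np.outer(K[n], V[n])
        out[n] = Q[n] @ state
        prev_t = t[n]
    return out

# Toy example with irregular timestamps (all values are made up).
rng = np.random.default_rng(0)
t = np.cumsum(rng.uniform(0.5, 3.0, size=6))          # increasing, uneven gaps
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
diff = np.abs(retention_parallel(Q, K, V, t, 0.9)
              - retention_recurrent(Q, K, V, t, 0.9)).max()
print("max abs difference between the two forms:", diff)
```

In this sketch the two forms agree up to floating-point error. The recurrent version keeps only a fixed-size state matrix per step, which is what makes the per-step inference cost independent of sequence length.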