Abstract: This technical report describes the Time Series Optimized Transformer for Observability (Toto), a new state-of-the-art foundation model for time series forecasting developed by Datadog. In addition to advancing the state of the art on generalized time series benchmarks in domains such as electricity and weather, this model is the first general-purpose time series forecasting foundation model to be specifically tuned for observability metrics. Toto was trained on a dataset of one trillion time series data points, the largest among all currently published time series foundation models. Alongside publicly available time series datasets, 75% of the data used to train Toto consists of fully anonymous numerical metric data points from the Datadog platform. In our experiments, Toto outperforms existing time series foundation models on observability data. It does this while also excelling at general-purpose forecasting tasks, achieving state-of-the-art zero-shot performance on multiple open benchmark datasets.

What problem does this paper attempt to address?

This paper introduces Toto (Time Series Optimized Transformer for Observability), a Transformer-based foundation model for time series forecasting that is tuned specifically for observability metrics. Toto was trained on one trillion data points, the largest training dataset among currently published time series foundation models, and 75% of that data consists of anonymous numerical metric data from the Datadog platform. Toto performs well on observability data and also achieves state-of-the-art zero-shot performance on general time series forecasting tasks. It introduces three key innovations:
1. Proportional factorized space-time attention, which groups multivariate time series features efficiently, reducing computational cost while maintaining high accuracy.
2. A Student-T mixture model head, which models the forecast as a probability distribution in order to capture complex time series dynamics better than traditional approaches.
3. Domain-specific training data: in addition to multi-domain time series data, Toto is pre-trained on Datadog observability metrics, which improves its forecasts for time series with the distinctive characteristics of that domain.
The paper shows that Toto outperforms existing time series foundation models on observability data and achieves the best zero-shot forecasting performance on multiple open benchmark datasets. Its architecture is designed for real-time analysis and efficient scaling to large volumes of data, making it well suited to the high-frequency, high-dimensional data common in observability metrics. The paper also discusses the limitations of traditional models such as ARIMA and exponential smoothing, and how pre-training lets Transformer models become powerful tools for time series forecasting.
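To make the computational argument for factorized attention concrete, the following rough sketch counts pairwise attention scores under two layouts. It assumes that "factorized" means attention is applied along the time axis and along the variate axis separately, rather than jointly over every (variate, time step) pair; the summary above does not spell this out. The function names, the block-count parameters, and the example sizes are all hypothetical and are not taken from the paper.

```python
def full_attention_pairs(num_variates: int, num_steps: int) -> int:
    """Pairwise scores when every (variate, step) token attends to every other token."""
    tokens = num_variates * num_steps
    return tokens * tokens


def factorized_attention_pairs(num_variates: int, num_steps: int,
                               time_blocks: int = 1, space_blocks: int = 1) -> int:
    """Pairwise scores when attention is split by axis (an assumed layout).

    A time-wise block attends within each variate's own series, giving
    num_variates independent (num_steps x num_steps) score matrices.
    A space-wise block attends across variates at each time step, giving
    num_steps independent (num_variates x num_variates) score matrices.
    """
    time_cost = num_variates * num_steps * num_steps
    space_cost = num_steps * num_variates * num_variates
    return time_blocks * time_cost + space_blocks * space_cost


if __name__ == "__main__":
    v, t = 32, 512  # hypothetical sizes, chosen only for illustration
    full = full_attention_pairs(v, t)
    split = factorized_attention_pairs(v, t)
    print(f"full: {full}, factorized: {split}, ratio: {full / split:.1f}x")
```

The count ignores projections, feed-forward layers, and constant factors, so it only indicates how the score computation scales. Under these assumptions, raising `time_blocks` relative to `space_blocks` shifts compute toward temporal modeling, which is one plausible reading of "proportional".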
Toto addresses challenges in observability data such as high temporal resolution, sparsity, extreme dynamic range, and non-stationarity through its attention mechanism and probabilistic prediction head, which together yield more accurate and efficient predictions.
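As an illustration of why a Student-T mixture suits data with extreme dynamic range, here is a minimal sketch of the log-density such a head could be trained against. It uses the standard location-scale Student-T density and a log-sum-exp over components. The function names and parameter values are hypothetical, and this is not Toto's actual output layer.

```python
import math


def student_t_logpdf(x: float, df: float, loc: float, scale: float) -> float:
    """Log-density of a location-scale Student-T distribution at x."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))


def mixture_logpdf(x, weights, dfs, locs, scales) -> float:
    """Log-density of a Student-T mixture; weights are assumed to sum to 1."""
    terms = [math.log(w) + student_t_logpdf(x, d, m, s)
             for w, d, m, s in zip(weights, dfs, locs, scales)]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def gaussian_logpdf(x: float, loc: float, scale: float) -> float:
    """Gaussian log-density, included only as a thin-tailed comparison."""
    z = (x - loc) / scale
    return -0.5 * math.log(2 * math.pi) - math.log(scale) - 0.5 * z * z


if __name__ == "__main__":
    # Hypothetical two-component mixture: a narrow mode and a wide mode.
    params = dict(weights=[0.7, 0.3], dfs=[3.0, 5.0],
                  locs=[0.0, 2.0], scales=[1.0, 4.0])
    spike = 10.0  # an outlier-like observation
    print("mixture :", mixture_logpdf(spike, **params))
    print("gaussian:", gaussian_logpdf(spike, 0.0, 1.0))
```

A training loss would be the negative of `mixture_logpdf` averaged over observed points. Because the Student-T tails decay polynomially rather than exponentially, a spike far from the mode is penalized far less than under a Gaussian of the same scale, which is the property that matters for sparse, bursty metrics.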

Toto: Time Series Optimized Transformer for Observability

TFEformer: Temporal Feature Enhanced Transformer for Multivariate Time Series Forecasting

ETSformer: Exponential Smoothing Transformers for Time-series Forecasting

TGTOD: A Global Temporal Graph Transformer for Outlier Detection at Scale

ExoTST: Exogenous-Aware Temporal Sequence Transformer for Time Series Prediction

Enhancing Time Series Forecasting: A Hierarchical Transformer with Probabilistic Decomposition Representation

Unified Training of Universal Time Series Forecasting Transformers

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Physically-guided Temporal Diffusion Transformer for Long-Term Time Series Forecasting

NTDformer: A Multi-Scale Forecasting Model for Non-Stationary Multilevel Time Series

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Lag-Llama: Towards Foundation Models for Time Series Forecasting

tsGT: Stochastic Time Series Modeling With Transformer

Probabilistic Decomposition Transformer for Time Series Forecasting

TODS: An Automated Time Series Outlier Detection System

Sparse transformer with local and seasonal adaptation for multivariate time series forecasting

Dateformer: Time-modeling Transformer for Longer-term Series Forecasting

Test Time Learning for Time Series Forecasting

A Temporal Kolmogorov-Arnold Transformer for Time Series Forecasting

The Tiny Time-series Transformer: Low-latency High-throughput Classification of Astronomical Transients using Deep Model Compression