A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series

Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova
2025-01-16
Abstract: Benchmarking anomaly detection approaches for multivariate time series is challenging due to the lack of high-quality datasets. Current publicly available datasets are too small, not diverse and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a small selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data.
Machine Learning; Artificial Intelligence; Computational Engineering, Finance, and Science; Systems and Control
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of high-quality datasets for evaluating online unsupervised anomaly detection methods for multivariate time series. Existing public datasets have the following problems:

1. **Small scale**: The datasets are too small to provide enough diversity.
2. **Lack of diversity**: The time series they contain are homogeneous and do not cover complex real-world scenarios.
3. **Simple anomalies**: The anomalies are too simple and do not reflect the complexity found in practical applications.

These problems limit substantial progress in this research field. The authors therefore propose a diverse, extensive and non-trivial dataset (the PATH dataset), generated with state-of-the-art simulation tools, that reflects the multivariate, dynamic and variable-state characteristics of an automotive powertrain. Specifically, the dataset addresses the following points:

- **Lack of diversity**: Diversity is ensured by introducing multiple driving cycles and random initial conditions (such as battery temperature and state of charge).
- **Lack of complexity in anomalies**: Complexity and authenticity of the anomalies are ensured by simulating six different types of anomalies (such as turning off regenerative braking or increasing headwind resistance).
- **Online detection requirements**: Different versions of the dataset, with contaminated and clean training subsets, support unsupervised and semi-supervised anomaly detection settings as well as time series generation and forecasting tasks.

### Key contributions

1. **High-quality dataset**: The paper proposes a new dataset named PATH, which has high complexity and realism and better reflects practical application scenarios.
2. **Diverse anomaly types**: The dataset introduces multiple anomaly types, including subsequence anomalies and full-sequence anomalies, to make it more challenging.
3. **Baseline experiment results**: The paper provides baseline results from deterministic and variational autoencoders and a non-parametric method to verify the effectiveness of the dataset.

Through these contributions, the paper provides a more reliable and more challenging benchmark for research on multivariate time series anomaly detection.
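The distinction between contaminated and clean training subsets can be illustrated with a minimal Python sketch. It uses random toy sequences, not the PATH data; the function name `make_training_versions`, the `is_anomalous` flags and the three-channel layout are all hypothetical and say nothing about how the dataset is actually stored or loaded.

```python
import numpy as np


def make_training_versions(sequences, is_anomalous):
    """Split labelled sequences into two training sets (hypothetical helper).

    contaminated: every sequence, anomalous ones included (unsupervised setting).
    clean: only sequences flagged as normal (semi-supervised setting).
    """
    contaminated = list(sequences)
    clean = [seq for seq, bad in zip(sequences, is_anomalous) if not bad]
    return contaminated, clean


rng = np.random.default_rng(0)

# Toy stand-in for discrete sequences: 10 variable-length recordings,
# each with 3 channels (assumed shape: time steps x channels).
sequences = [rng.normal(size=(rng.integers(50, 100), 3)) for _ in range(10)]

# Assumed ground truth: sequences 2 and 7 contain an anomaly.
is_anomalous = [i in (2, 7) for i in range(10)]

contaminated, clean = make_training_versions(sequences, is_anomalous)
print(len(contaminated), len(clean))  # 10 8
```

In this sketch an unsupervised approach would be fitted on `contaminated` without access to the flags, while a semi-supervised approach would be fitted on `clean`; both would be evaluated on the same held-out test sequences.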