CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting

Josef Koumar, Karel Hynek, Tomáš Čejka, Pavel Šiška
2024-09-28
Abstract: Anomaly detection in network traffic is crucial for maintaining the security of computer networks and identifying malicious activities. Forecasting-based methods are one of the primary approaches to anomaly detection. Nevertheless, extensive real-world network datasets for forecasting and anomaly detection techniques are missing, potentially causing performance overestimation of anomaly detection algorithms. This manuscript addresses this gap by introducing a dataset comprising time series data of network entities' behavior, collected from the CESNET3 network. The dataset was created from 40 weeks of network traffic of 275 thousand active IP addresses. The ISP origin of the presented data ensures a high level of variability among network entities, which forms a unique and authentic challenge for forecasting and anomaly detection models. It provides valuable insights into the practical deployment of forecast-based anomaly detection approaches.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper attempts to address is the lack of real-world, long-term datasets for network traffic anomaly detection. Specifically:

1. **Lack of Real Datasets**: Many datasets currently used to evaluate network traffic anomaly detection and forecasting techniques are synthetic. These datasets cannot fully reflect the complexity and diversity of real-world traffic, which may lead to an overestimation of algorithm performance.
2. **Dataset Limitations**: Existing real-world datasets are often short in duration or do not contain enough types of network entities, which limits comprehensive evaluation of network traffic forecasting and anomaly detection methods.
3. **Privacy Issues**: For privacy reasons, many real-world datasets cannot be made public, which further exacerbates the problems above.

To address these issues, the paper introduces a new dataset called CESNET-TimeSeries24. This dataset contains network traffic time series data from over 275,000 active IP addresses, collected over 40 weeks from the Czech Education and Scientific Network (CESNET3). The characteristics of the dataset include:

- **High Variability**: The data comes from an ISP network, ensuring a high diversity of network entities.
- **Comprehensive Coverage**: The dataset includes all types of network anomalies identified by Chandola et al. and Basdekidou et al.
- **Multiple Time Scales**: It provides aggregated data at 10-minute, 1-hour, and 1-day intervals to suit different application scenarios.
- **Anonymization**: The data has been rigorously anonymized to ensure user privacy is not compromised.

With this dataset, researchers can better evaluate and improve forecasting-based network traffic anomaly detection methods, thereby enhancing network security.
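To make the idea of forecast-based anomaly detection on aggregated time series concrete, here is a minimal Python sketch. It is not the paper's method and does not use the dataset's actual files or schema. Everything in it is an assumption made for illustration: the synthetic traffic values, aggregation by summation, the seasonal-naive forecaster (predict the value from one day earlier), the median-absolute-deviation threshold, and the function names `aggregate`, `seasonal_naive_forecast`, and `detect_anomalies`.

```python
from statistics import median


def aggregate(values, factor):
    """Sum consecutive groups of `factor` samples (e.g. six 10-minute bins per hour).

    Summation is an assumption; some metrics may call for a mean or max instead.
    A trailing incomplete group is dropped.
    """
    usable = len(values) - len(values) % factor
    return [sum(values[i:i + factor]) for i in range(0, usable, factor)]


def seasonal_naive_forecast(values, period):
    """Forecast each point as the value observed one `period` earlier.

    Points without a full period of history get None.
    """
    return [values[i - period] if i >= period else None
            for i in range(len(values))]


def detect_anomalies(values, period, k=5.0):
    """Return indices whose forecast error exceeds median + k * MAD of all errors."""
    forecast = seasonal_naive_forecast(values, period)
    errors = {i: abs(v - f)
              for i, (v, f) in enumerate(zip(values, forecast))
              if f is not None}
    if not errors:
        return []
    med = median(errors.values())
    mad = median(abs(e - med) for e in errors.values())
    # Guard against a zero MAD on perfectly periodic data.
    threshold = med + k * max(mad, 1e-9)
    return [i for i, e in errors.items() if e > threshold]


# Synthetic stand-in for one IP address: 4 days of 10-minute samples
# (144 per day) with higher traffic during daytime slots.
ten_min = [100 + (50 if 48 <= slot < 120 else 0)
           for _day in range(4) for slot in range(144)]

hourly = aggregate(ten_min, 6)    # 10-minute -> 1-hour
daily = aggregate(hourly, 24)     # 1-hour -> 1-day

hourly[75] += 5000                # inject a spike on the last day, hour 3
anomalies = detect_anomalies(hourly, period=24)
print(anomalies)
```

One design caveat: a seasonal-naive forecaster also produces a large error one period after any spike, because the spike itself becomes the next forecast. The spike is placed on the last day here to keep the example free of that echo; a real evaluation would need a more robust forecaster or explicit handling of this effect.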