DongTing: A large-scale dataset for anomaly detection of the Linux kernel

Guoyun Duan, Yuanzhi Fu, Minjie Cai, Hao Chen, Jianhua Sun
DOI: https://doi.org/10.1016/j.jss.2023.111745
IF: 3.5
2023-05-10
Journal of Systems and Software
Abstract: Host-based intrusion detection systems (HIDS) can automatically identify adversarial applications by learning models from system events that represent normal system behaviors. The system call is the only way for applications to interact with the operating system (OS). Thus, system call sequences are traditionally used in HIDS to train models to detect novel attacks, and a wide range of datasets has been proposed for this task. However, existing datasets are either built for user-level applications (not for OS kernels) or completely outdated (proposed more than 20 years ago). To address this issue, this paper presents the first large-scale dataset specifically assembled for anomaly detection of the Linux kernel. Creating such a dataset is challenging because it is difficult both to collect a diversified set of programs that can trigger bugs in the kernel and to trace events that may crash the kernel at runtime. In this paper, we describe in detail how to collect the data through an automated and efficient framework. The raw dataset is 85 GB in size and contains 18,966 system call sequences, each labeled as normal or abnormal. Our dataset covers more than 200 kernel versions (including major/minor releases and revisions) and 3,600 bug-triggering programs from the past five years. In addition, we conduct a cross-dataset evaluation to demonstrate that training on our dataset yields better generalization than training on other related datasets, and we provide benchmark results for anomaly detection of the Linux kernel on our dataset. Our extensive dataset is useful both for machine learning researchers focusing on algorithmic optimizations and for practitioners in kernel development who are interested in deploying deep learning models in OS kernels.
computer science, theory & methods, software engineering
What problem does this paper attempt to address?