Machine Learning and Big Data Processing for Cybersecurity Data Analysis

Igor Kotenko, Igor Saenko, Alexander Branitskiy
DOI: https://doi.org/10.1007/978-3-030-38788-4_4
2020-01-01
Abstract: The chapter presents an approach to cybersecurity data analysis that combines a set of machine learning methods with Big Data technologies for network attack and anomaly detection. The approach comprises several layers of data processing: extraction and decomposition of datasets, compression of feature vectors, training, and classification. Principal component analysis is applied to reduce the dimension of the analyzed feature vectors. Several binary classifiers then analyze the compressed vectors: support vector machine, k-nearest neighbors, Gaussian naïve Bayes, artificial neural network, and decision tree. To increase the precision of attack detection, these classifiers are combined into a single weighted ensemble, constructed using weighted voting, soft voting, AdaBoost, and majority voting. Two architectures of a distributed intrusion detection system based on Big Data technologies are used. In the first, parallel data processing is achieved by splitting the data into several non-intersecting subsets and assigning a separate parallel thread to each resulting chunk. In the second, several client-sensors and a server-collector are used, where each sensor contains several network analyzers and a balancer. The efficiency of the approach is evaluated experimentally on two datasets: one with Internet of Things traffic that includes several classes of attacks, and one with computer network traffic that contains host scanning and DDoS attacks.
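The voting schemes and the chunk-per-thread processing named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names (`majority_vote`, `weighted_vote`, `soft_vote`, `split_chunks`, `classify_parallel`), the 0/1 label encoding (1 = attack), and the 0.5 threshold are all assumptions chosen for illustration, and AdaBoost, PCA, and the classifiers themselves are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def majority_vote(labels):
    """Return 1 (attack) if more than half of the binary votes are 1."""
    return int(sum(labels) * 2 > len(labels))


def weighted_vote(labels, weights):
    """Return 1 if the classifiers voting 'attack' hold more than half of the total weight."""
    attack_weight = sum(w for y, w in zip(labels, weights) if y == 1)
    return int(attack_weight * 2 > sum(weights))


def soft_vote(probabilities, threshold=0.5):
    """Return 1 if the mean predicted attack probability exceeds the (assumed) threshold."""
    return int(sum(probabilities) / len(probabilities) > threshold)


def split_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous, non-intersecting subsets of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def classify_parallel(vectors, classify, n_workers=4):
    """Classify each chunk in its own worker thread and return labels in input order."""
    chunks = split_chunks(vectors, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map yields results in the order of the submitted chunks.
        results = pool.map(lambda chunk: [classify(v) for v in chunk], chunks)
    return [label for chunk_labels in results for label in chunk_labels]
```

For example, with hypothetical votes `[1, 0, 1, 0, 0]` from five classifiers, `majority_vote` returns 0, whereas `weighted_vote` with weights `[0.9, 0.2, 0.8, 0.3, 0.3]` returns 1 because the two classifiers voting "attack" carry most of the weight. In CPython, threads mainly help when the per-vector work releases the global interpreter lock, so a process pool may suit CPU-bound classifiers better.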