An Efficient Traffic Classification Scheme Using Embedded Feature Selection and LightGBM

Yanpei Hua
DOI: https://doi.org/10.1109/ictc49638.2020.9123302
2020-05-01
Abstract: Machine Learning (ML) techniques have been widely used in anomaly-based Intrusion Detection Systems (IDS) in the big data era. Although many advanced approaches have been proposed recently, several key limitations remain. First, data pre-processing methods have not received sufficient attention, even though they are vital to the efficiency of model training, especially with a massive volume of collected samples. Second, although deep learning can capture more hidden interrelations in the input data, it usually suffers from high complexity, with numerous parameters to tune and long training times. Third, the KDD99 data set and its variants are used by most of the literature for traffic classification, but they have been shown to be outdated and inadequate for IDS evaluation. To cope with these challenges, in this paper we first propose a data pre-processing approach that combines under-sampling and embedded feature selection, in order to relieve the imbalance of traffic samples and extract the dominant features of incoming flows. Then, we use LightGBM to build a traffic classification approach for IDS with better accuracy and efficiency. Finally, we evaluate the proposed approaches on CIC-IDS2018, a data set released in 2018 that contains comprehensive real network traffic. Extensive experiments confirm the advantages of the proposed approaches over several comparison methods.
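The under-sampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which under-sampling strategy is used, so the code below assumes plain random under-sampling that caps every class at a fixed size. The function name `undersample`, the `max_per_class` parameter, and the toy class labels are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, labels, max_per_class, seed=0):
    """Randomly keep at most `max_per_class` samples from each class.

    Assumption: simple random under-sampling; the paper's exact
    strategy is not given in the abstract.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Down-sample only the classes that exceed the cap.
    kept = []
    for indices in by_class.values():
        if len(indices) > max_per_class:
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)

    kept.sort()  # preserve the original ordering of the retained flows
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy, heavily imbalanced data set (hypothetical labels and features).
labels = ["Benign"] * 950 + ["DoS"] * 40 + ["Bot"] * 10
samples = [[float(i), float(i % 7)] for i in range(len(labels))]

X_bal, y_bal = undersample(samples, labels, max_per_class=40)
counts = Counter(y_bal)
print(counts)
```

In a pipeline like the one the abstract outlines, the balanced set would then be passed to the feature-selection and LightGBM stages, which are not sketched here.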