Data Redundancy May Lead To Unreliable Intrusion Detection Systems

Mohammed Al-Rawi, Yasmin Al-Zuqary, Firooz B. Saghezchi, Jie Yang, Joaquim Bastos, Jonathan Rodriguez
DOI: https://doi.org/10.1109/IWCMC.2017.7986573
2017-01-01
Abstract: An Intrusion Detection System (IDS) aims to protect a network against attacks intended to expose and/or vandalize it. To build and test an IDS, network data containing both attacks and normal behavior are usually acquired. The objective of this work is to use machine learning techniques to build IDSs and to investigate their reliability. KDDCUP99 has been used to build and test the IDSs. The data comprise a training set of 4,898,430 samples (approximately 700 MB) and a testing set of 311,032 samples (approximately 45 MB). However, cleaning the dataset with SQL commands shows that KDDCUP99 is highly redundant: the cleaned (distinct) data are nearly one fifth the size of the original. Subsequently, experiments have been performed using neural-network-based IDSs. Some IDSs give low and median performance when tested on the redundant data and the distinct data, respectively, whereas other IDSs give high and median performance on the redundant and the distinct data, respectively. Thus, performance fluctuates when the data are redundant, which shows that an IDS built using a redundant dataset has unstable performance. A balanced dataset is prepared only to test the realistic performance of the IDS; it has no relation to IDS generalization or implementation in real-world scenarios.
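The abstract states that SQL commands were used to reduce KDDCUP99 to its distinct records but does not give the queries. The following is a minimal sketch of how such a redundancy check might look, assuming an in-memory SQLite table and a plain `SELECT DISTINCT`; the function name `count_distinct_rows`, the generic column names `c0..cN`, and the toy rows are hypothetical and are not taken from the paper or from the real dataset.

```python
import sqlite3


def count_distinct_rows(rows, n_columns):
    """Return (total, distinct) row counts for `rows` using SQLite.

    Hypothetical helper: the paper only says SQL commands were used,
    so the exact queries here are an assumption.
    """
    conn = sqlite3.connect(":memory:")
    # Generic column names c0, c1, ... stand in for the real feature names.
    cols = ", ".join(f"c{i}" for i in range(n_columns))
    conn.execute(f"CREATE TABLE records ({cols})")

    placeholders = ", ".join("?" for _ in range(n_columns))
    conn.executemany(f"INSERT INTO records VALUES ({placeholders})", rows)

    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    # Two rows count as duplicates only if every column matches.
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM records)"
    ).fetchone()[0]
    conn.close()
    return total, distinct


# Toy records (made up for illustration, not real KDDCUP99 samples).
toy_rows = [
    (0, "tcp", "http", "normal"),
    (0, "tcp", "http", "normal"),
    (0, "icmp", "ecr_i", "smurf"),
    (0, "icmp", "ecr_i", "smurf"),
    (0, "icmp", "ecr_i", "smurf"),
    (2, "udp", "domain_u", "normal"),
]

total, distinct = count_distinct_rows(toy_rows, n_columns=4)
print(f"{total} rows, {distinct} distinct ({distinct / total:.0%} retained)")
```

Comparing the two counts gives the retained fraction, which the abstract reports as nearly one fifth for KDDCUP99. Applying the sketch to the real data would require first loading the comma-separated records with their full set of columns.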