Relative Frequency-Rank Encoding for Unsupervised Network Anomaly Detection

Minsong Kim, Woohyuk Jang, JunNyung Hur, MyungKeun Yoon
DOI: https://doi.org/10.1109/tnet.2024.3391396
2024-08-25
IEEE/ACM Transactions on Networking
Abstract: Network-based anomaly detection plays a pivotal role in cybersecurity. Most detection models use unsupervised machine learning to learn the normal flow patterns of network traffic, such as the numbers of incoming and outgoing packets, traffic volume in bytes, and flow duration, most of which are numerical features. In contrast, non-numerical features have not been fully utilized, although they often give a decisive hint for detecting unseen attacks; for example, rarely observed combinations of IP addresses and port numbers can reveal an uncommon attack attempt. Human experts have used this heuristic for decades, but deep learning models have not yet fully exploited it. In this paper, we present a new encoding scheme for non-numerical features, such as IP addresses and port numbers, that might have been mistakenly treated as numerical features. The scheme first ranks non-numerical feature values in frequency order and then places the ranks evenly between 0 and 1, which transforms raw data into a form that is easier for machine learning models to use. Anomaly detection performance improves significantly when the new encoding scheme is applied to the same deep learning model. For example, a simple autoencoder with the new encoding scheme achieved an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.99 on the well-known CICIDS2017 dataset, whereas the previous record was 0.91. Experimental results on three open datasets show that the proposed encoding scheme can significantly enhance the performance of anomaly detection models.
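The encoding step described in the abstract (rank by frequency, then spread the ranks evenly over [0, 1]) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `fit_frequency_rank` and `encode` are hypothetical, and several details are assumptions the abstract does not specify, namely that the most frequent value maps to 0, that ties are broken by the value's string form, and that values unseen during fitting map to 1.

```python
from collections import Counter


def fit_frequency_rank(values):
    """Map each distinct value to an evenly spaced position in [0, 1].

    Assumption: the most frequent value gets 0.0 and the rarest gets 1.0.
    Assumption: ties in frequency are broken by the value's string form.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    n = len(ordered)
    if n == 1:
        # A single distinct value has nowhere to spread; place it at 0.
        return {ordered[0]: 0.0}
    return {v: i / (n - 1) for i, v in enumerate(ordered)}


def encode(values, mapping, unseen=1.0):
    """Encode values with a fitted mapping.

    Assumption: a value never seen during fitting is treated as maximally rare.
    """
    return [mapping.get(v, unseen) for v in values]


# Toy destination-port column: 443 is common, 22 is rare.
ports = [443, 443, 443, 80, 80, 22]
mapping = fit_frequency_rank(ports)
encoded = encode(ports + [31337], mapping)  # 31337 was not seen in fitting
print(mapping)   # {443: 0.0, 80: 0.5, 22: 1.0}
print(encoded)   # [0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

Under these assumptions, a port such as 22 and a port such as 443 end up far apart because their frequencies differ, not because their numeric values do, which is the property the abstract contrasts with treating ports as ordinary numbers.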
telecommunications; computer science, theory & methods; engineering, electrical & electronic; hardware & architecture