THS-IDPC: A three-stage hierarchical sampling method based on improved density peaks clustering algorithm for encrypted malicious traffic detection

Liangchen Chen, Shu Gao, Baoxu Liu, Zhigang Lu, Zhengwei Jiang
DOI: https://doi.org/10.1007/s11227-020-03372-1
IF: 3.3
2020-06-29
The Journal of Supercomputing
Abstract: With the rapid growth in the volume of encrypted network traffic and in the number of malware samples that use encryption to evade identification, detecting encrypted malicious traffic has become challenging. The quality of the encrypted traffic sampling method directly affects the result of malware detection, but most existing machine learning methods for sampling flow-based encrypted traffic data are inherently inaccurate. To solve these problems, an innovative three-stage hierarchical sampling approach based on an improved density peaks clustering algorithm (THS-IDPC) is proposed to enhance the accuracy and efficiency of encrypted malicious traffic detection models. First, we propose an improved density peaks clustering algorithm based on grid screening, a custom center decision value, and mutual neighbor degree (DPC-GS-MND). In DPC-GS-MND, grid screening reduces the computational complexity and the mutual neighbor degree improves the clustering accuracy. Then, we extract and study three categories of features of encrypted traffic data related to malicious activities, and adopt a three-layer hierarchical clustering algorithm based on DPC-GS-MND. Finally, a three-stage sampling approach based on the three-layer hierarchical clustering algorithm (THS-IDPC) is proposed to sample the encrypted traffic data for further deep detection. The experimental results demonstrate that the proposed THS-IDPC is very effective at filtering normal traffic out of massive encrypted network traffic, and that an encrypted malicious traffic detection model using the THS-IDPC sampling method can detect multiple encrypted malicious traffic families with higher accuracy and efficiency. Meanwhile, DPC-GS-MND and THS-IDPC have good application prospects in network intrusion detection systems in big data environments.
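For orientation, the baseline density peaks clustering algorithm that DPC-GS-MND builds on can be sketched roughly as follows. This is not the paper's method: grid screening and the mutual neighbor degree are omitted because the abstract does not define them, and the decision value used here is the common product of local density and separation distance rather than the paper's custom center decision value. The function name `density_peaks`, the Gaussian kernel, and the parameters `d_c` and `n_clusters` are illustrative assumptions.

```python
import numpy as np


def density_peaks(points, d_c, n_clusters):
    """Baseline density peaks clustering (NOT the paper's DPC-GS-MND).

    points: (n, d) array-like; d_c: cutoff distance (assumed parameter);
    n_clusters: number of centers to pick (assumed parameter).
    Returns (labels, centers), where centers holds point indices.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)

    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

    # Local density rho with a Gaussian kernel; subtract the self term exp(0) = 1.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

    # Visit points from densest to sparsest (ties broken by index).
    order = np.argsort(-rho, kind="stable")

    delta = np.zeros(n)        # distance to the nearest denser point
    parent = np.full(n, -1)    # index of that nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor; use its farthest distance.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Decision value: high density and far from any denser point.
    gamma = rho * delta
    gamma[order[0]] = np.inf   # the densest point is always taken as a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]

    # Centers get fresh labels; every other point inherits its parent's label.
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

On a toy input with two well-separated groups of points, calling `density_peaks(points, d_c=0.5, n_clusters=2)` should pick one center per group and assign each point to the group it lies in. The dense pairwise distance matrix makes this sketch quadratic in the number of points, which is the kind of cost the abstract says grid screening is meant to reduce.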