BigRC-EML: big-data based ransomware classification using ensemble machine learning

Sana Aurangzeb, Haris Anwar, Muhammad Asif Naeem, Muhammad Aleem
DOI: https://doi.org/10.1007/s10586-022-03569-4
2022-03-15
Cluster Computing
Abstract: Ransomware is a subcategory of malware whose specific goal is to hold the victim's data hostage by encrypting it until a ransom is paid. With the mainstream usage of the Windows platform, Windows-based ransomware has become a major threat. With the rise of new malware categories and the huge volume of big data now emerging, it has become difficult to distinguish ransomware from benign applications. At the same time, ransomware detection and classification play a crucial role in computer security. It is therefore essential to analyze the behavior of ransomware samples to understand how their malicious nature differs from that of clean applications. Due to the shortcomings of static analysis, we propose BigRC-EML for ransomware detection and classification based on several static and dynamic features. We use ensemble machine learning methods on big data to enhance the accuracy of ransomware detection. Although many machine learning models have been used to detect ransomware, ensemble methods have not yet been evaluated for this task. Moreover, a new feature selection approach based on Principal Component Analysis (PCA) is presented to reduce the dimensionality of the features. The study employs two datasets: the first is dynamic and comprises 582 ransomware and 942 clean applications, while the second is hybrid and comprises 500 applications. The classification models used are SVM, Random Forests, KNN, XGBoost, and Neural Network. Our experimental results show that the Neural Network outperforms the other models and that BigRC-EML achieves an accuracy of 98% and works with all types of data, i.e., balanced, imbalanced, static, and dynamic. The experimental results validate the effectiveness of the proposed approach in improving the classification accuracy of new ransomware.
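The pipeline the abstract outlines (PCA-based dimensionality reduction followed by an ensemble of classifiers) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. The feature matrix is synthetic, with only the class sizes borrowed from the dynamic dataset; the 95% explained-variance threshold, the hard-voting scheme, and all hyperparameters are assumptions; and scikit-learn's GradientBoostingClassifier stands in for XGBoost to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for behavioural features (assumption): 1524 samples,
# roughly 942 "clean" (label 0) and 582 "ransomware" (label 1), 60 features.
X, y = make_classification(n_samples=1524, n_features=60, n_informative=15,
                           weights=[942 / 1524], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Base learners mirroring the model families named in the abstract.
# GradientBoostingClassifier is a substitute for XGBoost (assumption).
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("nn", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                             random_state=0)),
    ],
    voting="hard",  # majority vote over the five predictions (assumption)
)

# Standardize, keep enough principal components to explain 95% of the
# variance (assumed threshold), then classify with the ensemble.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), ensemble)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
n_kept = model.named_steps["pca"].n_components_
print(f"components kept: {n_kept} of {X.shape[1]}")
print(f"held-out accuracy: {accuracy:.3f}")
```

Fitting the scaler and PCA inside the pipeline means both are learned from the training split only, so no information from the held-out samples leaks into the reduced features. The accuracy printed here describes the synthetic data and says nothing about the figures reported in the paper.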