Malware Detection Using Ensemble N-gram Opcode Sequences

Paul Ntim Yeboah,Stephen Kweku Amuquandoh,Haruna Balle Baz Musah
DOI: https://doi.org/10.3991/ijim.v15i24.25401
2021-12-21
International Journal of Interactive Mobile Technologies (iJIM)
Abstract:Conventional approaches to tackling malware attacks have proven to be futile at detecting never-before-seen (zero-day) malware. Research however has shown that zero-day malicious files are mostly semantic-preserving variants of already existing malware, which are generated via obfuscation methods. In this paper we propose and evaluate a machine learning based malware detection model using ensemble approach. We employ a strategy of ensemble where multiple feature sets generated from different n-gram sizes of opcode sequences are trained using a single classifier. Model predictions on the trained multi feature sets are weighted and combined on average to make a final verdict on whether a binary file is malicious or benign. To obtain optimal weight combination for the ensemble feature sets, we applied a grid search on a set of pre-defined weights in the range 0 to 1. With a balanced dataset of 2000 samples, an ensemble of n-gram opcode sequences of n sizes 1 and 2 with respective weight pair 0.3 and 0.7 yielded the best detection accuracy of 98.1% using random forest (RF) classifier. Ensemble n-gram sizes 2 and 3 obtained 99.7% as best precision using weight 0.5 for both models.
What problem does this paper attempt to address?