Improved Deep Learning Model for Static PE Files Malware Detection and Classification

Sumit S. Lad, Amol C. Adamuthe
DOI: https://doi.org/10.5815/ijcnis.2022.02.02
2022-04-08
International Journal of Computer Network and Information Security
Abstract: Static analysis and detection of malware is a crucial phase in handling security threats. Many researchers report that static analysis suffers from imbalanced datasets, which lead to invalid result metrics. Extracting features from raw binaries is time-consuming, and methods such as neural networks require long training times. To address these problems, we propose a model capable of building a feature set from the dataset and classifying static PE files efficiently. The research emphasizes the importance of feature extraction over model building: well-extracted features yield better results when fed to neural networks with a minimal number of layers. Using fewer layers improves the performance of the model and reduces the resources and time needed for processing and evaluation. This work uses the EMBER datasets published by Endgame Inc., which contain PE file information. Feature extraction, data standardization, and data cleaning techniques are applied to handle the imbalance and impurities in the dataset. The extracted features are then scaled into a standard form to avoid problems related to range variations. A total of 2381 features are extracted and pre-processed from each of the 2017 and 2018 datasets. The pre-processed data is then given to a deep learning model for training. The deep learning model is built from dense and dropout layers to minimize resource strain and deliver more accurate results in less time. The results obtained during experimentation for the EMBER v2017 and v2018 datasets are 97.53% and 94.09%, respectively. The model is trained for ten epochs with a learning rate of 0.01 and takes 4 minutes per epoch, one minute less than the Decision Tree model. In terms of precision, our model achieved 98.85%, which is 1.85% higher than existing models.
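The pipeline the abstract describes (scale the features into a standard form, then pass them through dense and dropout layers) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the abstract does not state layer sizes, activations, or the dropout rate, so the two-feature toy input, the ReLU and sigmoid activations, the 0.5 dropout rate, and every function and parameter name below are assumptions chosen for illustration. A real model over 2381 features would normally use a deep learning framework instead.

```python
import math
import random


def standardize(rows):
    """Z-score each feature column: (x - mean) / std; a zero std is treated as 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """Fully connected layer: activation(W x + b) for a single input vector."""
    return [activation(sum(w * xi for w, xi in zip(w_row, x)) + b)
            for w_row, b in zip(weights, biases)]


def dropout(x, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate` during training only."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def predict(x, params, training=False, rng=None):
    """Dense -> dropout -> dense; returns a score in (0, 1) for one sample."""
    hidden = dense(x, params["w1"], params["b1"], relu)
    hidden = dropout(hidden, params["rate"], training, rng or random.Random(0))
    return dense(hidden, params["w2"], params["b2"], sigmoid)[0]


# Hypothetical toy data: three samples with two features on very different scales.
rows = [[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]]
scaled = standardize(rows)

# Hypothetical, hand-picked weights; a trained model would learn these.
params = {
    "w1": [[0.5, -0.5], [0.25, 0.75]],
    "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0]],
    "b2": [0.0],
    "rate": 0.5,
}
score = predict(scaled[0], params)
```

After standardization both columns have zero mean and unit variance, so neither feature dominates the weighted sums just because of its range. Dropout is active only when `training=True`, so inference is deterministic.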