Abstract: Existing research on malware detection focuses almost exclusively on the detection rate. However, in some cases it is also important to understand the results of the algorithm, or to obtain more information, such as where in the file an analyst should investigate. To this end, we propose a new model to analyze Portable Executable files. Our method consists of splitting each file into its sections, then transforming each section into an image, in order to train convolutional neural networks (CNNs) that each handle one identified section. We then use the scores returned by the CNNs to compute a final detection score, using models that let us analyze the importance of each section in the final score.
What problem does this paper attempt to address?
The paper addresses two related issues in static malware detection:
1. **Looking beyond the detection rate**: Existing malware detection research has focused almost entirely on improving the detection rate. The authors argue that, in some cases, understanding the algorithm's results and obtaining more information (for example, which specific parts of the file need to be investigated) is equally important.
2. **Enhancing the interpretability of the results**: To help analysts see which parts of a file are most likely to be malicious, and so reduce investigation time, the authors aim to provide a method that explains the importance of each part of the file.
To solve these problems, the authors propose a new model to analyze Portable Executable (PE) files. Specifically, they split a PE file into its parts, convert each part into an image, and train a dedicated Convolutional Neural Network (CNN) for each part. Finally, they feed the output scores of all these CNNs into a scoring function that computes the final detection score; the same scoring function can be used to analyze how much each part contributes to the final score.
### Method overview
1. **Dataset and pre-processing**:
- A reliable dataset was constructed from the Bodmas and PEMachineLearning datasets plus 10,000 recent malware samples from VirusTotal.
- The different parts of the PE file were identified using the LIEF library, and each part was converted into a 64×64 grayscale image.
2. **Multi-CNN training**:
- A CNN was trained separately for each part. If a part does not exist in a file, a special score (such as -1) is assigned to that part to avoid introducing bias.
- Seven CNNs were trained, one per main part (including .text, .data, .rdata, .rsrc, .reloc, and .idata), and XGBoost and Random Forest were selected as the final scoring functions.
3. **Scoring function optimization**:
- Multiple scoring functions were tested, including XGBoost, Random Forest, and LightGBM; XGBoost and Random Forest were ultimately selected as the best-performing models.
- The influence of each part on the final classification was evaluated through feature importance analysis, such as mean decrease in impurity (MDI) and feature permutation importance.
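
The pre-processing and score-assembly steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how section bytes are fitted to 64×64 pixels, so truncating or zero-padding to 4,096 bytes is an assumption, and the helper names `bytes_to_image` and `score_vector` are hypothetical. In practice the raw bytes would come from a PE parser such as LIEF; that step is omitted here to keep the sketch dependency-free. Only the six section names listed above are used.

```python
from typing import Dict, List, Optional

IMAGE_SIDE = 64        # image width and height stated in the overview
MISSING_SCORE = -1.0   # sentinel for an absent section (the "such as -1" above)
# The six section names listed in the overview; this ordering is an arbitrary choice.
SECTIONS = [".text", ".data", ".rdata", ".rsrc", ".reloc", ".idata"]


def bytes_to_image(raw: bytes, side: int = IMAGE_SIDE) -> List[List[int]]:
    """Map raw section bytes to a side x side grid of 0-255 grayscale pixels.

    Assumption: content is truncated or zero-padded to side * side bytes.
    """
    n = side * side
    padded = raw[:n].ljust(n, b"\x00")
    # Each byte becomes one pixel; rows are filled left to right, top to bottom.
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


def score_vector(scores: Dict[str, Optional[float]]) -> List[float]:
    """Order per-section CNN scores, substituting the sentinel for absent sections."""
    vector = []
    for name in SECTIONS:
        score = scores.get(name)
        vector.append(MISSING_SCORE if score is None else float(score))
    return vector


# Hypothetical usage: a file that only has .text and .rsrc sections.
image = bytes_to_image(b"\x4d\x5a\x90")
features = score_vector({".text": 0.91, ".rsrc": 0.12})
```

A fixed-length vector of this kind is what a tabular model such as XGBoost or Random Forest would consume as input, with one column per section.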
### Experimental results
- **Accuracy**: The accuracy on the test set is 0.96, and the F1-score is also 0.96.
- **Performance improvement**: Compared with previous similar work, accuracy improves by 1.5%, and the model is more interpretable.
- **Feature importance**: The .idata and .rsrc parts show high importance in the classification task.
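
To make the feature permutation importance mentioned in the method overview concrete, here is a small self-contained sketch. It is not the authors' code: the `predict` function and the toy data are hypothetical stand-ins for a trained scoring function and real per-section scores. The idea is to shuffle one column of per-section scores at a time and measure how much accuracy drops; a large drop suggests the model relies on that section. Libraries such as scikit-learn offer a ready-made equivalent (`sklearn.inspection.permutation_importance`).

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return hits / len(y)


def permutation_importance(predict: Callable[[Sequence[float]], int],
                           X: Sequence[Sequence[float]], y: Sequence[int],
                           n_repeats: int = 10, seed: int = 0) -> List[float]:
    """Mean accuracy drop when each feature column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [list(row[:col]) + [value] + list(row[col + 1:])
                      for row, value in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy setup: column 0 decides the label, column 1 is ignored.
toy_X = [[0.9 if i % 2 == 0 else 0.1, (i * 37 % 100) / 100] for i in range(40)]
toy_y = [1 if i % 2 == 0 else 0 for i in range(40)]
toy_predict = lambda row: int(row[0] > 0.5)
toy_importances = permutation_importance(toy_predict, toy_X, toy_y)
```

In this toy setup the second column receives an importance of exactly zero because the stand-in model never reads it, while the first column receives a clearly positive value.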
### Conclusions and future work
- A distributed CNN model for classifying malware and benign files was proposed. By converting different parts of the PE file into images and training specific CNNs, efficient and interpretable detection was achieved.
- In the future, more parts can be added for analysis, more appropriate scoring functions can be explored, and the model structure can be further optimized.
Overall, the method aims both to improve the accuracy of malware detection and to give analysts more useful information, helping them locate potential threats more quickly.