Abstract: Existing research on malware detection focuses almost exclusively on the detection rate. However, in some cases it is also important to understand the results of the algorithm, or to obtain more information, such as where in the file an analyst should investigate. To this end, we propose a new model to analyze Portable Executable files. Our method consists of splitting each file into its sections, then transforming each section into an image, in order to train convolutional neural networks (CNNs) that each handle one identified section. We then use the scores returned by the CNNs to compute a final detection score, using models that enable us to better analyze the importance of each section in the final score.
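
The section-to-image step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 64×64 size comes from the summary on this page, while the nearest-neighbour resampling in `bytes_to_image` and the helper name `read_sections` are assumptions made for illustration. `read_sections` relies on the third-party LIEF library (`lief.parse`, `binary.sections`, `section.name`, `section.content`).

```python
from typing import Dict, List

SIDE = 64  # image side length reported in the summary (64x64 grayscale)


def bytes_to_image(data: bytes, side: int = SIDE) -> List[List[int]]:
    """Map raw section bytes to a side x side grid of grayscale values (0-255).

    Assumption: nearest-neighbour resampling over the byte stream, so long
    sections are subsampled and short ones are stretched. The paper may use
    a different resizing scheme.
    """
    n = side * side
    if not data:
        # An empty section becomes an all-black image.
        return [[0] * side for _ in range(side)]
    # Pick n evenly spaced bytes; the index is always < len(data).
    flat = [data[i * len(data) // n] for i in range(n)]
    return [flat[r * side:(r + 1) * side] for r in range(side)]


def read_sections(path: str) -> Dict[str, bytes]:
    """Return {section name: raw bytes} for a PE file (hypothetical helper)."""
    import lief  # third-party; imported lazily so bytes_to_image works without it

    binary = lief.parse(path)
    if binary is None:
        raise ValueError(f"could not parse {path!r} as an executable")
    return {s.name: bytes(s.content) for s in binary.sections}
```

Each resulting grid could then be passed to the CNN trained for that section name, for example `bytes_to_image(read_sections("sample.exe")[".text"])`, where `sample.exe` is a placeholder path.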

What problem does this paper attempt to address?

The paper addresses two issues in static malware detection:

1. **Exclusive focus on the detection rate**: Existing malware detection research concentrates almost entirely on improving the detection rate. The authors argue that, in some cases, understanding the algorithm's results and obtaining additional information (for example, which parts of the file to investigate) is equally important.
2. **Enhancing the interpretability of the results**: To help analysts see which sections are most likely to be malicious, and so reduce investigation time, the authors aim to provide a method that explains the importance of each section of the file.

To address these issues, the authors propose a new model for analyzing Portable Executable (PE) files. They split a PE file into its sections, convert each section into an image, and train a dedicated Convolutional Neural Network (CNN) for each section. The output scores of all the CNNs are then combined by a scoring function into a final detection score, and this scoring function also makes it possible to analyze how much each section contributes to the final score.

### Method overview

1. **Dataset and pre-processing**:
   - A reliable dataset was constructed from the Bodmas and PEMachineLearning datasets plus 10,000 recent malware samples from VirusTotal.
   - The sections of each PE file were identified using the LIEF library, and each section was converted into a 64×64 grayscale image.
2. **Multi-CNN training**:
   - A CNN was trained separately for each section. If a section does not exist in a file, a special score (such as -1) is assigned to it to avoid introducing bias.
   - Seven CNNs were trained, corresponding to seven main sections (.text, .data, .rdata, .rsrc, .reloc, .idata), and their scores are passed to the final scoring function.
3. **Scoring function optimization**:
   - Several scoring functions were tested, including XGBoost, Random Forest, and LightGBM; XGBoost and Random Forest were selected as the best-performing models.
   - The influence of each section on the final classification was evaluated through feature importance analysis (such as MDI and permutation feature importance).

### Experimental results

- **Accuracy**: The accuracy on the test set is 0.96, and the F1-score is also 0.96.
- **Performance improvement**: Compared with previous similar work, accuracy increased by 1.5%, with better interpretability.
- **Feature importance**: The .idata and .rsrc sections show high importance in the classification task.

### Conclusions and future work

- A distributed CNN model for classifying malware and benign files was proposed. By converting the sections of a PE file into images and training a dedicated CNN for each, efficient and interpretable detection was achieved.
- In the future, more sections can be added to the analysis, more appropriate scoring functions can be explored, and the model structure can be further optimized.

This method can improve the accuracy of malware detection while also giving analysts more useful information for locating potential threats quickly.
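
The aggregation and feature-importance steps from the method overview can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names `score_vector` and `fit_and_rank` are hypothetical, the hyperparameters are arbitrary, and only the six section names that the summary actually lists are used (the seventh is not named there). The Random Forest and permutation-importance calls come from scikit-learn; XGBoost could be substituted as the scoring function.

```python
from typing import Dict, List, Optional

# Only the six section names listed in the summary; the seventh is not named there.
SECTIONS = [".text", ".data", ".rdata", ".rsrc", ".reloc", ".idata"]
MISSING = -1.0  # sentinel score for an absent section, as described in the summary


def score_vector(cnn_scores: Dict[str, Optional[float]]) -> List[float]:
    """Order per-section CNN scores into a fixed-length feature vector.

    Sections that are absent from the file (missing key or None) get MISSING.
    """
    out = []
    for name in SECTIONS:
        score = cnn_scores.get(name)
        out.append(MISSING if score is None else float(score))
    return out


def fit_and_rank(X, y):
    """Fit a Random Forest scoring function and rank sections by importance.

    X: list of score vectors, y: labels (0 = benign, 1 = malware).
    Hyperparameters are illustrative assumptions, not values from the paper.
    """
    from sklearn.ensemble import RandomForestClassifier  # third-party
    from sklearn.inspection import permutation_importance

    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # MDI: impurity-based importances exposed by the fitted forest.
    mdi = dict(zip(SECTIONS, model.feature_importances_))
    # Permutation importance: score drop when one section's column is shuffled.
    perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    return model, mdi, dict(zip(SECTIONS, perm.importances_mean))
```

For example, `score_vector({".text": 0.9, ".rsrc": 0.2})` yields `[0.9, -1.0, -1.0, 0.2, -1.0, -1.0]`. In practice, permutation importance is usually computed on held-out data rather than on the training set used here for brevity.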

Use of Multi-CNNs for Section Analysis in Static Malware Detection
