High-resolution Image-based Malware Classification using Multiple Instance Learning

Tim Peters, Hikmat Farhat
2023-11-22
Abstract: This paper proposes a novel method of classifying malware into families using high-resolution greyscale images and multiple instance learning to overcome adversarial binary enlargement. Current methods of visualisation-based malware classification largely rely on lossy transformations of inputs such as resizing to handle the large, variable-sized images. Through empirical analysis and experimentation, it is shown that these approaches cause crucial information loss that can be exploited. The proposed solution divides the images into patches and uses embedding-based multiple instance learning with a convolutional neural network and an attention aggregation function for classification. The implementation is evaluated on the Microsoft Malware Classification dataset and achieves accuracies of up to $96.6\%$ on adversarially enlarged samples compared to the baseline of $22.8\%$. The Python code is available online at https://github.com/timppeters/MIL-Malware-Images.
Cryptography and Security, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the information loss that image-based malware classifiers incur when they shrink large, variable-sized malware images to a fixed input size. This loss leaves the models susceptible to adversarial enlargement attacks. Specifically:

- **Information Loss Issue**: Current image-based malware classification methods typically rely on lossy transformations such as resizing to handle large, variable-sized inputs, which discards critical information.
- **Adversarial Attacks**: An attacker can enlarge a binary by adding a large amount of redundant data, which increases the resolution of its image. When that image is resized, the critical information is interpolated away and classification performance drops.
- **Proposed Solution**: The paper divides high-resolution greyscale images into patches and classifies them with embedding-based Multiple Instance Learning (MIL), using a Convolutional Neural Network (CNN) and an attention aggregation function. Evaluated on the Microsoft Malware Classification dataset, the method achieves accuracies of up to 96.6% on adversarially enlarged samples, compared with 22.8% for the baseline.
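
The patch-and-aggregate idea described above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The function names (`bytes_to_patches`, `embed_patch`, `attention_pool`), the image width, the band-shaped patches, the hand-made two-feature embedding standing in for the CNN, and the fixed values of `V` and `w` are all assumptions made for illustration. The attention step uses one common formulation, a softmax over `w · tanh(V h_k)`, which may differ from the aggregation function used in the paper. Enlargement is modelled as zero bytes appended to the file, which is also an assumption.

```python
import math


def bytes_to_patches(data, width=16, patch_rows=16):
    """Lay the bytes out as a greyscale image `width` pixels wide and cut it
    into bands of `patch_rows` rows; the final band is zero-padded."""
    patch_size = width * patch_rows
    padded = data + bytes(-len(data) % patch_size)
    return [list(padded[i:i + patch_size])
            for i in range(0, len(padded), patch_size)]


def embed_patch(patch):
    """Hypothetical stand-in for the CNN: two hand-made features in [0, 1]."""
    mean_intensity = sum(patch) / (255 * len(patch))
    nonzero_fraction = sum(1 for b in patch if b) / len(patch)
    return [mean_intensity, nonzero_fraction]


def attention_pool(embeddings, V, w):
    """Attention aggregation: score_k = w . tanh(V h_k), weights are the
    softmax of the scores, and the bag embedding is the weighted sum."""
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    bag = [sum(a * h[d] for a, h in zip(weights, embeddings))
           for d in range(dim)]
    return bag, weights


# Toy "binary" and an enlarged copy (assumption: padding is appended zeros).
payload = bytes((7 * i + 1) % 256 for i in range(1000))
enlarged = payload + bytes(5000)

patches = bytes_to_patches(payload)
enlarged_patches = bytes_to_patches(enlarged)

# Illustrative attention parameters; in a real model these are learned.
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
bag, weights = attention_pool([embed_patch(p) for p in enlarged_patches], V, w)

print(len(patches), len(enlarged_patches))
print([round(a, 3) for a in weights[:6]])
```

In this toy setting, appending data only adds new patches and leaves the original ones byte-for-byte intact, whereas resizing the whole image would shrink them. The attention weights then let the aggregation favour the informative patches over the padding. Data inserted elsewhere in the file would shift patch boundaries, which this sketch does not model.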