MALIGN: Explainable Static Raw-byte Based Malware Family Classification using Sequence Alignment

Shoumik Saha, Sadia Afroz, Atif Rahman
DOI: https://doi.org/10.1016/j.cose.2024.103714
2024-01-13
Abstract: For a long time, malware classification and analysis have been an arms-race between antivirus systems and malware authors. Though static analysis is vulnerable to evasion techniques, it is still popular as the first line of defense in antivirus systems. But most of the static analyzers failed to gain the trust of practitioners due to their black-box nature. We propose MAlign, a novel static malware family classification approach inspired by genome sequence alignment that can not only classify malware families but can also provide explanations for its decision. MAlign encodes raw bytes using nucleotides and adopts genome sequence alignment approaches to create a signature of a malware family based on the conserved code segments in that family, without any human labor or expertise. We evaluate MAlign on two malware datasets, and it outperforms other state-of-the-art machine learning based malware classifiers (by 4.49% - 0.07%), especially on small datasets (by 19.48% - 1.2%). Furthermore, we explain the generated signatures by MAlign on different malware families illustrating the kinds of insights it can provide to analysts, and show its efficacy as an analysis tool. Additionally, we evaluate its theoretical and empirical robustness against some common attacks. In this paper, we approach static malware analysis from a unique perspective, aiming to strike a delicate balance among performance, interpretability, and robustness.
Cryptography and Security
What problem does this paper attempt to address?
The paper aims to address three key issues in static malware classification:

1. **Interpretability Issue**: Existing deep learning-based static analysis methods perform well in malware classification, but their "black-box" nature makes them opaque, so security experts find it difficult to fully trust them.
2. **Robustness Issue**: Current static malware classifiers are vulnerable to various adversarial attacks, such as those that modify bytes without changing the malware's semantics.
3. **Data Requirement Issue**: End-to-end deep learning models require a large amount of training data to perform well, and sufficient samples are often hard to obtain quickly when a new malware variant emerges.

To address these issues, the authors propose a new static malware family classification method called MAlign. MAlign applies sequence alignment techniques from bioinformatics directly to the raw bytes of a binary: it encodes the bytes as nucleotides, identifies the code segments conserved across the members of a malware family, and builds a signature for that family from those segments. The method not only classifies malware families but also explains its decisions, which makes the model easier to trust, and the authors theoretically prove its robustness against gradient-based attacks.

Specifically, the main contributions of MAlign are as follows:

- **Sequence Alignment Method**: By applying sequence alignment from bioinformatics to static malware family classification, MAlign identifies conserved regions within each malware family and constructs the family's signature from those regions, without human labor or expertise.
- **High Accuracy**: Experimental results on two datasets show that MAlign outperforms other state-of-the-art machine learning based malware classifiers (by 4.49% - 0.07%), especially on small datasets (by 19.48% - 1.2%).
- **Interpretability**: By design, MAlign can trace a classification decision back to the exact code blocks that produced it. This helps security analysts understand the classification results and discover valuable information about a family.
- **Robustness**: MAlign is designed with adversarial attacks in mind. The authors theoretically prove its robustness against gradient-based attacks and experimentally validate its robustness against gradient-based patch attacks.

In summary, MAlign is a static malware analysis technique that balances performance, interpretability, and robustness, aiming to enhance the capabilities of existing malware detection tools.
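
To make the pipeline described above more concrete, here is a minimal Python sketch of the general idea: encode raw bytes as nucleotides, collect blocks that are conserved across a family, and trace a match back to byte offsets. It is an illustration under stated assumptions, not the authors' implementation. The 2-bit mapping onto `ACGT`, the fixed-size window comparison (a crude stand-in for genome sequence alignment), and the names `encode_bytes`, `decode_nucleotides`, `conserved_blocks`, and `explain` are all hypothetical.

```python
from collections import Counter

# ASSUMPTION: a 2-bit-per-nucleotide mapping; the paper's encoding may differ.
NUCLEOTIDES = "ACGT"


def encode_bytes(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bit pair first."""
    out = []
    for b in data:
        for shift in (6, 4, 2, 0):
            out.append(NUCLEOTIDES[(b >> shift) & 0b11])
    return "".join(out)


def decode_nucleotides(seq: str) -> bytes:
    """Inverse of encode_bytes; maps a nucleotide block back to raw bytes."""
    vals = [NUCLEOTIDES.index(c) for c in seq]
    return bytes(
        (vals[i] << 6) | (vals[i + 1] << 4) | (vals[i + 2] << 2) | vals[i + 3]
        for i in range(0, len(vals), 4)
    )


def conserved_blocks(samples: list[bytes], k: int = 8,
                     min_fraction: float = 1.0) -> set[str]:
    """Return k-byte windows (as nucleotide strings) that occur in at least
    min_fraction of the samples.

    ASSUMPTION: exact window matching replaces real sequence alignment here,
    so insertions, deletions, and mutations are not tolerated.
    """
    counts = Counter()
    for sample in samples:
        seq = encode_bytes(sample)
        # A set, so each window counts at most once per sample.
        windows = {seq[i:i + 4 * k] for i in range(0, len(seq) - 4 * k + 1, 4)}
        counts.update(windows)
    needed = min_fraction * len(samples)
    return {w for w, c in counts.items() if c >= needed}


def explain(sample: bytes, signature: set[str],
            k: int = 8) -> list[tuple[int, bytes]]:
    """List (byte offset, raw bytes) for every signature block in the sample."""
    seq = encode_bytes(sample)
    hits = []
    for i in range(0, len(seq) - 4 * k + 1, 4):
        window = seq[i:i + 4 * k]
        if window in signature:
            hits.append((i // 4, decode_nucleotides(window)))
    return hits
```

Calling `conserved_blocks` on a handful of samples from one family yields a toy signature, and `explain` then reports where those blocks occur in a new sample, which mirrors the "trace back to the exact code blocks" idea in a very simplified form. A real alignment-based approach would tolerate gaps and substitutions between samples, which this sketch does not.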