Abstract:<p>New malware variants appear rapidly and continuously increase the difficulty to classify malware into correct families. This brings two challenges for malware classification: The first is the scarce samples problem, where collecting a large volume of a newly detected malware family to train a classifier can be extremely hard and it is unavoidable to suffer from overfitting using a small number of samples. The second is the dynamic recognition problem. Most widely adopted classifiers are trained on predefined known malware families, lacking ability to incrementally identifying novel families, which require to retrain from scratch. To tackle these challenges, in this study, we employ the meta-learning based few-shot learning (FSL) technique and propose a new few-shot malware classification model called SIMPLE (Supervised Infinite Mixture Prototypes LEarning). With the help of meta-learning, SIMPLE is trained with predefined malware families and can maintain its ability to classify novel malware families that has never met. Furthermore, the prior knowledge learned via meta-learning can prevent from overfitting caused by scarce samples. Our proposed SIMPLE introduces multi-prototype modeling to generate multiple prototypes of each family to enhance the generalization, based on API invocation sequences from dynamic analysis. This is inspired by the observation that behaviors within the same family often match multiple subpatterns and satisfy multimodal data distribution. In the broad experiments, SIMPLE achieves the state-of-the-art few-shot malware classification performance and outperforms all the baselines. With only 5 samples per family, SIMPLE reaches a very high accuracy of 90% in 5-way classification task on novel malware families, which substantially solves the problem of scarce samples and dynamic recognition. We also make analysis on the reason of effectiveness with multi-prototype and fast adaption feature to provide more interpretability for the results.</p>

A novel few-shot malware classification approach for unknown family recognition with multi-prototype modeling