FAMCF: A Few-Shot Android Malware Family Classification Framework

Fan Zhou, Dongxia Wang, Yanhai Xiong, Kun Sun, Wenhai Wang
DOI: https://doi.org/10.1016/j.cose.2024.104027
IF: 5.105
2024-01-01
Computers & Security
Abstract: Android malware is a major cyber threat to the popular Android platform and may affect millions of end users. To battle against Android malware, a large number of machine learning-based approaches have been developed and have achieved promising results. However, the vast majority of existing work relies on a large number of labeled samples, which are unfortunately not available for newly reported Android malware families. This poses a critical challenge for classifying such few-shot Android malware families. In this paper, we propose FAMCF, a novel few-shot learning-based classification pipeline to solve the problem. Faced with insufficient labeled samples from few-shot malware families, we learn how to extract features by training on another base dataset, which is of a much larger scale but has a label space disjoint from the few-shot families. We consider three types of features based on static analysis, namely permissions, API calls, and opcodes. We train a classifier for each type of feature, utilizing a metric-based few-shot learning approach, and obtain an ensemble decision. Specifically, for each classifier, given a query sample to be classified, we propose to compare it to the prototypes of all the families, which are generated in a query-dependent way. We compared the classification performance of FAMCF to that of existing solutions in multiple categories, including traditional machine learning-based approaches, few-shot Android malware classification approaches, and state-of-the-art few-shot learning methods from other fields. We also analyzed the robustness of FAMCF against multiple popular obfuscation techniques.
Extensive experiments on the popular Drebin and CICInvesAndMal2019 datasets confirm the effectiveness and robustness of FAMCF in classifying few-shot Android malware families, e.g., we achieve at least a 4.86% improvement in classification accuracy on Drebin and keep the decrease in accuracy within 1% under seven common types of obfuscation techniques.
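
The abstract describes comparing a query sample with per-family prototypes that are generated in a query-dependent way, using one classifier per feature type and an ensemble decision. The sketch below is only an illustration of that general idea and is not the authors' implementation. It rests on assumptions the abstract does not state: embeddings are plain lists of floats, a prototype is a similarity-weighted mean of a family's support samples, distances are squared Euclidean, and the ensemble averages the per-view probabilities. All function names, family names, and toy values are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def query_dependent_prototype(support, query):
    # Assumption: support samples closer to the query get larger weights,
    # so the prototype changes with each query.
    weights = softmax([-sq_dist(s, query) for s in support])
    return [sum(w * s[i] for w, s in zip(weights, support))
            for i in range(len(query))]

def family_probabilities(supports, query):
    # supports: dict mapping family name -> list of support embeddings.
    families = sorted(supports)
    protos = [query_dependent_prototype(supports[f], query) for f in families]
    probs = softmax([-sq_dist(p, query) for p in protos])
    return dict(zip(families, probs))

def ensemble_predict(view_supports, view_queries):
    # Assumption: the ensemble decision is the mean probability over views.
    families = sorted(next(iter(view_supports.values())))
    avg = {f: 0.0 for f in families}
    for view, supports in view_supports.items():
        probs = family_probabilities(supports, view_queries[view])
        for f in families:
            avg[f] += probs[f] / len(view_supports)
    best = max(families, key=lambda f: avg[f])
    return best, avg

# Toy 2-shot, 2-family episode with made-up 2-D embeddings per feature view.
toy_family = {"FamilyA": [[0.0, 0.1], [0.2, 0.0]],
              "FamilyB": [[5.0, 5.1], [4.9, 5.0]]}
view_supports = {v: toy_family for v in ("permissions", "api_calls", "opcodes")}
view_queries = {v: [0.1, 0.1] for v in view_supports}
label, probs = ensemble_predict(view_supports, view_queries)
```

In a real pipeline the embeddings would come from feature extractors trained on the larger base dataset, one per feature type, as the abstract outlines. Here they are hard-coded to keep the sketch self-contained.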