A Knowledge Graph-based Sensitive Feature Selection for Android Malware Classification

Duoyuan Ma, Yude Bai, Zhenchang Xing, Lintan Sun, Xiaohong Li
DOI: https://doi.org/10.1109/apsec51365.2020.00027
2020-01-01
Abstract: The rapid increase in Android malware has brought great challenges to malware analysis. To cope with this situation, an effective approach has been proposed that groups malware with common behaviors into the same malware family. Although there are many methods for malware family classification, the most critical first step is always defining the sensitive behaviors of an application, which benefits the later classification task. Much of the existing literature manually selects sensitive features, such as permissions, or designs graph-based features from the control flow graph. These approaches depend heavily on expert knowledge and time-consuming analysis of malware applications, which means they must first focus on the malware itself to extract valuable security knowledge. However, the rapidly growing volume of malware overwhelms such expensive feature definition methods. To overcome this problem, we adopt a knowledge graph-based sensitive feature selection method for Android malware classification. We first construct an Android API knowledge graph from the Android Developer documentation. From this graph we obtain not only permissions but also related critical APIs. Both the hyperlink relation and the similarity relation are used to find the critical APIs. With the knowledge graph-based sensitive features, we represent each Android malware sample as a boolean feature vector and feed it into a machine learning classifier for malware classification. We evaluate our proposed method on three well-known Android malware datasets: Genome, Drebin, and AMD. The experimental results show that: 1) our proposed sensitive APIs are advantageous for malware detection; 2) APIs chosen by the similarity relation can marginally improve performance; 3) different permission groups also influence classification.
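The boolean feature vector described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph `PERMISSION_TO_APIS`, the helper names `build_feature_index` and `to_boolean_vector`, and the sample app are all hypothetical, and the permission-to-API edges are chosen only for illustration. The sketch assumes the knowledge graph has already been reduced to a mapping from each permission to the APIs linked to it.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy knowledge graph: permission -> APIs linked to it.
# These edges are illustrative assumptions, not taken from the paper.
PERMISSION_TO_APIS: Dict[str, Set[str]] = {
    "android.permission.SEND_SMS": {
        "android.telephony.SmsManager.sendTextMessage",
    },
    "android.permission.READ_CONTACTS": {
        "android.content.ContentResolver.query",
    },
}


def build_feature_index(graph: Dict[str, Set[str]]) -> List[str]:
    """Collect permissions and their linked APIs into a fixed, sorted order."""
    features: Set[str] = set(graph)
    for apis in graph.values():
        features |= apis
    return sorted(features)


def to_boolean_vector(app_features: Iterable[str], index: List[str]) -> List[int]:
    """Return 1 where the app uses the indexed sensitive feature, else 0."""
    present = set(app_features)
    return [1 if name in present else 0 for name in index]


FEATURE_INDEX = build_feature_index(PERMISSION_TO_APIS)

# Hypothetical app that requests SEND_SMS and calls sendTextMessage.
sample_app = {
    "android.permission.SEND_SMS",
    "android.telephony.SmsManager.sendTextMessage",
    "java.lang.String.length",  # not in the sensitive set, so it is ignored
}
sample_vector = to_boolean_vector(sample_app, FEATURE_INDEX)
```

Under these assumptions, each app becomes one row of 0/1 values in a fixed feature order, which is the kind of input a standard machine learning classifier accepts. Features outside the sensitive set are simply dropped.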