Slowing Down the Aging of Learning-Based Malware Detectors with API Knowledge

Xiaohan Zhang, Mi Zhang, Yuan Zhang, Ming Zhong, Xin Zhang, Yinzhi Cao, Min Yang
DOI: https://doi.org/10.1109/tdsc.2022.3144697
2022-01-01
IEEE Transactions on Dependable and Secure Computing
Abstract: Learning-based malware detectors are widely used in practice to safeguard real-world computers. One major challenge is model aging, where the effectiveness of these models drops drastically as malware variants keep evolving. To tackle model aging, most existing works label new samples and retrain the aged models. However, such data-perspective methods often incur excessive labeling and retraining costs. In this article, we observe that, as malware evolves, samples often preserve similar malicious semantics while switching to new implementations that use semantically equivalent APIs. This observation lets us approach the problem from a different perspective: the feature space. More specifically, if a model can capture the intrinsic semantics of malware variants in the feature space, the aging of learning-based detectors slows down. Based on this insight, we design APIGraph to automatically extract API knowledge from API documentation and incorporate this knowledge into the training of malware detection models. We use APIGraph to enhance five state-of-the-art malware detectors, covering both the Android and Windows platforms and various learning algorithms. Experiments on large-scale, evolutionary datasets with nearly 340K samples show that APIGraph slows down the aging of these models by 5.9% to 19.6% and, on top of data-perspective methods, reduces labeling effort by 33.07% to 96.30%.
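The feature-space idea in the abstract can be illustrated with a minimal sketch. It assumes the extracted API knowledge takes the form of groups of semantically similar APIs; the abstract does not specify this. The API names, the `API_CLUSTER` mapping, and both helper functions are hypothetical and are not taken from APIGraph itself.

```python
# Hypothetical mapping from individual APIs to semantic groups.
# In practice such groups would be derived from API documentation;
# here they are hand-written purely for illustration.
API_CLUSTER = {
    "http_post": "network_send",
    "socket_write": "network_send",
    "get_imei": "device_identifier",
    "get_subscriber_id": "device_identifier",
}

VOCAB = sorted(API_CLUSTER)                   # one feature per API
CLUSTERS = sorted(set(API_CLUSTER.values()))  # one feature per semantic group


def api_features(apis, vocabulary=VOCAB):
    """Binary vector marking which individual APIs a sample uses."""
    used = set(apis)
    return [1 if api in used else 0 for api in vocabulary]


def cluster_features(apis, clusters=CLUSTERS):
    """Binary vector marking which semantic groups a sample touches."""
    used = {API_CLUSTER[api] for api in apis if api in API_CLUSTER}
    return [1 if cluster in used else 0 for cluster in clusters]


# Two variants with the same behavior (read an identifier, send it out)
# but different, semantically equivalent APIs.
old_variant = ["get_imei", "http_post"]
new_variant = ["get_subscriber_id", "socket_write"]

print(api_features(old_variant), api_features(new_variant))
print(cluster_features(old_variant), cluster_features(new_variant))
```

With per-API features the two variants share no active feature, so a model trained on the old variant has nothing to match in the new one. With group-level features the two vectors coincide, which is the intuition behind capturing malicious semantics in the feature space.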