Malware Detection with Limited Supervised Information Via Contrastive Learning on API Call Sequences

Mohan Gao,Peng Wu,Li Pan
DOI: https://doi.org/10.1007/978-3-031-15777-6_27
2022-01-01
Abstract:Malware is a software capable of causing damage to computer systems. Conventional malware detection methods either require feature engineering to extract specific features or require a large amount of labeled data to train an end-to-end deep learning model. Both feature engineering and labelling are laborious. In this paper, we propose a semi-supervised contrastive learning malware detection method based on API call sequences with limited label information, called SCLMD. Specifically, a heterogeneous graph is constructed from API behavior to express the rich relationships among labeled and unlabeled software. After extracting the structural and sequential features of software by two encoders, we adopt the cross-view contrastive learning to obtain the shared and consistent feature of software. A hybrid positive selection strategy is designed to select positive pairs for contrastive learning by the guidance of the limited label information. Experimental results on two real world datasets show that the SCLMD outperforms the baseline methods, especially when the supervised information is limited.
What problem does this paper attempt to address?