DetectBERT: Towards Full App-Level Representation Learning to Detect Android Malware
Tiezhu Sun, Nadia Daoudi, Kisub Kim, Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein
2024-08-29
Abstract: Recent advancements in ML and DL have significantly improved Android malware detection, yet many methodologies still rely on basic static analysis, bytecode, or function call graphs that often fail to capture complex malicious behaviors. DexBERT, a pre-trained BERT-like model tailored for Android representation learning, enriches class-level representations by analyzing Smali code extracted from APKs. However, its functionality is constrained by its inability to process multiple Smali classes simultaneously. This paper introduces DetectBERT, which integrates correlated Multiple Instance Learning (c-MIL) with DexBERT to handle the high dimensionality and variability of Android malware, enabling effective app-level detection. By treating class-level features as instances within MIL bags, DetectBERT aggregates these into a comprehensive app-level representation. Our evaluation demonstrates that DetectBERT not only surpasses existing state-of-the-art detection methods but also adapts to evolving malware threats. Moreover, the versatility of the DetectBERT framework holds promising potential for broader applications in app-level analysis and other software engineering tasks, offering new avenues for research and development.
Software Engineering, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses two problems. First, existing Android malware detection methods, especially those that rely on basic static analysis, bytecode, or function call graphs, often fail to capture complex malicious behaviors. Second, although DexBERT learns rich class-level representations from Smali code, it cannot process multiple Smali classes simultaneously, so on its own it does not produce an app-level representation.
To address these problems, the paper introduces DetectBERT, which combines correlated Multiple Instance Learning (c-MIL) with DexBERT to handle the high dimensionality and variability of Android malware and enable effective app-level detection. Specifically, DetectBERT treats the class-level features of an app as instances in a MIL bag and aggregates them into a single, comprehensive app-level representation, which is then used for detection.
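The bag-and-aggregate idea can be illustrated with a minimal sketch. It uses plain attention-weighted pooling, a simpler stand-in that does not model correlations among instances the way c-MIL does, and it is not the authors' implementation. The function name `attention_pool`, the toy dimensions, and the random vectors standing in for DexBERT class embeddings are all assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(bag, v, w):
    """Aggregate a bag of class-level vectors into one app-level vector.

    bag: list of instance embeddings, each a list of d floats
    v:   hidden projection, h rows of d floats (hypothetical learned weights)
    w:   scoring vector of h floats (hypothetical learned weights)
    """
    scores = []
    for inst in bag:
        hidden = [math.tanh(sum(a * b for a, b in zip(row, inst))) for row in v]
        scores.append(sum(a * b for a, b in zip(w, hidden)))
    weights = softmax(scores)  # one attention weight per class
    dim = len(bag[0])
    app_vector = [sum(a * inst[k] for a, inst in zip(weights, bag))
                  for k in range(dim)]
    return app_vector, weights

# Toy data: 5 "classes" with 8-dimensional embeddings (random placeholders,
# not real DexBERT outputs).
rng = random.Random(0)
dim, hidden = 8, 4
bag = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
v = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden)]
w = [rng.gauss(0, 1) for _ in range(hidden)]

app_vector, weights = attention_pool(bag, v, w)
print(len(app_vector), [round(a, 3) for a in weights])
```

Because the attention weights sum to one, the output has the same dimensionality regardless of how many classes the app contains, which is what lets a fixed-size classifier consume apps of very different sizes.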
The main contributions of the paper are:
1. **Proposing DetectBERT**: An effective method that uses MIL to extend DexBERT to full app-level representation learning.
2. **Comprehensive evaluation**: DetectBERT is shown to outperform both basic feature aggregation methods and existing state-of-the-art malware detection techniques.
3. **Time-consistency evaluation**: The evaluation verifies that DetectBERT remains robust and effective when faced with newly emerging malware samples.
4. **Potential outlook**: The paper highlights the broad potential of DetectBERT for software engineering tasks, especially the use of MIL to handle large-scale data efficiently.
According to the paper, these contributions allow DetectBERT to cope more effectively with ever-evolving Android malware threats and to detect malware more accurately and reliably.
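A time-consistency evaluation is commonly set up by training on older samples and testing on strictly newer ones, so that the model never sees the future. The sketch below shows such a split under that assumption; the summary does not describe the paper's actual protocol, and the tuple layout, app names, dates, and cutoff are hypothetical.

```python
from datetime import date

def temporal_split(samples, cutoff):
    """Split (app_id, first_seen, label) tuples into older train / newer test."""
    train = [s for s in samples if s[1] < cutoff]
    test = [s for s in samples if s[1] >= cutoff]
    return train, test

# Hypothetical samples: label 1 = malware, 0 = benign.
samples = [
    ("app_a", date(2019, 3, 1), 0),
    ("app_b", date(2019, 11, 20), 1),
    ("app_c", date(2020, 2, 5), 1),
    ("app_d", date(2020, 7, 14), 0),
]

train, test = temporal_split(samples, date(2020, 1, 1))
print([s[0] for s in train], [s[0] for s in test])
```

A random split would instead mix old and new samples in both sets, which can overstate how well a detector handles malware that appears after training.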