Android Malware Detection with Unbiased Confidence Guarantees

Harris Papadopoulos, Nestoras Georgiou, Charalambos Eliades, Andreas Konstantinidis
DOI: https://doi.org/10.1016/j.neucom.2017.08.072
2023-12-17
Abstract: The impressive growth of smartphone devices, in combination with the rising ubiquity of using mobile platforms for sensitive applications such as Internet banking, has triggered a rapid increase in mobile malware. In recent literature, many studies examine Machine Learning techniques as the most promising approach for mobile malware detection, without however quantifying the uncertainty involved in their detections. In this paper, we address this problem by proposing a machine learning dynamic analysis approach that provides provably valid confidence guarantees in each malware detection. Moreover, the particular guarantees hold for both the malicious and benign classes independently and are unaffected by any bias in the data. The proposed approach is based on a novel machine learning framework, called Conformal Prediction, combined with a random forests classifier. We examine its performance on a large-scale dataset collected by installing 1866 malicious and 4816 benign applications on a real Android device. We make this collection of dynamic analysis data available to the research community. The obtained experimental results demonstrate the empirical validity, usefulness and unbiased nature of the outputs produced by the proposed approach.
Cryptography and Security, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses an important gap in mobile malware detection: **detection results are not accompanied by a quantified confidence**. Although many existing studies use machine-learning techniques to detect malware on the Android platform, they do not quantify the uncertainty of their detections.

#### Main problems:

1. **Quantification of uncertainty**:
   - Existing malware detection methods do not provide a reliable confidence guarantee for each detection result.
   - This makes it difficult for users to assess how far a given result can be trusted, which affects their decisions (for example, whether to delete an application).
2. **Class imbalance problem**:
   - Malware detection data is usually heavily imbalanced: there are far more benign applications than malicious ones.
   - As a result, existing methods tend to be biased towards predicting the benign class, which reduces the accuracy of malware detection.

#### Solutions:

To address these problems, the authors propose a dynamic analysis method based on the **Conformal Prediction (CP)** framework, combined with a Random Forests classifier, that provides **unbiased confidence guarantees**. Specific improvements include:

- **Label-conditional Mondrian Conformal Prediction (LCMCP)**: Ensures that the confidence guarantee holds for the malicious and benign classes separately, unaffected by bias in the data.
- **Inductive Conformal Prediction (ICP)**: Reduces the computational cost, making the approach suitable for resource-constrained environments such as mobile devices.
- **Verification on a large-scale real dataset**: Evaluates the effectiveness and practicality of the proposed method on data collected from a real Android device, covering 6,682 applications (1,866 malicious and 4,816 benign).

#### Goals:

1. **Provide stronger within-class confidence guarantees**: Ensure that valid confidence guarantees are provided for malicious and benign instances separately, avoiding the bias caused by class imbalance.
2. **Evaluate performance**: Evaluate the proposed method on a large-scale real-world dataset and compare it with the conventional Random Forests classifier.

Through these improvements, the authors aim to give users a more reliable malware detection tool, so that they can base their decisions on the confidence attached to each detection result.
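
To make the label-conditional idea above more concrete, here is a minimal sketch of how per-class p-values could be computed in an inductive, label-conditional conformal predictor. It is not the authors' implementation. The function names (`label_conditional_p_values`, `prediction_set`), the toy scores, and the choice of nonconformity score (one minus the classifier's estimated probability for a label, which in the paper's setting would come from a random forest) are all assumptions made for illustration.

```python
from collections import defaultdict


def label_conditional_p_values(cal_scores, cal_labels, test_scores):
    """Compute one p-value per candidate label for a single test example.

    cal_scores  -- nonconformity scores of the calibration examples,
                   each computed with the example's true label
    cal_labels  -- true labels of the calibration examples
    test_scores -- dict mapping each candidate label to the test example's
                   nonconformity score under that label

    The test example is compared only with calibration examples of the
    same class, which is what makes the guarantee hold per class.
    """
    scores_by_label = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        scores_by_label[label].append(score)

    p_values = {}
    for label, test_score in test_scores.items():
        same_class = scores_by_label[label]
        # Count calibration examples at least as "strange" as the test example.
        at_least_as_strange = sum(1 for s in same_class if s >= test_score)
        # The +1 terms account for the test example itself.
        p_values[label] = (at_least_as_strange + 1) / (len(same_class) + 1)
    return p_values


def prediction_set(p_values, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical calibration data: 6 benign and 3 malicious applications.
# Assumed score: 1 - (estimated probability of the true label).
cal_scores = [0.10, 0.30, 0.20, 0.40, 0.05, 0.15, 0.60, 0.70, 0.25]
cal_labels = ["benign"] * 6 + ["malicious"] * 3

# Hypothetical test app for which the classifier estimates P(malicious) = 0.8.
test_scores = {"malicious": 0.2, "benign": 0.8}

p_values = label_conditional_p_values(cal_scores, cal_labels, test_scores)
print(p_values)
print(prediction_set(p_values, significance=0.2))
```

Because each p-value is calibrated against examples of its own class only, a large benign majority in the calibration set does not dilute the p-value for the malicious label. Lowering the significance level gives a higher confidence level but may yield prediction sets containing both labels.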