On Defending Against Label Flipping Attacks on Malware Detection Systems

Rahim Taheri, Reza Javidan, Mohammad Shojafar, Zahra Pooranian, Ali Miri, Mauro Conti
DOI: https://doi.org/10.48550/arXiv.1908.04473
2020-06-16
Abstract: Label manipulation attacks are a subclass of data poisoning attacks in adversarial machine learning used against different applications, such as malware detection. These types of attacks represent a serious threat to detection systems in environments with a high noise rate or uncertainty, such as complex networks and the Internet of Things (IoT). Recent work in the literature has suggested using the $K$-Nearest Neighbors (KNN) algorithm to defend against such attacks. However, such an approach can suffer from low detection accuracy. In this paper, we design an architecture to tackle the Android malware detection problem in IoT systems. We develop an attack mechanism based on the Silhouette clustering method, modified for mobile Android platforms. We propose two Convolutional Neural Network (CNN)-type deep learning algorithms against this \emph{Silhouette Clustering-based Label Flipping Attack (SCLFA)}. We show the effectiveness of these two defense algorithms, \emph{Label-based Semi-supervised Defense (LSD)} and \emph{Clustering-based Semi-supervised Defense (CSD)}, in correcting labels that have been attacked. We evaluate the performance of the proposed algorithms by varying several machine learning parameters on three Android datasets (Drebin, Contagio, and Genome) and three types of features (API, intent, and permission). Our evaluation shows that using random forest feature selection and varying ratios of features can result in an improvement of up to 19\% in accuracy when compared with the state-of-the-art method in the literature.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper proposes defense methods against label flipping attacks on malware detection systems. Label flipping attacks are a subclass of data poisoning attacks: the attacker tampers with the labels in the training data to mislead machine-learning models and degrade their performance. Such attacks are particularly severe in complex networks and Internet of Things (IoT) systems, where noise rates and uncertainty are high.

### Main Problem Description in the Paper

1. **The Hazards of Label Flipping Attacks**: By changing the labels of the training data, label flipping attacks can significantly reduce the classification performance of machine-learning models, even if the attacker's other capabilities are limited.
2. **The Deficiencies of Existing Defense Methods**: Existing defenses such as the KNN algorithm can be used to relabel samples, but they are not effective against label flipping attacks, especially on complex data sets.
3. **The Vulnerability of Deep-Learning Models**: Although deep neural networks (DNNs) perform well in classification tasks, they are very sensitive to label flipping attacks, which reduce their accuracy.

### Goals of the Paper

To address the above problems, the paper sets the following goals:

- Design an architecture that learns from flipped-label data and improves the robustness of the malware detection system.
- Propose a label flipping attack method based on Silhouette Clustering to evaluate the vulnerability of existing systems.
- Introduce two semi-supervised defense methods based on deep learning (LSD and CSD) to correct the attacked labels and improve classification accuracy.

### Main Contributions

1. **Proposed a New Attack Model**: The Silhouette Clustering-based Label Flipping Attack (SCLFA), which selects appropriate samples for label flipping to deceive classification algorithms.
2. **Developed Two Defense Algorithms**:
   - **Label-based Semi-supervised Defense (LSD)**: Combines the Label Propagation (LP) and Label Spreading (LS) algorithms to predict and correct the flipped labels.
   - **Clustering-based Semi-supervised Defense (CSD)**: A semi-supervised defense method based on clustering algorithms, which uses four clustering metrics and validation data to relabel the contaminated labels.
3. **Experimental Verification**: Experiments were carried out on three real Android data sets (Drebin, Contagio, Genome). The results show that the proposed defense methods improve accuracy by up to 19% compared with the existing methods.

### Conclusion

By proposing new attack and defense methods, this paper addresses the problem of label flipping attacks in malware detection systems and improves the robustness and accuracy of the system.
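
As a rough illustration of how silhouette values could drive sample selection in an attack like SCLFA, the sketch below computes a per-sample silhouette value using the class labels as cluster ids, then flips the labels of samples that score below a threshold. The function names, the threshold of 0, the 0/1 label encoding (0 = benign, 1 = malware), and the toy 2-D feature vectors are all assumptions made for illustration; this is not the paper's implementation.

```python
import math


def silhouette_values(points, labels):
    """Per-sample silhouette values, treating class labels as cluster ids."""
    values = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Group distances from sample i to every other sample by that sample's label.
        by_label = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_label.setdefault(other, []).append(math.dist(p, q))
        own = by_label.pop(lab, [])
        if not own or not by_label:
            # Convention: 0 for a singleton class or when only one class exists.
            values.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the sample's own class
        b = min(sum(d) / len(d) for d in by_label.values())  # nearest other class
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def flip_low_silhouette(points, labels, threshold=0.0):
    """Flip binary labels of samples whose silhouette value is below threshold.

    Assumption: a low value marks a sample that sits closer to the other
    class, which makes it a candidate for flipping.
    """
    scores = silhouette_values(points, labels)
    return [1 - lab if s < threshold else lab for lab, s in zip(labels, scores)]


# Toy data (hypothetical): three benign samples near the origin, three malware
# samples near (5, 5), and one benign-labeled sample that lies among the malware.
points = [(0, 0), (0, 1), (1, 0), (5, 4), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 0, 1, 1, 1]

poisoned = flip_low_silhouette(points, labels)
print(poisoned)
```

In this toy data only the fourth sample has a negative silhouette value, so it is the only label that changes. A defense such as LSD or CSD would then try to recover the original label from the poisoned training set.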