Naive Bayes Classifiers over Missing Data: Decision and Poisoning

Song Bian, Xiating Ouyang, Zhiwei Fan, Paraschos Koutris
2024-05-28
Abstract: We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Bayes Classifiers (NBC) over dirty datasets with missing values: (i) there exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset; and (ii) the data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells to the clean dataset, is in polynomial time for single test points but NP-complete for multiple test points. Extensive experiments demonstrate that our algorithms are efficient and outperform existing baselines.
Machine Learning, Databases
What problem does this paper attempt to address?
The paper addresses the certifiable robustness of machine-learning classifiers trained on dirty datasets that contain missing values. Specifically, the researchers study two problems:

1. **Decision problem**: Given an incomplete dataset and a set of test points, how can we efficiently determine whether those test points are certifiably robust for the Naive Bayes Classifier (NBC)? That is, does the classifier return the same prediction for each test point in every possible world (i.e., every cleaned version of the incomplete dataset)?
2. **Data poisoning problem**: Given a clean dataset and multiple test points, what is the minimum number of cell modifications (i.e., setting the values of some cells to NULL) needed to make all the given test points certifiably non-robust? This is an attack model that evaluates how vulnerable the dataset is to malicious attacks.

The main contributions of the paper include:

- An algorithm with time complexity \(O(md+nd)\) that solves the decision problem for a single test point, where \(n\) is the number of data points in the dataset, \(m\) is the number of labels, and \(d\) is the number of features.
- For the data poisoning problem with a single test point, an algorithm with time complexity \(O(nmd)\); for multiple test points, a proof that the problem is NP-complete, together with an efficient heuristic algorithm.
- Experiments on ten real-world datasets showing that the proposed algorithms are more efficient than, and outperform, existing baseline methods.

Solving these problems helps reduce the cost of data cleaning and improves the reliability and robustness of machine-learning models trained on datasets with missing values.
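To make the definition of certifiable robustness concrete, here is a minimal Python sketch that checks it by brute force: it enumerates every possible world of a tiny dataset and compares the NBC predictions. This is not the paper's \(O(md+nd)\) algorithm; the enumeration is exponential in the number of missing cells and only illustrates the definition. The function names, the toy dataset, the use of unsmoothed categorical likelihoods, and the tie-breaking rule (smallest label wins) are all assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def nbc_predict(rows, labels, test_point):
    """Categorical Naive Bayes without smoothing (an assumption of this sketch).

    Scores are exact fractions; ties go to the smallest label.
    """
    n = len(rows)
    label_counts = Counter(labels)
    best_label, best_score = None, Fraction(-1)
    for label in sorted(label_counts):
        count = label_counts[label]
        score = Fraction(count, n)  # prior P(label)
        for j, value in enumerate(test_point):
            matches = sum(
                1 for row, y in zip(rows, labels) if y == label and row[j] == value
            )
            score *= Fraction(matches, count)  # likelihood P(x_j = value | label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def possible_worlds(rows, domains):
    """Yield every completion of `rows`; None marks a missing (NULL) cell."""
    holes = [(i, j) for i, row in enumerate(rows) for j, v in enumerate(row) if v is None]
    for fill in product(*(domains[j] for _, j in holes)):
        world = [list(row) for row in rows]
        for (i, j), value in zip(holes, fill):
            world[i][j] = value
        yield world


def is_certifiably_robust(rows, labels, domains, test_point):
    """True if NBC predicts the same label for `test_point` in every possible world."""
    predictions = {
        nbc_predict(world, labels, test_point) for world in possible_worlds(rows, domains)
    }
    return len(predictions) == 1


# Hypothetical toy dataset: two categorical features, one missing cell.
domains = [("a", "b"), ("x", "y")]
labels = [0, 0, 1, 1, 1]
rows = [(None, "x"), ("a", "x"), ("b", "y"), ("a", "x"), ("a", "y")]

print(is_certifiably_robust(rows, labels, domains, ("a", "x")))  # True
print(is_certifiably_robust(rows, labels, domains, ("b", "x")))  # False

# Poisoning view: in the clean dataset the point ("b", "x") is trivially robust,
# and setting a single cell to NULL (as in `rows` above) makes it non-robust.
clean_rows = [("a", "x")] + rows[1:]
print(is_certifiably_robust(clean_rows, labels, domains, ("b", "x")))  # True
```

In this toy case the test point `("b", "x")` is predicted as label 1 when the missing cell is filled with `"a"` and as label 0 when it is filled with `"b"`, so it is not certifiably robust. The last check mirrors the poisoning problem for a single test point with a budget of one NULL cell.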