Abstract: We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Bayes Classifiers (NBC) over dirty datasets with missing values: (i) there exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset; and (ii) the data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells to the clean dataset, is in polynomial time for single test points but NP-complete for multiple test points. Extensive experiments demonstrate that our algorithms are efficient and outperform existing baselines.

What problem does this paper attempt to address?

This paper addresses the certifiable robustness of machine-learning classifiers trained on "dirty" datasets that contain missing values. Specifically, the researchers study two problems:

1. **Decision problem**: Given an incomplete dataset and a set of test points, how can we efficiently determine whether these test points are certifiably robust for the Naive Bayes Classifier (NBC)? That is, does the classifier return the same prediction for each test point in every possible world (i.e., every cleaned version) of the incomplete dataset?
2. **Data poisoning problem**: Given a clean dataset and multiple test points, what is the minimum number of cell modifications (i.e., setting the values of some cells to NULL) needed to make all the given test points certifiably non-robust? This is an attack model, used to evaluate how vulnerable a dataset is to malicious manipulation.

The main contributions of the paper include:

- An algorithm with time complexity \(O(md+nd)\) for the decision problem on a single test point, where \(n\) is the number of data points in the dataset, \(m\) is the number of labels, and \(d\) is the number of features.
- For the data poisoning problem, an algorithm with time complexity \(O(nmd)\) for a single test point; for multiple test points, a proof that the problem is NP-complete, together with an efficient heuristic algorithm.
- Experiments on ten real-world datasets showing that the proposed algorithms outperform existing baseline methods in both efficiency and effectiveness.

Solving these problems helps reduce the cost of data cleaning and improves the reliability and robustness of machine-learning models trained on datasets with missing values.
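
The definition of certifiable robustness used in the decision problem can be illustrated by brute force on a tiny dataset. The sketch below is not the paper's \(O(md+nd)\) algorithm: it enumerates every possible world, so it is exponential in the number of missing cells. The function names, the per-feature `domains` argument, the unsmoothed count-based NBC, and the tie-breaking rule (first label in sorted order wins) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def nbc_predict(rows, labels, test_point):
    """Unsmoothed Naive Bayes: pick the label maximizing prior * product of likelihoods.

    Assumption: ties are broken in favor of the first label in sorted order.
    """
    n = len(rows)
    label_counts = Counter(labels)
    best_label, best_score = None, -1.0
    for label in sorted(label_counts):
        count = label_counts[label]
        score = count / n  # prior P(label)
        for j, value in enumerate(test_point):
            matches = sum(
                1 for row, y in zip(rows, labels) if y == label and row[j] == value
            )
            score *= matches / count  # likelihood P(feature j = value | label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def possible_worlds(rows, domains):
    """Yield every completion of `rows`, filling each None cell from domains[j]."""
    holes = [(i, j) for i, row in enumerate(rows) for j, v in enumerate(row) if v is None]
    for fill in product(*(domains[j] for _, j in holes)):
        world = [list(row) for row in rows]
        for (i, j), v in zip(holes, fill):
            world[i][j] = v
        yield world


def is_certifiably_robust(rows, labels, domains, test_point):
    """True if the prediction for test_point is identical in every possible world."""
    predictions = {
        nbc_predict(world, labels, test_point) for world in possible_worlds(rows, domains)
    }
    return len(predictions) == 1


# Hypothetical toy data: two categorical features, one missing cell (None).
rows = [("a", None), ("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
labels = [0, 0, 1, 1, 1]
domains = [["a", "b"], ["x", "y"]]

print(is_certifiably_robust(rows, labels, domains, ("a", "x")))
print(is_certifiably_robust(rows, labels, domains, ("b", "x")))
```

On this toy data, filling the missing cell with `x` or with `y` flips the prediction for `("a", "x")`, so that point is not certifiably robust, while `("b", "x")` receives the same label in both worlds. In the poisoning setting the roles are reversed: an attacker chooses which cells to set to NULL so that a check like this fails.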

Naive Bayes Classifiers over Missing Data: Decision and Poisoning

Noise is the Fatal Poison: A Noise-aware Network for Noisy Dataset Classification

Systematic Testing of the Data-Poisoning Robustness of KNN

A Framework of Randomized Selection Based Certified Defenses Against Data Poisoning Attacks

Certified Defenses for Data Poisoning Attacks

Outlier-Oriented Poisoning Attack: A Grey-box Approach to Disturb Decision Boundaries by Perturbing Outliers in Multiclass Learning

Robust Bayesian Classification with Incomplete Data

Stronger Data Poisoning Attacks Break Data Sanitization Defenses

How to Sift Out a Clean Data Subset in the Presence of Data Poisoning?

On the Relevance of Byzantine Robust Optimization Against Data Poisoning

Reducing Certified Regression to Certified Classification for General Poisoning Attacks

Deep k-NN Defense Against Clean-Label Data Poisoning Attacks

Deep Probabilistic Models to Detect Data Poisoning Attacks

Certified Robustness to Data Poisoning in Gradient-Based Training

Classification Learning From Private Data In Heterogeneous Settings

Mixing Classifiers to Alleviate the Accuracy-Robustness Trade-Off

Poisoning Network Flow Classifiers

Towards Fair Classification against Poisoning Attacks

Pick your Poison: Undetectability versus Robustness in Data Poisoning Attacks

On Collective Robustness of Bagging Against Data Poisoning

What Distributions are Robust to Indiscriminate Poisoning Attacks for Linear Learners?