An Empirical Study of Accuracy, Fairness, Explainability, Distributional Robustness, and Adversarial Robustness

Moninder Singh, Gevorg Ghalachyan, Kush R. Varshney, Reginald E. Bryant
DOI: https://doi.org/10.48550/arXiv.2109.14653
2021-09-30
Abstract: To ensure trust in AI models, it is becoming increasingly apparent that evaluation of models must be extended beyond traditional performance metrics, like accuracy, to other dimensions, such as fairness, explainability, adversarial robustness, and distribution shift. We describe an empirical study to evaluate multiple model types on various metrics along these dimensions on several datasets. Our results show that no particular model type performs well on all dimensions, and demonstrate the kinds of trade-offs involved in selecting models evaluated along multiple dimensions.
Machine Learning, Computers and Society
What problem does this paper attempt to address?
The paper asks how machine-learning models should be evaluated along multiple dimensions, not only accuracy, so that they can be considered trustworthy and reliable. Specifically, the authors focus on the following aspects:

1. **Accuracy**: Model performance is traditionally measured by prediction accuracy, but this metric alone is not sufficient to assess a model's quality.
2. **Fairness**: Whether the model treats different groups (defined by attributes such as race or gender) fairly, avoiding bias and discrimination.
3. **Explainability**: Whether the model's predictions can be explained and understood, which is crucial for high-risk applications such as medicine and finance.
4. **Distributional Robustness**: How the model performs when the data distribution changes, for example when it is applied to data from a different region than the one it was trained on.
5. **Adversarial Robustness**: How the model holds up under adversarial attacks, in which small perturbations of the input lead to misclassification.

### Research Background and Motivation

As machine-learning models are increasingly applied in high-risk fields (such as finance, medicine, employment, and criminal justice), traditional performance metrics such as accuracy are no longer sufficient to ensure that these models are safe and credible. To make models more trustworthy, other dimensions must also be considered: fairness, explainability, distributional robustness, and adversarial robustness. However, it is still unclear how these criteria interact. Which criteria trade off against each other? Which can be improved simultaneously? How do different types of models perform on these dimensions? What is the impact of mitigation or defense techniques on these criteria?
Are there consistent patterns among datasets in different application fields?

### Research Methods

To answer these questions, the authors conducted an empirical study that evaluates multiple model types on multiple datasets using multiple metrics. Specifically:

- **Accuracy**: Accuracy and balanced accuracy measure predictive performance.

  \[ \text{Accuracy}=\frac{TP + TN}{TP + FP + TN + FN} \]

  \[ \text{Balanced Accuracy}=\frac{1}{2}\left(\frac{TP}{TP + FN}+\frac{TN}{TN + FP}\right) \]

- **Fairness**: Metrics such as Disparate Impact measure the fairness of the model.

  \[ \text{Disparate Impact}=\frac{P(Y = 1|D=\text{unprivileged})}{P(Y = 1|D=\text{privileged})} \]

- **Explainability**: LIME generates local explanations, and the faithfulness of those explanations is calculated.
- **Adversarial Robustness**: The HopSkipJump algorithm generates adversarial samples, from which the Empirical Robustness is calculated.
- **Distributional Robustness**: Shifted datasets are created to evaluate model performance under distribution changes.

### Datasets

The authors selected eight datasets covering multiple fields: housing loans, poverty prediction, income prediction, bank marketing, African financial inclusion, medical expenditure, German credit scoring, and heart disease prediction.

### Experimental Design

Four classification algorithms were used in the experiment: Gradient Boosting Classifier (GBC), Random Forest (RF), Logistic Regression (LR), and Multi-Layer Perceptron (MLP). Five-fold cross-validation was performed on each dataset, and the following steps were carried out in each split:

1. Build four models using the training set and evaluate them on the test set.
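
The accuracy and fairness metrics named under Research Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, labels are assumed to be binary with 1 as the positive (favorable) class, and Disparate Impact is computed from predicted labels, which is an assumption about how Y is obtained. Balanced accuracy is taken as the mean of the true-positive rate TP / (TP + FN) and the true-negative rate TN / (TN + FP), which is the standard definition.

```python
from typing import Hashable, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return 0.5 * (tpr + tnr)


def disparate_impact(
    y_pred: Sequence[int],
    group: Sequence[Hashable],
    unprivileged: Hashable,
    privileged: Hashable,
) -> float:
    """Ratio of favorable-outcome rates: unprivileged group over privileged group.

    Assumes both groups are non-empty and the privileged group receives
    at least one favorable prediction (otherwise the ratio is undefined).
    """

    def favorable_rate(g: Hashable) -> float:
        members = [p for p, d in zip(y_pred, group) if d == g]
        return sum(members) / len(members)

    return favorable_rate(unprivileged) / favorable_rate(privileged)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
    group = ["p", "p", "p", "p", "u", "u", "u", "u"]

    counts = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(*counts))
    print("balanced accuracy:", balanced_accuracy(*counts))
    print("disparate impact:", disparate_impact(y_pred, group, "u", "p"))
```

A Disparate Impact of 1 means both groups receive the favorable outcome at the same rate; values below 1 indicate that the unprivileged group receives it less often.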