Abstract: Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Unfortunately, the evaluation protocols typically used to assess classifiers fail to evaluate performance comprehensively, as they tend to rely on limited types of test data and ignore others. For example, using the standard test data fails to evaluate the predictions made by the classifier for samples from classes it was not trained on. On the other hand, testing with data containing samples from unknown classes fails to evaluate how well the classifier can predict the labels for known classes. This article advocates benchmarking performance using a wide range of different types of data and using a single metric that can be applied to all such data types to produce a consistent evaluation of performance. Using such a benchmark, it is found that current deep neural networks, including those trained with methods that are believed to produce state-of-the-art robustness, are extremely vulnerable to making mistakes on certain types of data. This means that such models will be unreliable in real-world scenarios where they may encounter data from many different domains, and that they are insecure, as they can easily be fooled into making the wrong decisions. It is hoped that these results will motivate the wider adoption of more comprehensive testing methods that will, in turn, lead to the development of more robust machine learning methods in the future. Code is available at: https://codeberg.org/mwspratling/RobustnessEvaluation

What problem does this paper attempt to address?

This paper addresses the limitations of current evaluation methods for deep-learning image classifiers. Specifically, the author points out the following issues:

1. **Incomplete evaluation methods**: Existing evaluation protocols usually rely on a limited range of test data types and ignore others. For example:
   - Standard test data cannot evaluate how the classifier handles samples from classes it was not trained on.
   - Data containing unknown classes cannot evaluate how well the classifier predicts labels for known classes.
2. **Lack of a consistent evaluation metric**: Current evaluation methods have no unified metric for measuring performance across different types of test data, which makes results less consistent and less reliable.
3. **Model unreliability in the real world**: Existing deep neural networks, including those considered to have state-of-the-art robustness, are still very error-prone on certain types of data. This makes them unreliable when they encounter data from many different domains, and easy to fool into making wrong decisions.
4. **Trade-offs between types of robustness**: Improving performance on one type of data can cost performance on another. For example, maximizing accuracy on clean data may lead to poor adversarial robustness, and increasing adversarial robustness tends to reduce accuracy on clean data. Similarly, many adversarial defense methods do not improve robustness to image corruption, and vice versa.

To address these problems, the author proposes a new evaluation benchmark that tests classifiers on a wide range of different data types and uses a single metric to measure performance on all of them, giving consistent and reliable results. This approach helps expose the weaknesses of existing models on different data types and encourages the development of more robust machine-learning methods.

### Main improvement points

1. **Performance evaluation in terms of accuracy rather than error rate**: The Detection Error Rate (DER) proposed by Zhu et al. (2024) is changed to a Detection Accuracy Rate (DAR), so that a higher score corresponds to better performance. This is more intuitive and consistent with previously used metrics.
2. **Result aggregation method**: Results are averaged by task rather than by dataset, so that adding more datasets for one type of robustness does not bias the evaluation towards that type. Specifically, a single metric value is calculated over all the data from all the datasets, to keep the evaluation fair and representative.

Through these improvements, the author hopes to encourage wider use of more comprehensive and rigorous evaluation methods, and thereby the development of more robust machine-learning methods in the future.
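
The two points above can be illustrated with a minimal Python sketch. It is not the author's implementation (that is in the linked repository), and this summary does not give the exact definition of DAR. The sketch assumes that a known-class sample counts as correct when it is accepted (confidence at or above a threshold) and classified correctly, and that an unknown-class sample counts as correct when it is rejected. The function names, the `unknown=-1` label convention, and the scores are all hypothetical.

```python
import numpy as np


def detection_accuracy_rate(pred, label, conf, threshold, unknown=-1):
    """Fraction of samples handled correctly at a confidence threshold.

    Assumed rule (not taken from the paper):
      - known-class sample: correct if accepted AND predicted label matches
      - unknown-class sample (label == unknown): correct if rejected
    """
    pred, label, conf = map(np.asarray, (pred, label, conf))
    accepted = conf >= threshold
    is_unknown = label == unknown
    correct = np.where(is_unknown, ~accepted, accepted & (pred == label))
    return float(correct.mean())


def average_by_dataset(scores_by_task):
    """Mean over every dataset score; tasks with more datasets weigh more."""
    all_scores = [s for scores in scores_by_task.values() for s in scores]
    return float(np.mean(all_scores))


def average_by_task(scores_by_task):
    """Mean of per-task means; each task weighs the same."""
    return float(np.mean([np.mean(s) for s in scores_by_task.values()]))


# Hypothetical per-dataset scores, grouped by task.
scores = {
    "clean": [0.9],
    "corruptions": [0.6, 0.5, 0.7],
    "unknown_classes": [0.2],
}

dar = detection_accuracy_rate(
    pred=[0, 1, 2, 1],
    label=[0, 1, 1, -1],
    conf=[0.9, 0.4, 0.8, 0.3],
    threshold=0.5,
)
by_dataset = average_by_dataset(scores)
by_task = average_by_task(scores)
```

In this toy data the corruption task has three datasets, so averaging by dataset gives it three times the weight of each other task, whereas averaging by task weights all three tasks equally. Averaging per-dataset scores within a task is only one way to aggregate; pooling all samples of a task before computing the metric would be another.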

A Comprehensive Assessment Benchmark for Rigorously Evaluating Deep Learning Image Classifiers
