Abstract: Machine learning has demonstrated remarkable performance over finite datasets, yet whether the scores over the fixed benchmarks can sufficiently indicate the model's performance in the real world is still in discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., the human users), thus a good evaluation protocol is probably to evaluate the models' behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle (i.e., a foundation model). Besides, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same image-label structure the original test image represents, constrained by a foundation model pretrained with a large amount of samples. As a result, our new method will offer us a new way to evaluate the models' robustness performance, free of limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the model and our new evaluation strategies.
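The sample-extension idea in the abstract (perturb a test image until it differs from the original, but only while a foundation model still assigns the original label) can be illustrated with a toy sketch. This is not the authors' procedure: it swaps in a plain random walk over pixel values, and `perturb`, `generate_constrained_sample`, `toy_oracle`, and all parameters are hypothetical names chosen for illustration. The oracle here is a stand-in function, not a real foundation model.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical types: an "image" is a flat list of pixel values in [0, 1].
Image = List[float]
Classifier = Callable[[Sequence[float]], int]


def perturb(image: Sequence[float], step: float, rng: random.Random) -> Image:
    """Add uniform noise in [-step, step] to every pixel, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.uniform(-step, step))) for p in image]


def generate_constrained_sample(
    image: Sequence[float],
    label: int,
    oracle: Classifier,
    step: float = 0.05,
    max_iters: int = 50,
    seed: int = 0,
) -> Optional[Image]:
    """Random-walk away from `image`, accepting a step only if the oracle
    still predicts `label` for the perturbed candidate.

    Assumption: a random walk replaces whatever perturbation scheme the
    paper actually uses; only the oracle constraint mirrors the abstract.
    """
    rng = random.Random(seed)
    if oracle(image) != label:
        return None  # the oracle already disagrees with the label; skip this sample
    current = list(image)
    for _ in range(max_iters):
        candidate = perturb(current, step, rng)
        if oracle(candidate) == label:
            current = candidate  # accepted: still inside the oracle's label region
    return current


def toy_oracle(image: Sequence[float]) -> int:
    """Stand-in oracle: class 1 if the mean brightness exceeds 0.5, else 0."""
    return int(sum(image) / len(image) > 0.5)


original = [0.9, 0.8, 0.7, 0.95]
new_sample = generate_constrained_sample(original, label=1, oracle=toy_oracle)
```

Because every accepted step is checked against the oracle, the returned sample keeps the original label under the oracle by construction, while drifting away from the original pixels. The quality of such samples is bounded by the oracle, which matches the abstract's caveat that the evaluation is "scoped by the power of the oracle".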

What problem does this paper attempt to address?

This paper addresses the gap between the benchmark performance of existing image classification models and their performance in practical applications. The problem shows up in two ways:

1. **Limitations of fixed benchmark datasets**: Although existing machine-learning models achieve high accuracy on specific tasks, their real-world performance often falls short of expectations. The main reason is that fixed benchmark datasets (such as ImageNet) cannot fully represent the diversity of samples a model encounters after deployment (Recht et al., 2019; Wu et al., 2023). Their test samples are usually drawn independently and identically distributed (i.i.d.) with the training data and cannot cover all possible variations.
2. **Insufficient robustness evaluation**: Current robustness evaluation methods fall into two categories:
   - **Perturbation-based evaluation**: Tests the robustness of the model with predefined perturbations (such as adversarial attacks or added noise). These methods usually preserve the image-label structure, but the types of perturbation are limited.
   - **New-dataset-based evaluation**: Tests the generalization ability of the model by constructing new datasets. These methods can introduce more diverse samples, but collecting and annotating the datasets is costly, and the datasets are difficult to update once released.

To solve these problems, the paper proposes a new robustness evaluation method that generates test samples which are sufficiently diverse and can change over time, while keeping the image-label structure consistent. The main contributions of the paper are:

- **Introducing a new robustness metric**: Directly measures the robustness gap of a model by comparing its performance with that of a foundation model.
- **Designing a new evaluation protocol**: Uses a pre-trained foundation model to generate new test samples. These samples are perturbed enough to differ from existing test samples while maintaining the original image-label structure.
- **Systematically studying current robustness techniques**: Using the proposed protocol and metric, the paper identifies the robustness gap between existing models and the foundation model, provides insights into the behavior of deep-learning models, and points the way for future research.

This method can reflect real-world model performance more faithfully and also avoids overfitting to fixed benchmark datasets, which helps advance robustness evaluation.
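One way to read "comparing the performance of the model with that of the foundation model" is as the model's accuracy restricted to the samples the foundation model itself labels correctly. The sketch below computes that quantity. The summary does not give the paper's exact formula, so this definition is an assumption, and `oracle_relative_robustness` is a hypothetical name.

```python
import math
from typing import Sequence


def oracle_relative_robustness(
    labels: Sequence[int],
    model_preds: Sequence[int],
    oracle_preds: Sequence[int],
) -> float:
    """Accuracy of the evaluated model on the subset of samples where the
    oracle (foundation model) prediction matches the reference label.

    Assumption: this is one plausible oracle-relative score, not
    necessarily the exact metric defined in the paper.
    """
    if not (len(labels) == len(model_preds) == len(oracle_preds)):
        raise ValueError("labels and predictions must have the same length")
    # Keep only samples whose label the oracle confirms.
    kept = [(y, m) for y, m, o in zip(labels, model_preds, oracle_preds) if o == y]
    if not kept:
        return math.nan  # undefined when the oracle confirms no sample
    return sum(1 for y, m in kept if m == y) / len(kept)


labels = [0, 1, 1, 0, 1]
model_preds = [0, 1, 0, 0, 0]
oracle_preds = [0, 1, 1, 1, 1]
score = oracle_relative_robustness(labels, model_preds, oracle_preds)
print(score)  # 0.5: the oracle confirms 4 samples, the model is right on 2 of them
```

Restricting the denominator to oracle-confirmed samples means a model is not penalized for images the oracle itself can no longer recognize, so the score reflects the gap to the oracle rather than absolute accuracy.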

Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models
