Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models

Peiyan Zhang, Haoyang Liu, Chaozhuo Li, Xing Xie, Sunghun Kim, Haohan Wang
2024-05-16
Abstract: Machine learning has demonstrated remarkable performance over finite datasets, yet whether the scores over the fixed benchmarks can sufficiently indicate the model's performance in the real world is still in discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., the human users), thus a good evaluation protocol is probably to evaluate the models' behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle (i.e., a foundation model). Besides, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same image-label structure the original test image represents, constrained by a foundation model pretrained with a large amount of samples. As a result, our new method will offer us a new way to evaluate the models' robustness performance, free of limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the model and our new evaluation strategies.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses the gap between the benchmark scores of image classification models and their performance in practical applications. It identifies two underlying issues:

1. **Limitations of fixed benchmark datasets**: Although existing machine-learning models achieve high accuracy on specific tasks, their real-world performance often falls short of expectations. The main reason is that fixed benchmark datasets (such as ImageNet) cannot fully represent the diversity of samples a model encounters after deployment (Recht et al., 2019; Wu et al., 2023). Their test sets are usually drawn from the same distribution as the training data (the i.i.d. setting) and cannot cover all possible variations.
2. **Insufficient robustness evaluation**: Current robustness evaluation methods fall into two categories:
   - **Perturbation-based evaluation**: Tests the robustness of the model with predefined perturbations (such as adversarial attacks or added noise). These methods usually preserve the image-label structure, but the types of perturbation are limited.
   - **New-dataset-based evaluation**: Tests the generalization ability of the model on newly constructed datasets. These methods can introduce more diverse samples, but collecting and annotating the datasets is costly, and the datasets are difficult to update once released.

To address these issues, the paper proposes a new robustness evaluation method that generates test samples which are sufficiently diverse and dynamically changing while preserving the image-label structure. The main contributions of the paper are:

- **Introducing a new robustness metric**: Directly measures the robustness gap of a model by comparing its performance with that of a foundation model.
- **Designing a new evaluation protocol**: Uses a pre-trained foundation model to generate new test samples. These samples are perturbed enough to differ from existing test samples while preserving the original image-label structure.
- **Systematically studying current robustness techniques**: Uses the proposed protocol and metric to identify the robustness gap between existing models and the foundation model, provide insights into the behavior of deep-learning models, and suggest directions for future research.

Because the test samples are generated rather than fixed, this method can reflect real-world performance more faithfully than a static benchmark and avoids overfitting to fixed benchmark datasets.
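
The idea of oracle-constrained evaluation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual metric or generation procedure: the function names `perturb_within_oracle` and `oracle_relative_accuracy` are hypothetical, integers stand in for images, and two threshold functions stand in for the foundation model (oracle) and the evaluated classifier.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")  # stand-in for an image type


def perturb_within_oracle(
    x: X,
    label: int,
    oracle: Callable[[X], int],
    step: Callable[[X], X],
    max_steps: int = 10,
) -> X:
    """Apply `step` repeatedly and return the most perturbed version of `x`
    that the oracle still assigns to `label` (hypothetical helper)."""
    kept = x
    for _ in range(max_steps):
        candidate = step(kept)
        if oracle(candidate) != label:
            break  # the oracle no longer sees the original label; stop here
        kept = candidate
    return kept


def oracle_relative_accuracy(
    model: Callable[[X], int],
    oracle: Callable[[X], int],
    data: List[Tuple[X, int]],
) -> float:
    """Accuracy of `model` restricted to samples whose label the oracle
    confirms. An assumed stand-in for an oracle-relative robustness score."""
    valid = [(x, y) for x, y in data if oracle(x) == y]
    if not valid:
        return float("nan")
    return sum(model(x) == y for x, y in valid) / len(valid)


# Toy stand-ins (assumptions for illustration only).
def oracle(x: int) -> int:
    return 1 if x >= 0 else 0  # plays the role of the foundation model


def model(x: int) -> int:
    return 1 if x >= 3 else 0  # a more brittle classifier under evaluation


def step(x: int) -> int:
    return x - 1  # one unit of perturbation


original = [(5, 1), (4, 1), (-2, 0)]
perturbed = [
    (perturb_within_oracle(x, y, oracle, step, max_steps=3), y)
    for x, y in original
]

clean_score = oracle_relative_accuracy(model, oracle, original)
robust_score = oracle_relative_accuracy(model, oracle, perturbed)
print(clean_score, robust_score)
```

In this toy setup the evaluated classifier agrees with every original label but loses accuracy on the perturbed samples, even though the oracle still confirms their labels. That difference is the kind of robustness gap the summary describes; a real evaluation would replace the toy functions with an image classifier, a foundation model, and image-level perturbations.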