Abstract: "Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data. To do this, we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness when there are models with different training data. It may also explain the surprising effective robustness gains of zero-shot CLIP-like models exhibited in prior works that used ImageNet as the only ID test set, while the gains diminish under our new evaluation. Additional artifacts including interactive visualizations are provided at https://shizhouxing.github.io/effective-robustness.

What problem does this paper attempt to address?

The problem this paper addresses is a limitation of existing effective robustness evaluations when comparing models trained on different data distributions. Existing evaluations usually use a single test set (such as ImageNet) to measure the in-distribution (ID) performance of models, which becomes inaccurate when the compared models were trained on different data. For example, comparing a model trained on ImageNet with a zero-shot language-image pre-trained model trained on LAION in this way may lead to misleading conclusions.

### Main contributions of the paper

1. **Reveal the limitations of existing effective robustness evaluations**: Existing evaluations rely on a fixed ID test set, usually ImageNet. This works when all models are trained mainly on one dataset, but it no longer applies once large-scale pre-trained models trained on different data distributions need to be compared.
2. **Propose a new multi-ID effective robustness evaluation method**: To evaluate and compare the effective robustness of models trained on different data distributions more accurately, the authors use multiple ID test sets that together cover the training distributions of all evaluated models. The out-of-distribution (OOD) accuracy is predicted from the accuracies on these ID test sets by multi-dimensional linear regression, which provides a better estimate.
3. **Demonstrate the advantages of multi-ID evaluation**: The experiments show that using multiple ID test sets predicts the OOD accuracy of various models (including zero-shot CLIP models) better than relying on a single ID test set. This helps to explain the effective robustness gains of CLIP models observed in previous work; under the new evaluation method, these gains are significantly reduced.

### Background and methods

#### 1. Background

- **Robustness under natural distribution shift**: Under natural distribution shift, the OOD accuracy of a model is usually correlated with its ID accuracy. After a logit transformation of the accuracies, the relationship between ID accuracy and OOD accuracy follows a linear trend.
- **Existing effective robustness evaluation**: Taori et al. (2020) proposed effective robustness to measure the part of the OOD performance that exceeds what is expected given the ID accuracy. The OOD accuracy is predicted by univariate linear regression, and effective robustness is the difference between the actual OOD accuracy and this predicted value.

#### 2. Limitations of a single ID test set

- **Problems with a single ID test set**: With a single ID test set, different choices of that test set can lead to contradictory conclusions. For example, with ImageNet as the ID test set, models trained on YFCC appear to have higher effective robustness, whereas with YFCC as the ID test set the conclusion is reversed.

#### 3. Multi-ID effective robustness

- **Introduction of multi-ID test sets**: To address the limitations of a single ID test set, the authors propose to use multiple ID test sets, each corresponding to one training data distribution. The OOD accuracy is predicted from the accuracies on these ID test sets by multi-dimensional linear regression.
- **Extension of the baseline function**: A new baseline function \(\beta(x, y)\) predicts the OOD accuracy from the accuracies \(x\) and \(y\) on two ID test sets:

\[
\beta(x, y) = \operatorname{expit}\bigl(w_x \operatorname{logit}(x) + w_y \operatorname{logit}(y) + b\bigr)
\]

where \(\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)\) and \(\operatorname{expit}\) is the inverse function of \(\operatorname{logit}\).
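
The baseline function \(\beta(x, y)\) above can be illustrated with a minimal sketch. This is not the authors' code: it assumes an ordinary least-squares fit in logit space over all supplied models, and the function names (`fit_baseline`, `baseline`, `effective_robustness`) and the synthetic accuracies are hypothetical, chosen only for illustration.

```python
import numpy as np


def logit(p):
    """logit(p) = ln(p / (1 - p)), applied elementwise."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


def fit_baseline(id_accs, ood_acc):
    """Least-squares fit of logit(ood) = sum_k w_k * logit(id_k) + b.

    id_accs has shape (n_models, n_id_sets); ood_acc has shape (n_models,).
    Returns (weights, bias). Fitting on all models is an assumption of
    this sketch, not a detail taken from the paper.
    """
    id_accs = np.asarray(id_accs, dtype=float)
    design = np.column_stack([logit(id_accs), np.ones(id_accs.shape[0])])
    coef, *_ = np.linalg.lstsq(design, logit(ood_acc), rcond=None)
    return coef[:-1], coef[-1]


def baseline(id_accs, weights, bias):
    """beta(x, y, ...) = expit(w . logit(id_accs) + b) for each model."""
    return expit(logit(id_accs) @ weights + bias)


def effective_robustness(id_accs, ood_acc, weights, bias):
    """Actual OOD accuracy minus the OOD accuracy predicted by the baseline."""
    return np.asarray(ood_acc, dtype=float) - baseline(id_accs, weights, bias)


# Synthetic accuracies (made up for illustration, not results from the paper).
rng = np.random.default_rng(0)
x = rng.uniform(0.3, 0.9, size=20)  # accuracy on the first ID test set
y = rng.uniform(0.3, 0.9, size=20)  # accuracy on the second ID test set
true_w = np.array([0.6, 0.5])
true_b = -0.4
ood = expit(true_w[0] * logit(x) + true_w[1] * logit(y) + true_b)

# Multi-ID baseline: controls for both ID accuracies.
ids_multi = np.column_stack([x, y])
w_multi, b_multi = fit_baseline(ids_multi, ood)
rho_multi = effective_robustness(ids_multi, ood, w_multi, b_multi)

# Single-ID baseline: controls for the first ID accuracy only.
ids_single = x[:, None]
w_single, b_single = fit_baseline(ids_single, ood)
rho_single = effective_robustness(ids_single, ood, w_single, b_single)
```

In this synthetic setup the OOD accuracy depends on both ID accuracies, so the single-ID fit leaves residuals that would be read as effective robustness, while the multi-ID fit accounts for them. With one ID test set the same functions reduce to the univariate regression described in the background.
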
### Experimental results

- **CIFAR-like OOD test set**: On the CIFAR-like OOD test set, the multi-ID evaluation method has better fitting quality and more accurate OOD accuracy prediction than the single-ID evaluation method.
- **ImageNet-like OOD test set**: On the ImageNet-like OOD test set, the multi-ID evaluation method also improves the fitting quality and shows different data distributions

Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

Models Out of Line: A Fourier Lens on Distribution Shift Robustness

Dynamic robustness evaluation for automated model selection in operation

Benchmarking Low-Shot Robustness to Natural Distribution Shifts

OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift

OOD-CV-v2: An extended Benchmark for Robustness to Out-of-Distribution Shifts of Individual Nuisances in Natural Images

Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations

Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

An Empirical Study on Distribution Shift Robustness from the Perspective of Pre-Training and Data Augmentation

Robust Computer Vision in an Ever-Changing World: A Survey of Techniques for Tackling Distribution Shifts

Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study

A Comprehensive Study on Robustness of Image Classification Models: Benchmarking and Rethinking

Open-Vocabulary Object Detectors: Robustness Challenges under Distribution Shifts

ROBY: Evaluating the adversarial robustness of a deep model by its decision boundaries

Non-adversarial Robustness of Deep Learning Methods for Computer Vision

On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models

Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains

Robust Validation: Confident Predictions Even When Distributions Shift

Generalizability of Adversarial Robustness Under Distribution Shifts

Do-GOOD: Towards Distribution Shift Evaluation for Pre-Trained Visual Document Understanding Models