Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

Zhouxing Shi, Nicholas Carlini, Ananth Balashankar, Ludwig Schmidt, Cho-Jui Hsieh, Alex Beutel, Yao Qin
2023-10-29
Abstract: "Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data. To do this, we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness when there are models with different training data. It may also explain the surprising effective robustness gains of zero-shot CLIP-like models exhibited in prior works that used ImageNet as the only ID test set, while the gains diminish under our new evaluation. Additional artifacts including interactive visualizations are provided at https://shizhouxing.github.io/effective-robustness.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the limitations of existing effective robustness evaluations when comparing models trained on different data distributions. Existing evaluations usually use a single test set (such as ImageNet) to measure the in-distribution (ID) performance of models, which becomes inaccurate when the models being compared were trained on different data distributions. For example, comparing a model trained on ImageNet with a zero-shot language-image pre-trained model trained on LAION in this way may lead to misleading conclusions.

### Main contributions of the paper

1. **Reveal the limitations of existing effective robustness evaluations**:
   - Existing effective robustness evaluations rely on a fixed in-distribution (ID) test set, usually ImageNet. This works when all models are trained mainly on one dataset. With the emergence of large-scale pre-trained models, however, models trained on different data distributions need to be compared, and a single fixed ID test set is no longer adequate.
2. **Propose a new multi-ID effective robustness evaluation method**:
   - To evaluate and compare the effective robustness of models trained on different data distributions more accurately, the authors propose using multiple ID test sets that cover the training distributions of all evaluated models. A multi-dimensional linear regression predicts the out-of-distribution (OOD) accuracy from the accuracies on the multiple ID test sets, which provides a better estimate.
3. **Demonstrate the advantages of multi-ID evaluation**:
   - The authors demonstrate through experiments that using multiple ID test sets predicts the OOD accuracy of various models (including zero-shot CLIP models) better than relying on a single ID test set. This helps to explain the effective robustness gains of CLIP models observed in previous work; under the new evaluation method, these gains diminish.

### Background and methods

#### 1. Background

- **Robustness under natural distribution shift**: Under natural distribution shift, the OOD accuracy of a model is usually correlated with its ID accuracy. After a logit transformation of the accuracies, the relationship between ID accuracy and OOD accuracy follows a linear trend.
- **Existing effective robustness evaluation**: Taori et al. (2020) proposed effective robustness to measure the part of the OOD performance that exceeds what is expected given the ID accuracy. The OOD accuracy is predicted by univariate linear regression, and effective robustness is the difference between the actual OOD accuracy and the predicted value.

#### 2. Limitations of single-ID test set

- **Problems with a single ID test set**: With a single ID test set, different choices of that test set can lead to contradictory conclusions. For example, when ImageNet is used as the ID test set, the YFCC model appears to have higher effective robustness; when YFCC is used as the ID test set, the situation is reversed.

#### 3. Multi-ID effective robustness

- **Introduction of multi-ID test sets**: To address the limitations of a single ID test set, the authors propose using multiple ID test sets, each corresponding to one training data distribution. A multi-dimensional linear regression predicts the OOD accuracy from the accuracies on the multiple ID test sets.
- **Extension of the baseline function**: Define a new baseline function \(\beta(x, y)\) that predicts the OOD accuracy from the accuracies \(x\) and \(y\) on two ID test sets:

\[ \beta(x, y)=\text{expit}(w_x \,\text{logit}(x)+w_y \,\text{logit}(y)+b) \]

where \(\text{logit}(x)=\ln\left(\frac{x}{1 - x}\right)\) and \(\text{expit}(x)\) is the inverse function of \(\text{logit}(x)\).
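
The baseline fit and the resulting effective robustness can be illustrated with a minimal sketch. It assumes the baseline is fitted by ordinary least squares in logit space and that effective robustness is the actual OOD accuracy minus the baseline prediction, as described above. The function names (`fit_baseline`, `effective_robustness`), the use of NumPy, and all accuracy values are assumptions made for this illustration; this is not the authors' implementation, and the numbers are synthetic rather than results from the paper.

```python
import numpy as np


def logit(p):
    """Log-odds ln(p / (1 - p)) for accuracies strictly between 0 and 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit: maps a real number back to (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


def fit_baseline(id_accs, ood_accs):
    """Least-squares fit of logit(OOD accuracy) on logit(ID accuracies).

    id_accs:  array of shape (n_models, n_id_sets); one column gives the
              single-ID baseline, two columns give beta(x, y).
    ood_accs: array of shape (n_models,).
    Returns (weights, bias), with one weight per ID test set.
    """
    features = logit(id_accs)
    # Append a column of ones so the last coefficient is the bias b.
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, logit(ood_accs), rcond=None)
    return coef[:-1], coef[-1]


def effective_robustness(id_accs, ood_accs, weights, bias):
    """Actual OOD accuracy minus the accuracy predicted by the baseline."""
    predicted = expit(logit(id_accs) @ weights + bias)
    return np.asarray(ood_accs, dtype=float) - predicted


# Synthetic accuracies (hypothetical, for illustration only): five models,
# two ID test sets, and OOD accuracies that lie exactly on a known baseline.
x = np.array([0.50, 0.60, 0.70, 0.80, 0.90])  # accuracy on ID test set 1
y = np.array([0.40, 0.65, 0.55, 0.85, 0.75])  # accuracy on ID test set 2
id_accs = np.column_stack([x, y])
ood_accs = expit(0.6 * logit(x) + 0.3 * logit(y) - 0.5)

weights, bias = fit_baseline(id_accs, ood_accs)
rho = effective_robustness(id_accs, ood_accs, weights, bias)
print("w_x, w_y:", weights, "b:", bias)
print("effective robustness per model:", rho)
```

Because the synthetic OOD accuracies lie exactly on the baseline, the fit should recover the chosen coefficients and assign every model an effective robustness close to zero. A model whose OOD accuracy sits above the fitted surface would receive a positive value. Passing a single-column `id_accs` reduces the same code to the univariate, single-ID baseline.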
### Experimental results

- **CIFAR-like OOD test set**: On the CIFAR-like OOD test set, the multi-ID evaluation method has better fitting quality and more accurate OOD accuracy prediction than the single-ID evaluation method.
- **ImageNet-like OOD test set**: On the ImageNet-like OOD test set, the multi-ID evaluation method also improves the fitting quality and shows different data distributions