Abstract: Although empirical risk minimization (ERM) is widely applied in the machine learning community, its performance is limited on data with spurious correlations or subpopulations introduced by hidden attributes. Existing literature has proposed techniques to maximize group-balanced or worst-group accuracy when such correlations are present, yet at the cost of lower average accuracy. In addition, many existing works survey different subpopulation methods without revealing the inherent connections between them, which could hinder technological advancement in this area. In this paper, we identify importance sampling as a simple yet powerful tool for solving the subpopulation problem. On the theory side, we provide a new systematic formulation of the subpopulation problem and explicitly identify the assumptions that are not clearly stated in existing works. This helps to uncover the cause of the drop in average accuracy. We provide the first theoretical discussion of the connections between existing methods, revealing the core components that make them different. On the application side, we demonstrate that a single estimator is enough to solve the subpopulation problem. In particular, we introduce the estimator in both attribute-known and attribute-unknown scenarios in the subpopulation setup, offering flexibility in practical use cases. Empirically, we achieve state-of-the-art performance on commonly used benchmark datasets.

What problem does this paper attempt to address?

This paper aims to solve the performance degradation of machine learning models under subpopulation shift. Specifically, the authors are concerned with the poor performance of Empirical Risk Minimization (ERM) when the training and test data distributions are inconsistent. This inconsistency is usually caused by spurious correlations introduced by hidden attributes (such as color or background).

#### Main problem description

1. **Subpopulation shift**: When the proportions of different subpopulations differ between the training set and the test set, the model's performance on the test set declines significantly.
2. **Limitations of existing methods**: Although existing methods can achieve good worst-group accuracy, they often sacrifice average accuracy.
3. **Lack of theoretical explanation**: Current research has not sufficiently explained why these methods reduce average accuracy, which hinders progress in the area.

#### Main contributions of the paper

1. **Propose the importance sampling framework (DBA framework)**: By introducing importance sampling, the authors provide a new systematic formulation of the subpopulation problem and make explicit the assumptions that are not clearly stated in existing work.
2. **Reveal the reasons for the decline in average accuracy**: Through theoretical analysis, the authors show that the decline in average accuracy comes from a mismatch between the learning objective and the test data set.
3. **Unify existing methods**: The DBA framework places existing methods in a single statistical framework and identifies their core differences.
4. **Propose solutions**: The authors propose three different estimation methods, suited to subpopulation problems in different scenarios, and show that these methods can improve test performance in practical applications.

#### Formula summary

- **Weight function of importance sampling**:
  \[ z(x, y, I_{\text{va}}, I_{\text{te}}) := \frac{p(x, y | I_{\text{te}})}{p(x, y | I_{\text{va}})} \]
- **Optimization objective**:
  \[ \max_{q \in M_{\text{tr}}} \mathbb{E}_{(x,y) \sim p(x,y|I_{\text{va}})}[\log q(y|x, I_{\text{tr}})] \]
- **Weight function \(g(x, y, I_{\text{tr}}, I_{\text{te}})\)**:
  \[ g(x, y, I_{\text{tr}}, I_{\text{te}})^{-1} := \frac{p(m_0 | I_{\text{tr}}) + p(m_1 | I_{\text{tr}}) \cdot L}{p(y | I_{\text{tr}}) \cdot p(y | m_1, I_{\text{tr}}) \left(1 + h \frac{p(m_0 | I_{\text{tr}}) \cdot p(y | I_{\text{tr}}) / L + p(y | m_1, I_{\text{tr}})}{p(m_0 | I_{\text{tr}}) \cdot p(y | I_{\text{tr}})} \right) \cdot \left(1 - \frac{p(s = y | y, x, I_{\text{tr}})}{p(s = y | y, x, I_{\text{tr}})}\right)} \]

Through these methods and theoretical analyses, the authors not only solve the sub-population bias.
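To make the importance-sampling weight concrete, the following is a minimal Python sketch, not the paper's estimator. It assumes the group attribute of every sample is known and that the distribution of \((x, y)\) within each group is the same in the training and target data, so the density ratio reduces to a ratio of group proportions. The function names `group_importance_weights` and `weighted_mean_loss`, and the toy losses, are hypothetical and chosen only for illustration.

```python
import numpy as np


def group_importance_weights(groups, target_props):
    """Per-sample weights w_i = p_target(g_i) / p_train(g_i).

    Assumption (not from the source): group labels are observed, and the
    within-group distribution of (x, y) is shared by train and target data.
    """
    groups = np.asarray(groups)
    ids, counts = np.unique(groups, return_counts=True)
    # Empirical group proportions in the training sample.
    train_props = dict(zip(ids.tolist(), (counts / counts.sum()).tolist()))
    return np.array([target_props[g] / train_props[g] for g in groups.tolist()])


def weighted_mean_loss(losses, weights):
    """Self-normalized importance-sampling estimate of the target-domain loss."""
    losses = np.asarray(losses, dtype=float)
    return float(np.sum(weights * losses) / np.sum(weights))


# Toy data: group 0 makes up 90% of the training set, group 1 only 10%.
groups = np.array([0] * 90 + [1] * 10)
# Hypothetical per-sample losses: the minority group is fit much worse.
losses = np.where(groups == 0, 0.1, 1.0)

# Target distribution with balanced groups.
target_props = {0: 0.5, 1: 0.5}
weights = group_importance_weights(groups, target_props)

erm_loss = float(losses.mean())                      # unweighted (ERM-style) average
balanced_loss = weighted_mean_loss(losses, weights)  # estimate under balanced groups
print(erm_loss, balanced_loss)
```

On this toy data the unweighted average is dominated by the majority group, while the reweighted estimate reflects the balanced target proportions. Choosing `target_props` equal to the actual test proportions, rather than balanced ones, would instead target average test performance, which illustrates the objective mismatch mentioned above.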

Boosting Test Performance with Importance Sampling--a Subpopulation Perspective

Evaluating Model Performance Under Worst-case Subpopulations

UMIX: Improving Importance Weighting for Subpopulation Shift Via Uncertainty-Aware Mixup

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources

Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values

Efficient Importance Sampling for Rare Event Simulation with Applications

Optimal Subsampling Approaches for Large Sample Linear Regression

Subsampled Optimization: Statistical Guarantees, Mean Squared Error Approximation, and Sampling Method

An empirical evaluation of sampling methods for the classification of imbalanced data

Less Is Better: Unweighted Data Subsampling via Influence Function

Reweighted Mixup for Subpopulation Shift

Optimal Subsampling Bootstrap for Massive Data

A sub-sampling algorithm preventing outliers

Optimal subsampling algorithm for the marginal model with large longitudinal data

A model robust sub-sampling approach for Generalised Linear Models in Big data settings

Change is Hard: A Closer Look at Subpopulation Shift

On Harmonizing Implicit Subpopulations

Nonasymptotic Bounds for Suboptimal Importance Sampling

Estimation and testing of expectile regression with efficient subsampling for massive data

Entropy and Confidence-Based Undersampling Boosting Random Forests for Imbalanced Problems.

Rare Event Prediction Using Similarity Majority Under-Sampling Technique