Boosting Test Performance with Importance Sampling--a Subpopulation Perspective

Hongyu Shen, Zhizhen Zhao
2024-12-17
Abstract: Although empirical risk minimization (ERM) is widely applied in the machine learning community, its performance is limited on data with spurious correlations or subpopulations introduced by hidden attributes. Existing literature has proposed techniques to maximize group-balanced or worst-group accuracy when such correlations are present, yet at the cost of lower average accuracy. In addition, many existing works survey different subpopulation methods without revealing the inherent connections between them, which could hinder technological advancement in this area. In this paper, we identify importance sampling as a simple yet powerful tool for solving the subpopulation problem. On the theory side, we provide a new systematic formulation of the subpopulation problem and explicitly identify the assumptions that are not clearly stated in existing works. This helps to uncover the cause of the drop in average accuracy. We provide the first theoretical discussion of the connections among existing methods, revealing the core components that make them different. On the application side, we demonstrate that a single estimator is enough to solve the subpopulation problem. In particular, we introduce the estimator in both attribute-known and attribute-unknown scenarios in the subpopulation setup, offering flexibility in practical use cases. Empirically, we achieve state-of-the-art performance on commonly used benchmark datasets.
Machine Learning
### What problems does this paper attempt to solve?

This paper addresses the performance degradation of machine-learning models under subpopulation shift. Specifically, the authors are concerned with the poor performance of Empirical Risk Minimization (ERM) when the training and test data distributions differ. This mismatch is usually caused by spurious correlations introduced by hidden attributes (such as color or background).

#### Main problem description

1. **Subpopulation shift**: When the proportions of subpopulations differ between the training set and the test set, the model's performance on the test set declines significantly.
2. **Limitations of existing methods**: Although existing methods can achieve good worst-group accuracy, they often sacrifice average accuracy.
3. **Lack of theoretical explanation**: Current research has not sufficiently explained why these methods reduce average accuracy, which hinders progress in the area.

#### Main contributions of the paper

1. **Propose the importance sampling framework (DBA framework)**: By introducing importance sampling, the authors provide a new systematic formulation of the subpopulation problem and make explicit the assumptions that existing work leaves unstated.
2. **Reveal the reasons for the decline in average accuracy**: Through theoretical analysis, the authors show that average accuracy declines because the learning objective does not match the test data set.
3. **Unify existing methods**: The DBA framework places existing methods within one statistical framework and identifies their core differences.
4. **Propose solutions**: The authors propose three different estimation methods, suited to subpopulation problems in different scenarios, and show that these methods can improve test performance in practical applications.

#### Formula summary

- **Weight function of importance sampling**:
  \[ z(x, y, I_{\text{va}}, I_{\text{te}}) := \frac{p(x, y | I_{\text{te}})}{p(x, y | I_{\text{va}})} \]
- **Optimization objective**:
  \[ \max_{q \in M_{\text{tr}}} \mathbb{E}_{(x,y) \sim p(x,y|I_{\text{va}})}[\log q(y|x, I_{\text{tr}})] \]
- **Weight function \(g(x, y, I_{\text{tr}}, I_{\text{te}})\)**:
  \[ g(x, y, I_{\text{tr}}, I_{\text{te}})^{-1} := \frac{p(m_0 | I_{\text{tr}}) + p(m_1 | I_{\text{tr}}) \cdot L}{p(y | I_{\text{tr}}) \cdot p(y | m_1, I_{\text{tr}}) \left(1 + h \frac{p(m_0 | I_{\text{tr}}) \cdot p(y | I_{\text{tr}}) / L + p(y | m_1, I_{\text{tr}})}{p(m_0 | I_{\text{tr}}) \cdot p(y | I_{\text{tr}})} \right) \cdot \left(1 - \frac{p(s = y | y, x, I_{\text{tr}})}{p(s = y | y, x, I_{\text{tr}})}\right)} \]

Through these methods and theoretical analyses, the authors address the subpopulation bias problem.
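
As a rough illustration of the importance sampling weight in the formula summary, the sketch below reweights per-sample losses by a ratio of target to reference probabilities. It is not the authors' estimator. It assumes (our assumption, not stated in the summary) that the group attribute of every sample is known and that the shift affects only group proportions, so the density ratio reduces to `p_target(g) / p_reference(g)`. The function names `group_weights` and `weighted_mean_loss`, the group labels, and the toy losses are all hypothetical.

```python
from collections import Counter


def group_weights(ref_groups, target_props):
    """Per-group importance weights p_target(g) / p_reference(g).

    ref_groups: group label of each reference (e.g. training) sample.
    target_props: assumed group proportions in the target distribution.
    """
    counts = Counter(ref_groups)
    n = len(ref_groups)
    # Empirical reference proportion of group g is counts[g] / n.
    return {g: target_props[g] / (counts[g] / n) for g in counts}


def weighted_mean_loss(losses, groups, weights):
    """Self-normalized importance-weighted average of per-sample losses."""
    w = [weights[g] for g in groups]
    return sum(wi * li for wi, li in zip(w, losses)) / sum(w)


# Toy data: group "a" makes up 80% of the reference set,
# while the hypothetical target distribution is balanced.
train_groups = ["a"] * 8 + ["b"] * 2
losses = [0.1] * 8 + [0.9] * 2  # the minority group is fit poorly

weights = group_weights(train_groups, {"a": 0.5, "b": 0.5})
erm_loss = sum(losses) / len(losses)                          # unweighted (ERM-style) mean
iw_loss = weighted_mean_loss(losses, train_groups, weights)   # reweighted mean

print(weights, erm_loss, iw_loss)
```

On this toy data the unweighted mean loss is about 0.26, while the reweighted mean is about 0.5, because the poorly fit minority group counts for half of the assumed target distribution. Which target proportions to plug in is exactly the kind of assumption the summary says should be stated explicitly: weighting toward a balanced target when the real test set is not balanced optimizes for a distribution other than the one being evaluated.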