Towards Backwards-Compatible Data with Confounded Domain Adaptation

Calvin McCarter
2024-11-11
Abstract: Most current domain adaptation methods address either covariate shift or label shift, but are not applicable where they occur simultaneously and are confounded with each other. Domain adaptation approaches which do account for such confounding are designed to adapt covariates to optimally predict a particular label whose shift is confounded with covariate shift. In this paper, we instead seek to achieve general-purpose data backwards compatibility. This would allow the adapted covariates to be used for a variety of downstream problems, including on pre-existing prediction models and on data analytics tasks. To do this we consider a modification of generalized label shift (GLS), which we call confounded shift. We present a novel framework for this problem, based on minimizing the expected divergence between the source and target conditional distributions, conditioning on possible confounders. Within this framework, we provide concrete implementations using the Gaussian reverse Kullback-Leibler divergence and the maximum mean discrepancy. Finally, we demonstrate our approach on synthetic and real datasets.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses domain adaptation when covariate shift and label shift occur simultaneously and are confounded with each other. Most existing domain adaptation methods handle only one of these two shifts and cannot deal with the case where both are present and influence each other.

Specifically, the author proposes a new framework to achieve general-purpose data backwards compatibility. This means that the adapted covariates can be used for multiple downstream tasks, including existing prediction models and data analysis tasks. To this end, the author introduces a modified version of generalized label shift (GLS), called **confounded shift**. Under this framework, the author adapts the data by minimizing the expected divergence between the conditional distributions of the source and target domains, conditioned on possible confounders.

### Main contributions

1. **The concept of confounded shift**: The author proposes confounded shift, which allows the covariate and label distributions to differ between the source and target domains, but assumes that the conditional distribution can be made the same as that of the source domain by adapting the target covariates.
2. **New framework**: Based on minimizing the expected divergence between the conditional distributions of the source and target domains, the author provides a new framework with concrete implementations that use the Gaussian reverse Kullback-Leibler divergence (Gaussian reverse KLD) and the maximum mean discrepancy (MMD) as divergence functions.
3. **Application scenarios**: The author demonstrates the method on synthetic and real-world datasets, especially in biomedical fields such as EEG data.
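
To make the idea of an expected divergence over confounder values concrete, here is a minimal sketch, not the author's implementation. It assumes univariate Gaussian conditionals, a binary confounder `z`, and a scalar affine map `a * x + b`; all names and numbers are hypothetical. The map is applied to the source side here, which is an illustrative choice only.

```python
import math

def kl_gaussian(mean_q, std_q, mean_p, std_p):
    """Closed-form KL(Q || P) for univariate Gaussians Q and P."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

def reverse_kld(p_params, q_params):
    """d_reverse-KLD(P, Q) = KL(Q || P), with each argument a (mean, std) pair."""
    (mean_p, std_p), (mean_q, std_q) = p_params, q_params
    return kl_gaussian(mean_q, std_q, mean_p, std_p)

def affine_adapt(params, a, b):
    """Push N(mean, std^2) through x -> a * x + b (still Gaussian)."""
    mean, std = params
    return (a * mean + b, abs(a) * std)

# Hypothetical conditionals p(x | z) for each confounder value z (assumed data).
source_cond = {0: (0.0, 1.0), 1: (1.0, 1.0)}
target_cond = {0: (3.0, 2.0), 1: (5.0, 2.0)}
p_z = {0: 0.5, 1: 0.5}  # assumed distribution over the confounder

def expected_divergence(a, b):
    """Average the per-confounder divergence, weighted by p(z)."""
    return sum(
        p_z[z] * reverse_kld(affine_adapt(source_cond[z], a, b), target_cond[z])
        for z in p_z
    )

div_identity = expected_divergence(a=1.0, b=0.0)  # no adaptation
div_adapted = expected_divergence(a=2.0, b=3.0)   # map that aligns both conditionals
print(div_identity, div_adapted)
```

In this toy setup a single affine map aligns the conditionals for every confounder value, so the expected divergence drops to zero. With real data the map would be found by numerical optimization rather than chosen by hand.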
### Summary of mathematical formulas

- **Gaussian reverse Kullback-Leibler divergence (Gaussian reverse KLD)**:
  \[ d_{\text{reverse-KLD}}(P, Q) = d_{\text{KL}}(Q \| P) \]
  where \(P\) and \(Q\) are the conditional distributions of the source and target domains respectively.
- **Maximum mean discrepancy (MMD)**:
  \[ \text{MMD}^2(D_T, D_S) = \mathbb{E}_{x_1, x_1' \sim D_T} k_X(x_1, x_1') - 2\,\mathbb{E}_{x_1 \sim D_T, x_2 \sim D_S} k_X(x_1, A x_2 + b) + \mathbb{E}_{x_2, x_2' \sim D_S} k_X(A x_2 + b, A x_2' + b) \]
  where \(k_X\) is a kernel on the covariates and \(A x_2 + b\) is the affine map applied to samples from \(D_S\).

### Conclusion

By introducing the concept of confounded shift and a novel framework, this paper addresses a gap in existing domain adaptation methods: they cannot handle covariate shift and label shift that occur simultaneously and are confounded with each other. This provides a new route to general-purpose data backwards compatibility, making the adapted data applicable to multiple downstream tasks.
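
As a supplementary illustration of the MMD formula summarized above, the following sketch estimates \(\text{MMD}^2\) from samples with a biased (V-statistic) estimator. It is an assumption-laden toy: the covariates are scalars, so `a` and `b` stand in for \(A\) and \(b\); the RBF kernel and its bandwidth are arbitrary choices; the data is synthetic. It also omits the conditioning on confounders that the framework relies on.

```python
import math
import random

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel on scalars (an assumed choice for k_X)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y), including x == y pairs."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2_affine(target, source, a, b, bandwidth=1.0):
    """Biased estimate of MMD^2 between target samples and a * source + b."""
    mapped = [a * x + b for x in source]
    return (mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(target, mapped, bandwidth)
            + mean_kernel(mapped, mapped, bandwidth))

# Synthetic data: the target is an exact affine image of the source (assumption).
rng = random.Random(0)
source_samples = [rng.gauss(0.0, 1.0) for _ in range(200)]
target_samples = [2.0 * x + 3.0 for x in source_samples]

mmd_identity = mmd2_affine(target_samples, source_samples, a=1.0, b=0.0)
mmd_matched = mmd2_affine(target_samples, source_samples, a=2.0, b=3.0)
print(mmd_identity, mmd_matched)
```

The estimate is clearly positive when no adaptation is applied and vanishes once the affine map matches the one that generated the target, which is the behavior an optimizer over `a` and `b` would exploit.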