Abstract: Data from observational studies (OSs) are widely available and readily obtainable, yet they frequently contain confounding biases. Data from randomized controlled trials (RCTs), on the other hand, help to reduce these biases; however, such data are expensive to gather, so randomized datasets tend to be small. For this reason, effectively fusing observational data and randomized data to better estimate heterogeneous treatment effects (HTEs) has gained increasing attention. However, existing methods for integrating observational data with randomized data require \textit{complete} observational data, meaning that both treated and untreated subjects must be included in the OSs. This prerequisite confines the applicability of such methods to very specific situations, since including both treated and untreated subjects in observational studies is not always achievable. In this paper, we propose a resilient approach to \textbf{C}ombine \textbf{I}ncomplete \textbf{O}bservational data and randomized data for HTE estimation, which we abbreviate as \textbf{CIO}. CIO can estimate HTEs efficiently whether the observational data are complete or incomplete. Concretely, a confounding bias function is first derived from the pseudo-experimental group from OSs, together with the pseudo-control group from RCTs, via an effect estimation procedure. This function is then used as a corrective residual to rectify the observed outcomes of the observational data during HTE estimation, which combines the available observational data with all of the randomized data. To validate our approach, we have conducted experiments on a synthetic dataset and two semi-synthetic datasets.
### What problems does this paper attempt to solve?
This paper addresses how to effectively fuse incomplete observational data, drawn from observational studies (OSs), with randomized data, drawn from randomized controlled trials (RCTs), to better estimate heterogeneous treatment effects (HTEs). Specifically, it addresses the following key issues:
1. **Limitations of existing methods**: Existing methods for combining OSs and RCTs usually require complete observational data, that is, they must include both individuals who have received treatment and those who have not. However, in practical applications, obtaining complete observational data is often not feasible, which limits the application scope of these methods.
2. **Challenges of incomplete data**: In many cases, observational studies may not cover both the treatment group and the control group simultaneously, resulting in incomplete data. For example, in the early stage of a new drug's launch, patients may be more inclined to continue using existing drugs and are unwilling to try new drugs, which makes the data in the treatment group scarce or missing.
3. **The need to fuse incomplete data**: To solve the above problems, the paper proposes a new method named CIO (Combine Incomplete Observational data and randomized data for HTE estimation). CIO can still effectively estimate HTEs when the observational data is incomplete, thus expanding the application scenarios of data fusion methods.
### The core idea of the CIO method
- **Pseudo-treatment group and pseudo-control group**: The method introduces a virtual treatment variable \(D = T(1 - S)\), where \(T\) indicates whether a subject received treatment and \(S\) indicates the data source (0 for observational studies, 1 for randomized controlled trials). Thus \(D = 1\) only when \(T = 1\) and \(S = 0\), and \(D = 0\) otherwise. This yields a pseudo-treatment group (\(D = 1\), called the pseudo-experimental group in the abstract) and a pseudo-control group (\(D = 0\)), from which the confounding bias function is learned.
- **Learning the confounding bias function**: The pseudo-treatment and pseudo-control groups are used to learn the confounding bias function \(c(X)\), which describes the confounding bias in the observational data.
- **Adjusted outcome prediction**: The learned function is then used as a corrective residual to adjust the observed outcomes in the observational data, and the available observational data are combined with all of the randomized data to estimate HTEs.
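
The three steps above can be sketched on toy data. This is a minimal illustration under our own assumptions, not the paper's estimator: the data-generating process, the linear outcome models, the helper `fit_linear`, and the choice to estimate \(c(X)\) by comparing treated observational subjects with treated RCT subjects are all hypothetical simplifications. The observational data here contain treated subjects only, which is one form of incompleteness.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

rng = np.random.default_rng(0)

# Hypothetical toy data: S = 1 marks RCT rows, S = 0 marks observational rows.
n_rct, n_os = 200, 2000
X = rng.normal(size=(n_rct + n_os, 1))
S = np.concatenate([np.ones(n_rct, dtype=int), np.zeros(n_os, dtype=int)])
# The RCT randomizes treatment; the observational data are incomplete
# (treated subjects only).
T = np.concatenate([rng.integers(0, 2, size=n_rct), np.ones(n_os, dtype=int)])

# Virtual treatment variable from the text: D = T(1 - S).
D = T * (1 - S)

# Assumed data-generating process (illustration only): the confounding bias
# shifts the outcomes of treated observational subjects.
true_tau = 1.0 + 0.5 * X[:, 0]
true_bias = 0.3 + 0.8 * X[:, 0]
noise = rng.normal(scale=0.1, size=len(T))
Y = 2.0 * X[:, 0] + T * true_tau + D * true_bias + noise

# Step 1: estimate c(X) as the gap between treated observational outcomes
# (D = 1) and treated RCT outcomes (a simplifying assumption of this sketch).
rct_treated = (S == 1) & (T == 1)
m_os = fit_linear(X[D == 1], Y[D == 1])
m_rct = fit_linear(X[rct_treated], Y[rct_treated])
c_hat = m_os(X) - m_rct(X)

# Step 2: use c(X) as a corrective residual on the observational outcomes.
Y_corr = Y - D * c_hat

# Step 3: pool corrected observational data with all RCT data and estimate
# HTEs with one outcome model per treatment arm.
mu1 = fit_linear(X[T == 1], Y_corr[T == 1])
mu0 = fit_linear(X[T == 0], Y_corr[T == 0])
tau_hat = mu1(X) - mu0(X)

print("mean absolute HTE error:", np.mean(np.abs(tau_hat - true_tau)))
```

In this toy setting the linear models are correctly specified, so the corrected estimate should track the assumed effect closely. With a more flexible bias or outcome function, the linear fits would need to be replaced by a richer regressor.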
### Theoretical validity
The paper supports the validity of the CIO method through theoretical analysis. Specifically:
- Under the potential outcome framework (POF), the confounding bias function \(c(X)\) is identifiable.
- By correcting the confounding bias, the final HTE estimate is also identifiable.
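
As a hedged illustration of why such identifiability is plausible, consider the case where the observational data contain treated subjects only. The definition of \(c(X)\) and the assumptions below are ours and may differ from the paper's formal statement: we assume that treatment is randomized within the RCT, and that the RCT and observational populations share the same conditional mean potential outcomes given \(X\). Under these assumptions,

\[
c(X) = \mathbb{E}[Y \mid X, T=1, S=0] - \mathbb{E}[Y(1) \mid X]
     = \mathbb{E}[Y \mid X, T=1, S=0] - \mathbb{E}[Y \mid X, T=1, S=1],
\]

\[
\tau(X) = \mathbb{E}[Y(1) - Y(0) \mid X]
        = \bigl(\mathbb{E}[Y \mid X, T=1, S=0] - c(X)\bigr) - \mathbb{E}[Y \mid X, T=0, S=1].
\]

Every term on the right-hand side is a conditional mean of observed quantities, so both \(c(X)\) and \(\tau(X)\) can in principle be estimated from the combined data.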
### Experimental verification
To verify the effectiveness of the CIO method, the authors conducted experiments on:
- **A synthetic dataset**: Used to test the performance of CIO under controlled data conditions.
- **Two semi-synthetic datasets**: Used to further assess the performance of CIO in more realistic settings.
The experimental results are reported to show that CIO performs well when the observational data are complete and maintains high estimation accuracy when the observational data are incomplete.
### Summary
The CIO method proposed in this paper provides a new solution that can effectively estimate HTEs in the case of incomplete observational data, thus expanding the application scope of existing data fusion techniques and improving the accuracy and reliability of the estimates.