Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data

Ellen Graham, Marco Carone, Andrea Rotnitzky
2024-09-16
Abstract: We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation.
Statistics Theory, Methodology, Machine Learning
What problem does this paper attempt to address?
This paper asks how to fuse individual-level data from multiple independent sources when conducting inference about a parameter of interest. Specifically, it focuses on settings where the conditional distributions supplied by different data sources may correspond to different factorizations of the target joint distribution. Examples include two-sample instrumental variable analysis, combining epidemiological studies with different designs (such as prospective cohort studies and retrospective case-control studies), and correcting variables measured with error by means of external validation studies.

### Main contributions of the paper

1. **Extension of existing theory**: The paper extends the existing comprehensive theory so that it can handle cases where the conditional distributions supplied by different data sources do not correspond to a single factorization of the target distribution. This extension makes the theory applicable to a wider range of data fusion problems.
2. **General results**: The paper characterizes the class of all influence functions of regular asymptotically linear estimators, and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution.
3. **Machine-learning debiasing**: The paper provides a theoretical basis for machine-learning debiased, semiparametric efficient estimation, which is especially important when dealing with complex data structures.

### Specific application scenarios

- **Two-sample instrumental variable analysis**: How to combine, for causal inference, one data source that measures the instrument and the treatment with another that measures the instrument and the outcome.
- **Measurement error problem**: How to correct variables measured with error in the main study using external validation studies.
- **Epidemiological studies with different designs**: How to integrate data from epidemiological studies with different designs (such as prospective cohort studies and retrospective case-control studies) to improve the efficiency and accuracy of estimation.

### Methodological innovations

- **Model definition**: The paper introduces a coarsened-data model. Unlike in a missing-data model, the sample-unit combinations in the coarsened-data model do not represent a random sample from the ideal target population. What matters is whether the observed part of the data suffices to identify the conditional distributions aligned with the target population.
- **Influence functions and efficient influence functions**: The paper discusses in detail the characteristics of influence functions and efficient influence functions, especially the consistency issue under different data-source conditions.
- **Algorithm implementation**: The paper provides an algorithm for computing observed-data influence functions from ideal-data influence functions, and discusses the challenges in computing the efficient influence function.

### Conclusion

The paper provides a unified theoretical framework for statistical inference in individual-level data fusion, especially for cases where the conditional distributions supplied by different data sources do not correspond to a single factorization of the target distribution. The theory extends the scope of existing methods and provides a theoretical basis for machine-learning debiased, semiparametric efficient estimation.
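As a concrete illustration of the two-sample instrumental variable scenario listed above, the following is a minimal sketch of a simple plug-in (Wald ratio) estimator. It is not the efficient estimator derived in the paper. The function names `slope`, `two_sample_wald`, and `simulate` are hypothetical, and the linear data-generating model, the effect size, and the sample sizes are assumptions chosen purely for the simulation. The two sources are deliberately given different instrument distributions, so that only the conditional distributions of treatment given instrument (source 1) and outcome given instrument (source 2) are shared.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope of y on x (regression with an intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)


def two_sample_wald(z1, a1, z2, y2):
    """Ratio of the reduced-form slope (source 2) to the first-stage slope (source 1)."""
    first_stage = slope(z1, a1)   # instrument -> treatment, from source 1 only
    reduced_form = slope(z2, y2)  # instrument -> outcome, from source 2 only
    return reduced_form / first_stage


def simulate(n, z_mean, z_sd, rng, effect=2.0):
    """Assumed linear model with an unmeasured confounder u (illustration only)."""
    z = rng.normal(z_mean, z_sd, n)          # instrument
    u = rng.normal(size=n)                   # unmeasured confounder
    a = 0.8 * z + u + rng.normal(size=n)     # treatment
    y = effect * a + u + rng.normal(size=n)  # outcome
    return z, a, y


rng = np.random.default_rng(0)
# Source 1 records (instrument, treatment); the outcome is discarded.
z1, a1, _ = simulate(50_000, 0.0, 1.0, rng)
# Source 2 records (instrument, outcome); the treatment is discarded.
# Its instrument distribution differs from source 1 on purpose.
z2, _, y2 = simulate(50_000, 0.5, 1.5, rng)

estimate = two_sample_wald(z1, a1, z2, y2)
print(f"two-sample IV estimate: {estimate:.3f} (simulated true effect: 2.0)")
```

Because neither source observes treatment and outcome together, a standard single-sample analysis is unavailable here. The ratio works in this sketch only because the assumed model is linear, so each slope depends on a conditional distribution that both sources share with the target.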