Abstract: We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation.

What problem does this paper attempt to address?

The paper asks how to fuse individual-level data from multiple independent sources when conducting statistical inference about a parameter of interest. Specifically, it focuses on data fusion when the conditional distributions supplied by the different sources may correspond to different factorizations of the target joint distribution. Examples include, but are not limited to, two-sample instrumental variable analysis, combining epidemiological studies with different designs (such as prospective cohort studies and retrospective case-control studies), and correcting for measurement error in variables by means of external validation studies.

### Main contributions of the paper

1. **Extension of existing theory**: The paper extends the existing comprehensive theory so that it can handle cases where the conditional distributions supplied by different data sources do not correspond to a single factorization of the target distribution. This extension makes the theory applicable to a wider range of data fusion problems.
2. **General results**: The paper provides general results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, regardless of the number of data sources, the specific parameter, or the statistical model for the target distribution.
3. **Machine-learning debiasing**: The paper provides a theoretical basis for machine-learning debiased, semiparametric efficient estimation, which is especially important when dealing with complex data structures.

### Specific application scenarios

- **Two-sample instrumental variable analysis**: How to combine two data sources for causal inference, where one source measures the instrument and the treatment and the other measures the instrument and the outcome.
- **Measurement error problem**: How to correct for measurement error in variables from the main study by using external validation studies.
- **Epidemiological studies with different designs**: How to integrate data from epidemiological studies with different designs (such as prospective cohort studies and retrospective case-control studies) to improve the efficiency and accuracy of estimation.

### Methodological innovations

- **Model definition**: The paper introduces a coarsened-data model. Unlike in a missing-data model, the sample-unit combinations in the coarsened-data model do not represent a random sample from the ideal target population. What matters is whether the observed part of the data is sufficient to identify the conditional distributions that align with the target population.
- **Influence functions and efficient influence functions**: The paper discusses in detail the properties of influence functions and efficient influence functions, in particular the consistency issues that arise under different data-source conditions.
- **Algorithm implementation**: The paper provides an algorithm for computing observed-data influence functions from ideal-data influence functions, and discusses the challenges in computing the efficient influence function.

### Conclusion

The paper provides a unified theoretical framework for statistical inference in individual-level data fusion, especially for cases where the conditional distributions supplied by different data sources do not correspond to a single factorization of the target distribution. The theory extends the scope of existing methods and provides a theoretical basis for machine-learning debiased, semiparametric efficient estimation.
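To make the two-sample instrumental variable scenario mentioned above concrete, here is a minimal sketch in Python. It is not the paper's estimator: it uses the classical two-sample Wald ratio for a linear model, and it only illustrates how two sources that each observe a different pair of variables can be combined. The function name `two_sample_wald`, the simulated data, and the true effect of 2.0 are all assumptions made for illustration.

```python
import numpy as np


def two_sample_wald(z_a, x_a, z_b, y_b):
    """Classical two-sample Wald ratio (illustrative, not the paper's method).

    Source A observes (instrument Z, treatment X).
    Source B observes (instrument Z, outcome Y).
    The estimate is the slope of Y on Z (from B) divided by
    the slope of X on Z (from A).
    """
    z_a, x_a, z_b, y_b = (np.asarray(v, dtype=float) for v in (z_a, x_a, z_b, y_b))
    slope_xz = np.cov(z_a, x_a)[0, 1] / np.var(z_a, ddof=1)  # first stage, source A
    slope_yz = np.cov(z_b, y_b)[0, 1] / np.var(z_b, ddof=1)  # reduced form, source B
    return slope_yz / slope_xz


# Simulated data under an assumed linear model with an unmeasured confounder U.
rng = np.random.default_rng(0)
n = 20_000
true_effect = 2.0  # hypothetical causal effect of X on Y


def simulate(n):
    z = rng.normal(size=n)                       # instrument
    u = rng.normal(size=n)                       # unmeasured confounder
    x = 0.8 * z + u + rng.normal(size=n)         # treatment
    y = true_effect * x + u + rng.normal(size=n) # outcome
    return z, x, y


z_a, x_a, _ = simulate(n)  # source A: the outcome is not recorded
z_b, _, y_b = simulate(n)  # source B: the treatment is not recorded

estimate = two_sample_wald(z_a, x_a, z_b, y_b)
print(f"two-sample Wald estimate: {estimate:.3f}")
```

Neither source alone identifies the effect here, since no unit has both the treatment and the outcome recorded. The ratio is valid only under the assumed linear model and under the assumption that both sources share the same instrument distribution and first-stage relationship; the paper's theory concerns the much more general question of which influence functions, and which efficient estimators, are available in such fused designs.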

Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data
