Abstract: Consider a setting where there are $N$ heterogeneous units and $p$ interventions. Our goal is to learn unit-specific potential outcomes for any combination of these $p$ interventions, i.e., $N \times 2^p$ causal parameters. Choosing a combination of interventions is a problem that naturally arises in a variety of applications such as factorial design experiments, recommendation engines, combination therapies in medicine, conjoint analysis, etc. Running $N \times 2^p$ experiments to estimate the various parameters is likely expensive and/or infeasible as $N$ and $p$ grow. Further, with observational data there is likely confounding, i.e., whether or not a unit is seen under a combination is correlated with its potential outcome under that combination. To address these challenges, we propose a novel latent factor model that imposes structure across units (i.e., the matrix of potential outcomes is approximately rank $r$) and across combinations of interventions (i.e., the coefficients in the Fourier expansion of the potential outcomes are approximately $s$-sparse). We establish identification for all $N \times 2^p$ parameters despite unobserved confounding. We propose an estimation procedure, Synthetic Combinations, and establish that it is finite-sample consistent and asymptotically normal under precise conditions on the observation pattern. Our results imply consistent estimation given $\text{poly}(r) \times \left( N + s^2p\right)$ observations, while previous methods have sample complexity scaling as $\min(N \times s^2p, \ \text{poly}(r) \times (N + 2^p))$. We use Synthetic Combinations to propose a data-efficient experimental design. Empirically, Synthetic Combinations outperforms competing approaches on a real-world dataset on movie recommendations. Lastly, we extend our analysis to do causal inference where the intervention is a permutation over $p$ items (e.g., rankings).
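To make the sparsity assumption concrete, the following is a minimal sketch of the Fourier expansion of a single unit's potential outcomes over all $2^p$ combinations. It is an illustration under stated assumptions, not the paper's estimation procedure: combinations are encoded as vectors in $\{-1, +1\}^p$ (the standard convention for Boolean Fourier analysis, assumed here), and the function `toy_outcome` and the helper `fourier_coefficients` are hypothetical names introduced only for this example.

```python
import itertools
import math


def fourier_coefficients(outcomes, p):
    """Compute all 2^p Fourier coefficients of a function on {-1, +1}^p.

    `outcomes` maps each combination (a tuple of +/-1 values) to an outcome.
    The coefficient for a subset S is the average of f(x) * prod_{i in S} x_i
    over all combinations x.
    """
    combos = list(itertools.product([-1, 1], repeat=p))
    coeffs = {}
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            # Character chi_S(x): product of x_i for i in S (empty product is 1).
            total = sum(
                outcomes[x] * math.prod(x[i] for i in subset) for x in combos
            )
            coeffs[subset] = total / len(combos)
    return coeffs


# Hypothetical unit with p = 3 interventions whose outcome depends only on
# a baseline, intervention 0, and the interaction of interventions 1 and 2.
p = 3


def toy_outcome(x):
    return 2.0 + 1.5 * x[0] - 0.5 * x[1] * x[2]


outcomes = {x: toy_outcome(x) for x in itertools.product([-1, 1], repeat=p)}
coeffs = fourier_coefficients(outcomes, p)

# Keep only the nonzero coefficients: the Fourier support of this unit.
support = {s: c for s, c in coeffs.items() if abs(c) > 1e-12}
print(support)
```

In this toy case only 3 of the $2^3 = 8$ coefficients are nonzero, so the unit is $s$-sparse with $s = 3$; the abstract's model assumes this kind of structure holds approximately for every unit.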

On integrating the number of synthetic data sets $m$ into the 'a priori' synthesis approach

Using saturated count models for user-friendly synthesis of categorical data

Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence

Multiply-Imputed Synthetic Data: Advice to the Imputer

Synthetic Census Microdata Generation: A Comparative Study of Synthesis Methods Examining the Trade-Off Between Disclosure Risk and Utility

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data

Synthetic data method to incorporate external information into a current study

Combining information from independent sources through confidence distributions

Statistical disclosure control for numeric microdata via sequential joint probability preserving data shuffling

Utility Assessment of Synthetic Data Generation Methods

Synthetic Combinations: A Causal Inference Framework for Combinatorial Interventions

Balancing Inferential Integrity and Disclosure Risk via Model Targeted Masking and Multiple Imputation

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

A density ratio framework for evaluating the utility of synthetic data

Bayesian Synthesis: Combining subjective analyses, with an application to ozone data

On the Equivalency, Substitutability, and Flexibility of Synthetic Data

Insufficient Gibbs sampling

Bayesian Estimation of Attribute Disclosure Risks in Synthetic Data with the $\texttt{AttributeRiskCalculation}$ R Package

Practical privacy metrics for synthetic data

Sequential Bayesian Data Synthesis for Mediation and Regression Analysis