Abstract: While many works have studied statistical data fusion, they typically assume that the various datasets are given in advance. However, in practice, estimation requires difficult data collection decisions like determining the available data sources, their costs, and how many samples to collect from each source. Moreover, this process is often sequential because the data collected at a given time can improve collection decisions in the future. In our setup, given access to multiple data sources and budget constraints, the agent must sequentially decide which data source to query to efficiently estimate a target parameter. We formalize this task using Online Moment Selection, a semiparametric framework that applies to any parameter identified by a set of moment conditions. Interestingly, the optimal budget allocation depends on the (unknown) true parameters. We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps. We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy. We empirically validate our methods on both synthetic and real-world causal effect estimation tasks, demonstrating that the online data collection policies outperform their fixed counterparts.

What problem does this paper attempt to address?

The main problem this paper addresses is: given multiple data sources and a budget constraint, how should data be collected to estimate a target parameter efficiently? Specifically, the authors focus on online data collection policies, that is, how to use the data gathered so far to decide which source to query next, so that statistical or causal parameters are estimated more efficiently.

### Problem Background

Traditional data fusion research usually assumes that every dataset is given in advance, ignoring the data collection decisions that arise in practice. In fields such as medical testing and survey design, data collection is costly and limited, so resources must be allocated carefully to maximize estimation accuracy. Data collection is also sequential: data collected early can inform later collection decisions.

### Paper Contributions

The paper proposes a framework called "Online Moment Selection (OMS)" to formalize this sequential decision-making problem. OMS builds on the Generalized Method of Moments (GMM) and improves the estimate of the target parameter by choosing which data source to query. The paper proposes two online data collection policies:

- **Explore-then-Commit (ETC)**: explore for an initial period, then use the data collected so far to fix the combination of data sources used for the remaining budget.
- **Explore-then-Greedy (ETG)**: after the exploration stage, keep updating the parameter estimates and adjust the choice of data source dynamically.

### Main Conclusions

The paper proves that both policies achieve zero regret asymptotically (measured by asymptotic MSE), that is, their performance approaches that of an oracle policy that knows the optimal allocation.
In addition, the experimental results show that these online policies outperform fixed data collection policies on causal effect estimation tasks, achieving lower regret and mean squared error (MSE).

### Mathematical Formulas

The key formulas in the paper are as follows:

1. **Moment Conditions**

   \[ g_t(\theta, \eta)=m(s_t)\odot\underbrace{\begin{bmatrix} \psi^{(1)}(Z_t^{(1)}; \theta, \eta^{(1)})\\ \vdots\\ \psi^{(M)}(Z_t^{(M)}; \theta, \eta^{(M)}) \end{bmatrix}}_{\tilde{g}_t(\theta, \eta)} \]

   where $\odot$ denotes element-wise multiplication, $\theta\in\Theta\subset\mathbb{R}^D$ is a finite-dimensional parameter, and $\eta = (\eta^{(1)}, \ldots, \eta^{(M)})$ is a nuisance parameter that may be high-dimensional or nonparametric.

2. **GMM Objective Function**

   \[ \hat{\theta}_T=\arg\min_{\theta\in\Theta}Q_T(\theta,\{\hat{\eta}_{t - 1}\}_{t = 1}^T) \]

   where

   \[ Q_T(\theta,\{\hat{\eta}_{t - 1}\}_{t = 1}^T)=\left(\frac{1}{T}\sum_{t = 1}^T g_t(\theta,\hat{\eta}_{t - 1})\right)^{\top} W_T\left(\frac{1}{T}\sum_{t = 1}^T g_t(\theta,\hat{\eta}_{t - 1})\right) \]

3. **Asymptotic Normality**

   \[ \sqrt{T}(\hat{\beta}_T-\beta^*)\xrightarrow{d}N_{\kappa_\infty}(0, V^*(\kap)
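To make the ETC and ETG descriptions and the GMM objective above more concrete, here is a minimal Python sketch on a toy problem. It is not the paper's algorithm. It assumes that every source yields noisy scalar observations of the same parameter, so the moment for source i is Z^(i) - theta, that there are no nuisance parameters, and that the weight matrix is diagonal. Under those assumptions the GMM minimizer reduces to a precision-weighted mean, and the best allocation spends the whole budget on the source with the smallest cost times variance, which is unknown in advance. The names `gmm_estimate`, `collect`, `draw`, `explore_frac`, and `greedy` are hypothetical and chosen only for illustration.

```python
import numpy as np

def gmm_estimate(samples, variances):
    """Toy GMM estimate for moments psi_i = Z_i - theta with a diagonal,
    plug-in efficient weight matrix: a precision-weighted mean."""
    counts = np.array([len(s) for s in samples], dtype=float)
    means = np.array([np.mean(s) if len(s) > 0 else 0.0 for s in samples])
    precision = counts / np.asarray(variances, dtype=float)
    return float(np.sum(precision * means) / np.sum(precision))

def collect(draw, costs, budget, explore_frac=0.2, greedy=False):
    """Explore round-robin, then query the source that looks best.

    greedy=False: commit once after exploration (ETC-style).
    greedy=True: re-estimate after every query (a loose ETG-style analogue).
    """
    costs = np.asarray(costs, dtype=float)
    n_sources = len(costs)
    samples = [[] for _ in range(n_sources)]
    spent, k = 0.0, 0

    # Exploration phase: cycle through the sources on a fixed budget share.
    while spent + costs[k % n_sources] <= explore_frac * budget:
        i = k % n_sources
        samples[i].append(draw(i))
        spent += costs[i]
        k += 1

    def plug_in_best():
        # Plug-in rule for this toy setup: smallest cost * estimated variance.
        var_hat = np.array([np.var(s, ddof=1) for s in samples])
        return int(np.argmin(costs * var_hat))

    # Exploitation phase: spend the remaining budget on the chosen source.
    best = plug_in_best()
    while spent + costs[best] <= budget:
        samples[best].append(draw(best))
        spent += costs[best]
        if greedy:
            best = plug_in_best()

    var_hat = np.array([np.var(s, ddof=1) for s in samples])
    return gmm_estimate(samples, var_hat), [len(s) for s in samples]

# Usage: two hypothetical Gaussian sources with the same mean, different noise.
rng = np.random.default_rng(0)
theta_true, sigmas = 1.0, [1.0, 3.0]
draw = lambda i: theta_true + sigmas[i] * rng.normal()
for greedy in (False, True):
    estimate, counts = collect(draw, costs=[1.0, 1.0], budget=2000, greedy=greedy)
    print("ETG-style" if greedy else "ETC-style", round(estimate, 3), counts)
```

In this toy setup the optimal allocation sits at a single source, so the two modes usually behave alike. The settings treated in the paper involve general moment conditions and nuisance estimates, where the optimal allocation can mix several sources.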

Online Data Collection for Efficient Semiparametric Inference

Efficient combination of observational and experimental datasets under general restrictions on outcome mean functions

On Collaboration in Distributed Parameter Estimation with Resource Constraints

Policy Learning with Adaptively Collected Data

Off-policy estimation with adaptively collected data: the power of online learning

Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data

Combining Observational and Experimental Data to Improve Efficiency Using Imperfect Instruments

Semiparametric Efficient Inference in Adaptive Experiments

Online Policy Learning and Inference by Matrix Completion

Semiparametric Efficient Fusion of Individual Data and Summary Statistics

Globally-Optimal Greedy Experiment Selection for Active Sequential Estimation

Dynamic Sampling Policy for in Situ and Online Measurements Data Fusion in a Policy Network

Data-Driven Online Decision Making with Costly Information Acquisition

Online Updating of Statistical Inference in the Big Data Setting

Bayesian Online Multiple Testing: A Resource Allocation Approach

The Privacy Paradox and Optimal Bias–Variance Trade-offs in Data Acquisition

Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates

Online Estimation via Offline Estimation: An Information-Theoretic Framework

Estimation and Variable Selection for Semiparametric Transformation Models under a More Efficient Cohort Sampling Design

Online Causal Inference with Application to Near Real-Time Post-Market Vaccine Safety Surveillance

Efficient Multiple-Robust Estimation for Nonresponse Data Under Informative Sampling