Online Data Collection for Efficient Semiparametric Inference

Shantanu Gupta, Zachary C. Lipton, David Childers
2024-11-05
Abstract: While many works have studied statistical data fusion, they typically assume that the various datasets are given in advance. However, in practice, estimation requires difficult data collection decisions like determining the available data sources, their costs, and how many samples to collect from each source. Moreover, this process is often sequential because the data collected at a given time can improve collection decisions in the future. In our setup, given access to multiple data sources and budget constraints, the agent must sequentially decide which data source to query to efficiently estimate a target parameter. We formalize this task using Online Moment Selection, a semiparametric framework that applies to any parameter identified by a set of moment conditions. Interestingly, the optimal budget allocation depends on the (unknown) true parameters. We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps. We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy. We empirically validate our methods on both synthetic and real-world causal effect estimation tasks, demonstrating that the online data collection policies outperform their fixed counterparts.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the following problem: given multiple data sources and a budget constraint, how should data be collected to estimate a target parameter efficiently? Specifically, the authors focus on online data collection policies, which use the data gathered so far to decide which source to query next, so that statistical or causal parameters are estimated more efficiently.

### Problem Background

Traditional data fusion research usually assumes that each dataset is given in advance, ignoring the data collection decisions that arise in practice. For example, in medical testing, survey design, and other fields, data collection is costly and the budget is limited, so resources must be allocated carefully to estimate as accurately as possible. In addition, data collection is often sequential: data collected early on can inform later collection decisions.

### Paper Contributions

The paper proposes a framework called "Online Moment Selection (OMS)" to formalize this sequential decision-making problem. OMS builds on the Generalized Method of Moments (GMM) and applies to any parameter identified by a set of moment conditions, with the agent choosing which data source to query at each step. The optimal budget allocation depends on the unknown true parameters, so it has to be learned during collection. The paper proposes two online data collection policies:

- **Explore-then-Commit (ETC)**: Explore for an initial period, then use the resulting parameter estimates to choose an allocation over data sources and commit the remaining budget to it.
- **Explore-then-Greedy (ETG)**: After the exploration stage, keep updating the parameter estimates and adjust the choice of data source accordingly.

### Main Conclusions

The paper proves that both policies achieve zero regret, assessed by asymptotic MSE, relative to an oracle policy. In other words, their performance asymptotically matches the ideal benchmark.
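The Explore-then-Commit idea can be illustrated with a toy sketch. This is not the paper's GMM-based implementation. It assumes two unit-cost sources, each returning noisy draws with unknown mean and standard deviation, and a target equal to the difference of the two means. In that toy case the variance-minimizing share of the budget for source 1 is the Neyman allocation $\sigma_1/(\sigma_1+\sigma_2)$, which depends on the unknown standard deviations. All names (`neyman_fraction`, `explore_then_commit`, `sample1`, `sample2`, `explore_frac`) are hypothetical.

```python
import numpy as np


def neyman_fraction(sigma1, sigma2):
    """Share of the budget for source 1 that minimizes
    sigma1**2 / kappa + sigma2**2 / (1 - kappa) over kappa in (0, 1)."""
    return sigma1 / (sigma1 + sigma2)


def explore_then_commit(sample1, sample2, budget, explore_frac=0.2):
    """Toy ETC for estimating mu1 - mu2 with `budget` unit-cost queries.

    `sample1(n)` / `sample2(n)` are assumed to return n i.i.d. draws
    from source 1 / source 2 with non-degenerate noise.
    """
    # Exploration: split the first part of the budget evenly.
    n_explore = max(2, int(explore_frac * budget) // 2)  # per source
    x1 = list(sample1(n_explore))
    x2 = list(sample2(n_explore))

    # Plug-in estimate of the oracle allocation from exploration data.
    kappa_hat = neyman_fraction(np.std(x1, ddof=1), np.std(x2, ddof=1))

    # Commit: spend the remaining budget so that the final share of
    # source 1 is as close to kappa_hat as the exploration allows.
    remaining = budget - 2 * n_explore
    target1 = int(round(kappa_hat * budget))
    extra1 = int(np.clip(target1 - n_explore, 0, remaining))
    extra2 = remaining - extra1
    x1.extend(sample1(extra1))
    x2.extend(sample2(extra2))

    estimate = np.mean(x1) - np.mean(x2)
    return estimate, len(x1), len(x2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Source 2 is four times noisier, so it should get most of the budget.
    est, n1, n2 = explore_then_commit(
        lambda n: rng.normal(1.0, 1.0, size=n),
        lambda n: rng.normal(0.0, 4.0, size=n),
        budget=2000,
    )
    print(f"estimate={est:.3f}, n1={n1}, n2={n2}")
```

An Explore-then-Greedy variant of this sketch would re-estimate `kappa_hat` after every new sample instead of fixing it once after exploration.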
The experimental results also show that these online policies outperform fixed data collection policies on causal effect estimation tasks, with lower regret and mean squared error (MSE).

### Mathematical Formulas

Key formulas from the paper:

1. **Moment Conditions**

\[ g_t(\theta, \eta)=m(s_t)\odot\begin{bmatrix} \psi^{(1)}(Z_t^{(1)}; \theta, \eta^{(1)})\\ \vdots\\ \psi^{(M)}(Z_t^{(M)}; \theta, \eta^{(M)}) \end{bmatrix}=\tilde{g}_t(\theta, \eta) \]

where $\odot$ denotes element-wise multiplication, $\theta\in\Theta\subset\mathbb{R}^D$ is a finite-dimensional parameter, and $\eta = (\eta^{(1)}, \ldots, \eta^{(M)})$ is a nuisance parameter that may be high-dimensional or nonparametric.

2. **GMM Objective Function**

\[ \hat{\theta}_T=\arg\min_{\theta\in\Theta}Q_T(\theta,\{\hat{\eta}_{t - 1}\}_{t = 1}^T) \]

where

\[ Q_T(\theta,\{\hat{\eta}_{t - 1}\}_{t = 1}^T)=\left(\frac{1}{T}\sum_{t = 1}^T g_t(\theta,\hat{\eta}_{t - 1})\right)^\top W_T\left(\frac{1}{T}\sum_{t = 1}^T g_t(\theta,\hat{\eta}_{t - 1})\right) \]

3. **Asymptotic Normality**

\[ \sqrt{T}(\hat{\beta}_T-\beta^*)\xrightarrow{d}N_{\kappa_\infty}(0, V^*(\kap)