Robust data integration from multiple external sources for generalized linear models with binary outcomes

Kyuseong Choi, Jeremy M G Taylor, Peisong Han
DOI: https://doi.org/10.1093/biomtc/ujad005
IF: 1.701
2024-01-29
Biometrics
Abstract: We aim to estimate parameters in a generalized linear model (GLM) for a binary outcome when, in addition to the raw data from the internal study, more than 1 external study provides summary information in the form of parameter estimates from fitting GLMs with varying subsets of the internal study covariates. We propose an adaptive penalization method that exploits the external summary information and gains efficiency for estimation, and that is both robust and computationally efficient. The robust property comes from exploiting the relationship between parameters of a GLM and parameters of a GLM with omitted covariates and from downweighting external summary information that is less compatible with the internal data through a penalization. The computational burden associated with searching for the optimal tuning parameter for the penalization is reduced by using adaptive weights and by using an information criterion when searching for the optimal tuning parameter. Simulation studies show that the proposed estimator is robust against various types of population distribution heterogeneity and also gains efficiency compared to direct maximum likelihood estimation. The method is applied to improve a logistic regression model that predicts high-grade prostate cancer making use of parameter estimates from 2 external models.
statistics & probability,mathematical & computational biology,biology
What problem does this paper attempt to address?
The paper addresses how to integrate summary information from multiple external studies when fitting a generalized linear model (GLM) for a binary outcome, while keeping the estimates robust and the computation efficient. Specifically, it proposes an adaptive penalization method that uses external summary information to improve estimation efficiency and that remains robust when the internal and external data differ in distribution.

### Background and Motivation

In many scientific studies, especially in medicine, researchers must build prediction models from a limited internal data set. A small internal sample leads to inefficient parameter estimates. If summary information from multiple external studies (such as parameter estimates) can be used effectively, the estimation accuracy and predictive ability of the model can be improved.

### Main Problems

1. **How to effectively integrate external information**: When the internal data set is limited, how to use the summary information from multiple external studies to improve the estimation efficiency of the model parameters.
2. **Handling data distribution heterogeneity**: When the internal and external data differ in distribution, how to keep the estimates robust and avoid the bias that this inconsistency would otherwise introduce.

### Solutions

The paper proposes an adaptive penalization method with the following main steps:

1. **Reparameterization**: Through orthogonalization, the parameters of the internal model are re-expressed in a form that relates them to the parameters of each external model.
2. **Constructing the penalty term**: A penalty term is designed that grows when the external information is inconsistent with the internal data, which reduces reliance on inconsistent external information.
3. **Adaptive weights**: Adaptive weights adjust the contribution of each external study according to how consistent its information is with the internal data.
4. **Selecting the optimal tuning parameter**: The generalized information criterion (GIC) is used to select the tuning parameter, which balances the internal data likelihood against the external information penalty.

### Mathematical Expressions

- **Internal model**: \[ g_I\{E_I[Y|X]\}=\beta_{I0}+\sum_{j=1}^p\beta_{Ij}X_j \]
- **External model**: \[ h^{(k)}\{E^{(k)}[Y|X^{(k)}]\}=\theta^{(k)}_0+\sum_{j\in I^{(k)}}\theta^{(k)}_jX_j \]
- **Reparameterization**: \[ g_I\{E_I[Y|X]\}=\phi^{(k)}_0+\sum_{j\in I^{(k)}}\phi^{(k)}_jX_j+\sum_{\ell\in I_c^{(k)}}\phi^{(k)}_\ell W_\ell \] where \(\phi^{(k)}_j=\beta_{Ij}+\sum_{\ell\in I_c^{(k)}}\beta_{I\ell}\gamma^{(k)}_{j\ell}\) for \(j\in I^{(k)}\) and \(\phi^{(k)}_\ell=\beta_{I\ell}\) for \(\ell\in I_c^{(k)}\).
- **Penalty term**: \[ P_n^{(k)}(\beta_I)=\sum_{r\neq s\in I^{(k)}}\left(\frac{\phi^{(k)}_r(\beta_I)\hat{\theta}^{(k)}_s-\phi^{(k)}_s(\beta_I)\hat{\theta}^{(k)}_r}{\phi^{(k)}_r(\beta_I)\hat{\theta}^{(k)}_s+\phi^{(k)}_s(\beta_I)\hat{\theta}^{(k)}_r}\right)^2 \]
- **Adaptive weights**: \[ w_n^{(k)}=\frac{P_n^{(k)}(\h