Abstract: We aim to estimate parameters in a generalized linear model (GLM) for a binary outcome when, in addition to the raw data from the internal study, more than 1 external study provides summary information in the form of parameter estimates from fitting GLMs with varying subsets of the internal study covariates. We propose an adaptive penalization method that exploits the external summary information and gains efficiency for estimation, and that is both robust and computationally efficient. The robust property comes from exploiting the relationship between parameters of a GLM and parameters of a GLM with omitted covariates and from downweighting external summary information that is less compatible with the internal data through a penalization. The computational burden associated with searching for the optimal tuning parameter for the penalization is reduced by using adaptive weights and by using an information criterion when searching for the optimal tuning parameter. Simulation studies show that the proposed estimator is robust against various types of population distribution heterogeneity and also gains efficiency compared to direct maximum likelihood estimation. The method is applied to improve a logistic regression model that predicts high-grade prostate cancer making use of parameter estimates from 2 external models.

What problem does this paper attempt to address?

This paper addresses how to integrate summary information from multiple external studies when fitting a generalized linear model (GLM) for a binary outcome, while keeping the estimates robust and the computation efficient. Specifically, the paper proposes an adaptive penalization method that uses external summary information to improve estimation efficiency and remains robust when the internal and external populations differ in distribution.

### Background and Motivation

In many scientific studies, especially in medicine, researchers need to build prediction models from a limited internal data set. Such data sets often have a small sample size, which leads to low estimation efficiency. If summary information from multiple external studies (such as parameter estimates) can be used effectively, the estimation accuracy and predictive ability of the model can be improved.

### Main Problems

1. **How to effectively integrate external information**: When the internal data set is limited, how to use summary information from multiple external studies to improve the efficiency of the parameter estimates.
2. **Handling data distribution heterogeneity**: When the internal and external data follow different distributions, how to keep the estimates robust and avoid bias caused by the mismatch.

### Solutions

The paper proposes an adaptive penalization method with the following main steps:

1. **Reparameterization**: Through orthogonalization, the parameters of the internal model are re-expressed in a form that can be compared with the parameters of each external study.
2. **Constructing the penalty term**: Design a penalty term that grows when the external information is inconsistent with the internal data, which reduces the reliance on inconsistent external information.
3. **Adaptive weights**: Introduce adaptive weights that adjust the contribution of each external study according to how consistent its information is with the internal data.
4. **Selecting the optimal tuning parameter**: Use the generalized information criterion (GIC) to select the tuning parameter that balances the internal data likelihood against the external information penalty.

### Mathematical Expressions

- **Internal model**: \[ g_I\{E_I[Y|X]\}=\beta_{I0}+\sum_{j = 1}^p\beta_{Ij}X_j \]
- **External model**: \[ h^{(k)}\{E^{(k)}[Y|X^{(k)}]\}=\theta^{(k)}_0+\sum_{j\in I^{(k)}}\theta^{(k)}_jX_j \]
- **Reparameterization**: \[ g_I\{E_I[Y|X]\}=\phi^{(k)}_0+\sum_{j\in I^{(k)}}\phi^{(k)}_jX_j+\sum_{\ell\in I_c^{(k)}}\phi^{(k)}_\ell W_\ell \] where \(\phi^{(k)}_j=\beta_{Ij}+\sum_{\ell\in I_c^{(k)}}\beta_{I\ell}\gamma^{(k)}_{j\ell}\) and \(\phi^{(k)}_\ell=\beta_{I\ell}\).
- **Penalty term**: \[ P_n^{(k)}(\beta_I)=\sum_{r\neq s\in I^{(k)}}\left(\frac{\phi^{(k)}_r(\beta_I)\hat{\theta}^{(k)}_s-\phi^{(k)}_s(\beta_I)\hat{\theta}^{(k)}_r}{\phi^{(k)}_r(\beta_I)\hat{\theta}^{(k)}_s+\phi^{(k)}_s(\beta_I)\hat{\theta}^{(k)}_r}\right)^2 \]
- **Adaptive weights**: \[ w_n^{(k)}=\frac{P_n^{(k)}(\h
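
As a rough illustration of the reparameterization and penalty formulas listed above, the following Python sketch evaluates \(\phi^{(k)}_j\) for the covariates included in one external study and then the penalty \(P_n^{(k)}\). It is only a sketch under stated assumptions, not the authors' implementation: the function names `phi_included` and `ratio_penalty` are hypothetical, the matrix layout `gamma[j, l]` for \(\gamma^{(k)}_{j\ell}\) is an assumed convention, the intercept is left out, and the sum over \(r\neq s\) is read as running over ordered pairs.

```python
import numpy as np


def phi_included(beta, gamma, included):
    """phi_j = beta_j + sum over omitted l of beta_l * gamma[j, l], for j in `included`.

    beta     : shape (p,), internal covariate coefficients (intercept excluded).
    gamma    : shape (p, p); gamma[j, l] is assumed to hold gamma_{jl}^{(k)}.
               Only entries with j included and l omitted are read.
    included : indices of the covariates used by external study k.
    """
    beta = np.asarray(beta, dtype=float)
    included = np.asarray(included)
    omitted = np.setdiff1d(np.arange(beta.shape[0]), included)
    return beta[included] + gamma[np.ix_(included, omitted)] @ beta[omitted]


def ratio_penalty(phi, theta_hat):
    """Sum over ordered pairs r != s of
    ((phi_r*theta_s - phi_s*theta_r) / (phi_r*theta_s + phi_s*theta_r))**2.

    No guard is applied for a zero denominator; that case is left unhandled here.
    """
    phi = np.asarray(phi, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    cross = np.outer(phi, theta_hat)          # cross[r, s] = phi_r * theta_s
    num = cross - cross.T                     # phi_r*theta_s - phi_s*theta_r
    den = cross + cross.T                     # phi_r*theta_s + phi_s*theta_r
    off_diag = ~np.eye(phi.shape[0], dtype=bool)
    return float(np.sum((num[off_diag] / den[off_diag]) ** 2))


# Illustrative numbers only (not from the paper): three internal covariates,
# of which the external study uses the first two.
beta = np.array([0.5, 1.0, -0.8])
gamma = np.zeros((3, 3))
gamma[0, 2] = 0.25
gamma[1, 2] = -0.5

phi = phi_included(beta, gamma, included=[0, 1])
proportional_theta = 2.0 * phi                # same coefficient ratios as phi
mismatched_theta = np.array([1.0, 1.0])       # different coefficient ratios

print(phi)
print(ratio_penalty(phi, proportional_theta))
print(ratio_penalty(phi, mismatched_theta))
```

Because each term depends only on ratios of coefficients, the penalty is zero whenever the external estimates are proportional to \(\phi^{(k)}\) and grows as the ratios diverge, which fits the summary's description of a penalty that increases when external information is inconsistent with the internal data.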

Robust data integration from multiple external sources for generalized linear models with binary outcomes
