Samuel I. Berchuck, Felipe A. Medeiros, Sayan Mukherjee, Andrea Agazzi
Abstract: The generalized linear mixed model (GLMM) is a popular statistical approach
for handling correlated data, and is used extensively in application areas
where big data is common, including biomedical data settings. The focus of this
paper is scalable statistical inference for the GLMM, where we define
statistical inference as: (i) estimation of population parameters, and (ii)
evaluation of scientific hypotheses in the presence of uncertainty. Artificial
intelligence (AI) learning algorithms excel at scalable statistical estimation,
but rarely include uncertainty quantification. In contrast, Bayesian inference
provides full statistical inference, since uncertainty quantification results
automatically from the posterior distribution. Unfortunately, Bayesian
inference algorithms, including Markov Chain Monte Carlo (MCMC), become
computationally intractable in big data settings. In this paper, we introduce a
statistical inference algorithm at the intersection of AI and Bayesian
inference that combines the scalability of modern AI algorithms with the
guaranteed uncertainty quantification that accompanies Bayesian inference. Our
algorithm is an extension of stochastic gradient MCMC with novel contributions
that address the treatment of correlated data (i.e., intractable marginal
likelihood) and proper posterior variance estimation. Through theoretical and
empirical results we establish our algorithm's statistical inference
properties, and apply the method in a large electronic health records database.
What problem does this paper attempt to address?
The paper addresses the computational cost of Bayesian inference for the generalized linear mixed model (GLMM) in big data settings. Specifically, it asks how to achieve scalable statistical inference while retaining uncertainty quantification when the data are correlated (for example, repeated measurements on the same subject). Traditional Bayesian inference algorithms, such as Markov chain Monte Carlo (MCMC), become computationally intractable at this scale. The authors therefore propose a method that combines modern AI learning algorithms (stochastic gradient methods in particular) with Bayesian inference, aiming to keep the scalability of the former and the uncertainty quantification of the latter.
### Main Contributions
1. **Gradient Estimation**: A Monte Carlo estimator is introduced for the gradient of the marginal log-likelihood, which is intractable for correlated data. This enables Stochastic Gradient Langevin Dynamics (SGLD) to be applied in the GLMM setting.
2. **Noise Correction**: The structure of the noise injected into the SGLD update is characterized, and an asymptotic correction for the resulting bias in the estimated posterior covariance is derived for large data sets.
3. **Empirical Results**: The statistical inference properties of the algorithm are demonstrated through theoretical and empirical results, and the method is applied in a large electronic health record database.
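
To make the first contribution concrete, here is a minimal sketch of one generic way to estimate the gradient of a marginal log-likelihood by Monte Carlo. It is not the authors' estimator, which this summary does not specify. It relies on Fisher's identity (the marginal score equals the posterior expectation of the conditional score over the random effect) and on self-normalized importance sampling with the random-effect prior as proposal. The toy model, a Gaussian random intercept with \(Y_t|\gamma\sim N(\beta+\gamma,1)\) and \(\gamma\sim N(0,\tau^2)\), and all names (`exact_grad_beta`, `mc_grad_beta`, `tau2`) are assumptions chosen because the exact gradient is available for comparison.

```python
import numpy as np

def exact_grad_beta(y, beta, tau2):
    """Closed-form d/d(beta) of log p(y | beta) for the Gaussian random-intercept toy model."""
    m = y.size
    return np.sum(y - beta) / (1.0 + m * tau2)

def mc_grad_beta(y, beta, tau2, n_samples, rng):
    """Self-normalized importance-sampling estimate of the same gradient.

    Fisher's identity: the marginal score is the expectation, under the
    posterior of the random effect gamma, of the conditional score.
    """
    # Proposal (an assumption of this sketch): the random-effect prior N(0, tau2).
    gamma = rng.normal(0.0, np.sqrt(tau2), size=n_samples)
    resid = y[None, :] - beta - gamma[:, None]      # shape (n_samples, m)
    # Log importance weights: log p(y | gamma, beta) up to an additive constant.
    log_w = -0.5 * np.sum(resid**2, axis=1)
    w = np.exp(log_w - log_w.max())                 # subtract max for stability
    w /= w.sum()
    # Conditional score: d/d(beta) log p(y | gamma, beta) = sum_t (y_t - beta - gamma).
    cond_score = resid.sum(axis=1)
    return float(np.sum(w * cond_score))

rng = np.random.default_rng(0)
y = np.array([0.3, 1.1, -0.2, 0.8, 0.5])            # hypothetical measurements for one subject
grad_exact = exact_grad_beta(y, beta=0.0, tau2=1.0)
grad_mc = mc_grad_beta(y, beta=0.0, tau2=1.0, n_samples=200_000, rng=rng)
print(grad_exact, grad_mc)
```

The self-normalized estimator is consistent but not unbiased at a finite number of samples, so a practical method would need to account for that extra noise in the gradient.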
### Background and Notation
- **Data Representation**: Consider a database \(Y=(Y_1,\ldots,Y_n)\) of size \(n\), where \(Y_i\) represents the \(i\)-th observation.
- **Parameter Estimation**: The goal is to estimate the posterior distribution \(\pi(\Omega):=p(\Omega|Y)\propto p(\Omega)\prod_{i = 1}^n p(Y_i|\Omega)\), where \(\Omega\) is the vector of population parameters.
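- **Posterior Density**: The posterior can be rewritten as \(\pi(\Omega)\propto\exp\{-f(\Omega)\}\), where \(f(\Omega)=\sum_{i = 0}^n f_i(\Omega)\), with \(f_0(\Omega)=-\log p(\Omega)\) for the prior and \(f_i(\Omega)=-\log p(Y_i|\Omega)\) for \(i=1,\ldots,n\).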
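
Writing \(f\) as a sum is what makes stochastic gradient methods possible: a random subset of the terms gives an unbiased estimate of \(\nabla f\). The sketch below illustrates this on a hypothetical conjugate toy model, \(Y_i\sim N(\Omega,1)\) with prior \(\Omega\sim N(0,100)\), reading \(f_0\) as the negative log prior and \(f_i\) as the negative log-likelihood of observation \(i\). The model, the batch size, and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

PRIOR_VAR = 100.0  # hypothetical prior: Omega ~ N(0, 100)

def grad_f0(omega):
    """Gradient of f_0(omega) = -log p(omega), dropping constants."""
    return omega / PRIOR_VAR

def grad_fi(omega, y):
    """Gradient of f_i(omega) = -log p(y_i | omega) for y_i ~ N(omega, 1); vectorized over y."""
    return omega - y

def grad_f(omega, y):
    """Full-data gradient: the prior term plus all n likelihood terms."""
    return grad_f0(omega) + np.sum(grad_fi(omega, y))

def minibatch_grad_f(omega, y, batch_size, rng):
    """Unbiased estimate of grad_f from a uniform minibatch drawn without replacement."""
    n = y.size
    idx = rng.choice(n, size=batch_size, replace=False)
    # Rescale the minibatch sum by n / batch_size so its expectation is the full sum.
    return grad_f0(omega) + (n / batch_size) * np.sum(grad_fi(omega, y[idx]))

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=1_000)
# Closed-form posterior mean (and mode) for this conjugate toy model.
posterior_mean = y.sum() / (y.size + 1.0 / PRIOR_VAR)
full = grad_f(0.0, y)
estimates = [minibatch_grad_f(0.0, y, 50, rng) for _ in range(2_000)]
print(grad_f(posterior_mean, y), full, np.mean(estimates))
```

The gradient vanishes at the posterior mode, and the average of the minibatch estimates approaches the full-data gradient, although each individual estimate is noisy.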
### Stochastic Gradient Langevin Dynamics (SGLD)
- **Langevin Diffusion**: Defined as the stochastic differential equation \(d\Omega_t=-\nabla f(\Omega_t)dt+\sqrt{2}dB_t\), where \(\nabla f(\Omega)\) is the gradient of \(f\) with respect to \(\Omega\), and \(B_t\) is a \(d\)-dimensional Brownian motion, with \(d\) the dimension of \(\Omega\).
- **Discretization**: Applying the Euler-Maruyama discretization with step size \(\epsilon>0\) gives \(\Omega_{k + 1}=\Omega_k-\epsilon\nabla f(\Omega_k)+\sqrt{2\epsilon}\eta_k\), where \(\eta_k\sim N_d(0,I_d)\).
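
The discretized update can be sketched in a few lines. The example below iterates it with the full gradient on a hypothetical one-dimensional Gaussian target \(N(1, 0.25)\), for which \(\nabla f(\Omega)=(\Omega-1)/0.25\). The target, the step size, the burn-in length, and the function name `euler_maruyama_langevin` are illustrative assumptions. SGLD would replace `grad_f` with a noisy minibatch estimate.

```python
import numpy as np

def euler_maruyama_langevin(grad_f, omega0, step_size, n_steps, rng):
    """Iterate Omega_{k+1} = Omega_k - eps * grad_f(Omega_k) + sqrt(2 * eps) * eta_k."""
    omega = np.array(omega0, dtype=float)
    draws = np.empty((n_steps, omega.size))
    for k in range(n_steps):
        eta = rng.standard_normal(omega.size)       # eta_k ~ N_d(0, I_d)
        omega = omega - step_size * grad_f(omega) + np.sqrt(2.0 * step_size) * eta
        draws[k] = omega
    return draws

MU, SIGMA2 = 1.0, 0.25                              # hypothetical Gaussian target N(MU, SIGMA2)

def grad_f(omega):
    """Gradient of f(omega) = (omega - MU)^2 / (2 * SIGMA2)."""
    return (omega - MU) / SIGMA2

rng = np.random.default_rng(2)
draws = euler_maruyama_langevin(grad_f, omega0=np.zeros(1), step_size=0.01,
                                n_steps=200_000, rng=rng)
kept = draws[1_000:]                                # discard a short burn-in
print(kept.mean(), kept.var())
```

For this Gaussian target the discretized chain is a first-order autoregression whose stationary variance is \(\sigma^2/(1-\epsilon/(2\sigma^2))\approx 0.255\), slightly above the true value of 0.25. This small example shows why the step size affects posterior variance estimates even before any minibatch noise is added.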
### Generalized Linear Mixed Model (GLMM)
- **Model Setup**: Each subject \(i\) has \(n_i\) repeated measurements \(Y_{it}\), and \(x_{it}\) and \(z_{it}\) are \(p\)-dimensional and \(q\)-dimensional covariate vectors respectively.
- **Conditional Independence Assumption**: \(Y_{it}\) comes from an exponential family distribution, and its probability density function is \(p(Y_{it}|\theta_{it},\phi)=\exp\left\{\frac{Y_{it}\theta_{it}-b(\theta_{it})}{a(\phi)}+c(Y_{it},\phi)\right\}\).
- **Linear Predictor**: \(\