Scalable Bayesian inference for the generalized linear mixed model

Samuel I. Berchuck, Felipe A. Medeiros, Sayan Mukherjee, Andrea Agazzi
2024-03-05
Abstract: The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in application areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov Chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference, that leverages the scalability of modern AI algorithms with guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results we establish our algorithm's statistical inference properties, and apply the method in a large electronic health records database.
Machine Learning, Computation, Methodology
What problem does this paper attempt to address?
The paper addresses the computational cost of Bayesian inference for the generalized linear mixed model (GLMM) in big data settings. Specifically, it asks how to achieve scalable statistical inference with uncertainty quantification when the data are correlated (for example, repeated measurements). Traditional Bayesian inference methods, such as Markov chain Monte Carlo (MCMC), become computationally infeasible in big data scenarios. The authors therefore propose a method that combines modern artificial intelligence algorithms (especially stochastic gradient descent) with Bayesian inference, aiming to pair the scalability of the former with the uncertainty quantification of the latter.

### Main Contributions

1. **Gradient Estimation**: A Monte Carlo estimator of the gradient of the marginal log-likelihood is introduced, enabling Stochastic Gradient Langevin Dynamics (SGLD) to be applied in the GLMM setting.
2. **Noise Correction**: The structure of the noise injected into the SGLD update is characterized, and an asymptotic correction for the resulting bias in the posterior covariance estimate is derived for large data sets.
3. **Empirical Results**: The statistical inference properties of the algorithm are established through theoretical and empirical results, and the method is applied to a large electronic health records database.

### Background and Notation

- **Data Representation**: Consider a database \(Y=(Y_1,\ldots,Y_n)\) of size \(n\), where \(Y_i\) is the \(i\)-th observation.
- **Parameter Estimation**: The goal is to estimate the posterior distribution \(\pi(\Omega):=p(\Omega|Y)\propto p(\Omega)\prod_{i=1}^n p(Y_i|\Omega)\), where \(\Omega\) is the vector of population parameters.
- **Posterior Density**: The posterior can be rewritten as \(\pi(\Omega)\propto\exp\{-f(\Omega)\}\), where \(f(\Omega)=\sum_{i=0}^n f_i(\Omega)\). Matching terms with the posterior above, \(f_0(\Omega)=-\log p(\Omega)\) is the prior term and \(f_i(\Omega)=-\log p(Y_i|\Omega)\) for \(i=1,\ldots,n\).
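The decomposition \(f(\Omega)=\sum_{i=0}^n f_i(\Omega)\) can be made concrete with a small sketch. The toy model below (a Gaussian mean with a Gaussian prior) is an assumption chosen for illustration and is not the GLMM treated in the paper. It reads \(f_0\) as the negative log prior and \(f_i\) as the negative log-likelihood of \(Y_i\). The function `minibatch_gradient` is a hypothetical name for the standard subsampled gradient estimator used in stochastic gradient methods; it is not claimed to be the paper's Monte Carlo estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumption): Y_i ~ N(omega, 1), prior omega ~ N(0, PRIOR_SD^2).
n = 1000
Y = rng.normal(loc=2.0, scale=1.0, size=n)
PRIOR_SD = 10.0


def grad_f0(omega):
    """Gradient of f_0(omega) = -log p(omega), the prior term."""
    return omega / PRIOR_SD**2


def grad_fi(omega, y):
    """Gradient of f_i(omega) = -log p(y | omega), elementwise in y."""
    return omega - y


def full_gradient(omega):
    """Exact gradient of f: prior term plus all n likelihood terms."""
    return grad_f0(omega) + np.sum(grad_fi(omega, Y))


def minibatch_gradient(omega, batch_size, rng):
    """Unbiased estimate of the full gradient from a random subsample.

    The likelihood terms are rescaled by n / batch_size so that the
    expectation over subsamples equals the full sum.
    """
    idx = rng.choice(n, size=batch_size, replace=False)
    return grad_f0(omega) + (n / batch_size) * np.sum(grad_fi(omega, Y[idx]))


omega = 0.5
exact = full_gradient(omega)
estimates = np.array([minibatch_gradient(omega, 50, rng) for _ in range(5000)])

# The posterior mode of this conjugate toy model, where the gradient vanishes.
omega_star = np.sum(Y) / (n + 1.0 / PRIOR_SD**2)

print("exact gradient:        ", exact)
print("mean minibatch estimate:", estimates.mean())
print("gradient at the mode:   ", full_gradient(omega_star))
```

Each minibatch estimate touches only 50 of the 1000 observations, which is where the scalability comes from; the cost is extra noise in the gradient.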
### Stochastic Gradient Langevin Dynamics (SGLD)

- **Langevin Diffusion**: Defined by the stochastic differential equation \(d\Omega_t=-\nabla f(\Omega_t)dt+\sqrt{2}dB_t\), where \(\nabla f(\Omega)\) is the gradient of \(f\) with respect to \(\Omega\), \(B_t\) is a \(d\)-dimensional Brownian motion, and \(d\) is the dimension of \(\Omega\).
- **Discretization**: The Euler-Maruyama discretization with step size \(\epsilon>0\) gives \(\Omega_{k+1}=\Omega_k-\epsilon\nabla f(\Omega_k)+\sqrt{2\epsilon}\eta_k\), where \(\eta_k\sim N_d(0,I_d)\). SGLD replaces \(\nabla f(\Omega_k)\) in this update with an unbiased estimate computed from a subsample of the data.

### Generalized Linear Mixed Model (GLMM)

- **Model Setup**: Each subject \(i\) has \(n_i\) repeated measurements \(Y_{it}\), and \(x_{it}\) and \(z_{it}\) are \(p\)-dimensional and \(q\)-dimensional covariate vectors, respectively.
- **Conditional Independence Assumption**: The \(Y_{it}\) are assumed to be conditionally independent draws from an exponential family distribution with density \(p(Y_{it}|\theta_{it},\phi)=\exp\left\{\frac{Y_{it}\theta_{it}-b(\theta_{it})}{a(\phi)}+c(Y_{it},\phi)\right\}\).
- **Linear Predictor**: \(\