Abstract: The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in application areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov Chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference, that leverages the scalability of modern AI algorithms with guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results we establish our algorithm's statistical inference properties, and apply the method in a large electronic health records database.

What problem does this paper attempt to address?

The paper addresses the inefficiency of Bayesian inference for the generalized linear mixed model (GLMM) in big-data settings. Specifically, it focuses on achieving scalable statistical inference while retaining uncertainty quantification for correlated data (such as repeated measurements). Traditional Bayesian inference methods, such as Markov chain Monte Carlo (MCMC), become computationally infeasible in big-data scenarios. The authors therefore propose a method that combines modern artificial intelligence algorithms (especially stochastic gradient descent) with Bayesian inference, aiming to pair the scalability of the former with the uncertainty quantification of the latter.

### Main Contributions

1. **Gradient Estimation**: A Monte Carlo estimator of the gradient of the marginal log-likelihood is introduced, enabling Stochastic Gradient Langevin Dynamics (SGLD) to be applied in the GLMM setting.
2. **Noise Correction**: The structure of the noise injected into the SGLD update is characterized, and an asymptotic correction for the biased estimate of the posterior covariance is derived for large data sets.
3. **Empirical Results**: The statistical inference properties of the algorithm are demonstrated through theoretical and empirical results, and the method is applied to a large electronic health records database.

### Background and Notation

- **Data Representation**: Consider a database \(Y=(Y_1,\ldots,Y_n)\) of size \(n\), where \(Y_i\) represents the \(i\)-th observation.
- **Parameter Estimation**: The goal is to estimate the posterior distribution \(\pi(\Omega):=p(\Omega|Y)\propto p(\Omega)\prod_{i=1}^n p(Y_i|\Omega)\), where \(\Omega\) is the vector of population parameters.
- **Posterior Density**: It can be rewritten as \(\pi(\Omega)\propto\exp\{-f(\Omega)\}\), where \(f(\Omega)=\sum_{i=0}^n f_i(\Omega)\).
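To make the decomposition \(f(\Omega)=\sum_{i=0}^n f_i(\Omega)\) concrete, here is a minimal Python sketch on a toy model. Everything specific in it is an assumption for illustration and not taken from the paper: the model \(Y_i\sim N(\Omega,1)\) with prior \(\Omega\sim N(0,\texttt{prior\_sd}^2)\), the reading of \(f_0\) as the negative log prior, and the hypothetical helper names `potential_terms`, `grad_f`, and `minibatch_grad_f`. It shows the kind of rescaled minibatch gradient that stochastic gradient methods typically rely on.

```python
import numpy as np

def potential_terms(omega, y, prior_sd=10.0):
    """Return the terms f_0, ..., f_n (up to additive constants) for a toy model.

    Assumption: f_0 is the negative log prior, N(0, prior_sd^2), and
    f_i is the negative log-likelihood of Y_i ~ N(omega, 1).
    """
    f0 = 0.5 * (omega / prior_sd) ** 2
    fi = 0.5 * (y - omega) ** 2
    return np.concatenate(([f0], fi))

def grad_f(omega, y, prior_sd=10.0):
    """Exact gradient of f(omega) = sum of all potential terms."""
    return omega / prior_sd**2 + np.sum(omega - y)

def minibatch_grad_f(omega, y, batch_idx, prior_sd=10.0):
    """Gradient estimate from a subset of observations.

    The likelihood part is rescaled by n / batch size so that the estimate
    is unbiased when batch_idx is drawn uniformly at random.
    """
    n = y.shape[0]
    scale = n / len(batch_idx)
    return omega / prior_sd**2 + scale * np.sum(omega - y[batch_idx])

y = np.array([0.5, 1.5, 2.0, -1.0])
omega = 0.3
full_gradient = grad_f(omega, y)
# Averaging the estimate over every possible size-1 batch recovers the full gradient.
average_estimate = np.mean(
    [minibatch_grad_f(omega, y, np.array([i])) for i in range(len(y))]
)
```

In the GLMM setting the per-observation terms involve an intractable marginal likelihood, which is why the paper introduces a Monte Carlo gradient estimator; the toy model here sidesteps that difficulty on purpose.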
### Stochastic Gradient Langevin Dynamics (SGLD)

- **Langevin Diffusion**: Defined as the stochastic differential equation \(d\Omega_t=-\nabla f(\Omega_t)dt+\sqrt{2}dB_t\), where \(\nabla f(\Omega)\) is the gradient of \(f\) with respect to \(\Omega\), and \(B_t\) is a \(d\)-dimensional Brownian motion.
- **Discretization**: Using the Euler-Maruyama discretization method, we get \(\Omega_{k+1}=\Omega_k-\epsilon\nabla f(\Omega_k)+\sqrt{2\epsilon}\eta_k\), where \(\eta_k\sim N_d(0,I_d)\).

### Generalized Linear Mixed Model (GLMM)

- **Model Setup**: Each subject \(i\) has \(n_i\) repeated measurements \(Y_{it}\), and \(x_{it}\) and \(z_{it}\) are \(p\)-dimensional and \(q\)-dimensional covariate vectors respectively.
- **Conditional Independence Assumption**: \(Y_{it}\) comes from an exponential family distribution, and its probability density function is \(p(Y_{it}|\theta_{it},\phi)=\exp\left\{\frac{Y_{it}\theta_{it}-b(\theta_{it})}{a(\phi)}+c(Y_{it},\phi)\right\}\).
- **Linear Predictor**: \(\
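The Euler-Maruyama update from the SGLD section, \(\Omega_{k+1}=\Omega_k-\epsilon\nabla f(\Omega_k)+\sqrt{2\epsilon}\eta_k\), can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: it uses the exact gradient rather than a stochastic one, the target is a toy Gaussian \(N(\mu, I_2)\) so that \(\nabla f(\Omega)=\Omega-\mu\), and the function name `langevin_sample`, the step size, and the burn-in length are all hypothetical choices.

```python
import numpy as np

def langevin_sample(grad_f, omega0, step, n_steps, rng):
    """Iterate omega <- omega - step * grad_f(omega) + sqrt(2 * step) * eta,
    with eta ~ N_d(0, I_d), and return every iterate as a row."""
    omega = np.array(omega0, dtype=float)
    d = omega.shape[0]
    draws = np.empty((n_steps, d))
    for k in range(n_steps):
        eta = rng.standard_normal(d)  # fresh Gaussian noise at each step
        omega = omega - step * grad_f(omega) + np.sqrt(2.0 * step) * eta
        draws[k] = omega
    return draws

# Toy target (assumption): pi(omega) proportional to exp{-0.5 * ||omega - mu||^2}.
mu = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
draws = langevin_sample(lambda w: w - mu, np.zeros(2), step=0.05, n_steps=20000, rng=rng)

kept = draws[2000:]                   # discard an initial burn-in stretch
posterior_mean = kept.mean(axis=0)    # should be close to mu
posterior_var = kept.var(axis=0)      # close to, but slightly above, 1
```

For this Gaussian target the discretized chain has stationary variance \(1/(1-\epsilon/2)\) rather than 1, so even with exact gradients the sampled variance is slightly inflated. Replacing the exact gradient with a noisy minibatch estimate adds further variance, which is the kind of effect the paper's noise correction is aimed at.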

Scalable Bayesian inference for the generalized linear mixed model
