Variational Bayes latent class approach for EHR-based phenotyping with large real-world data

Brian Buckley, Adrian O'Hagan, Marie Galligan
2023-04-08
Abstract: Bayesian approaches to clinical analyses for the purposes of patient phenotyping have been limited by the computational challenges associated with applying the Markov chain Monte Carlo (MCMC) approach to large real-world data. Approximate Bayesian inference via optimization of the variational evidence lower bound, often called Variational Bayes (VB), has been successfully demonstrated for other applications. We investigate the performance and characteristics of currently available R and Python VB software for variational Bayesian Latent Class Analysis (LCA) of realistically large real-world observational data. We used a real-world data set, Optum™ electronic health records (EHR), containing pediatric patients with risk indicators for type 2 diabetes mellitus, a form of diabetes that is rare in pediatric patients. The aim of this work is to validate a Bayesian patient phenotyping model for generality and extensibility and, crucially, to show that it can be applied to a realistically large real-world clinical data set. We find that currently available automatic VB methods are very sensitive to initial starting conditions, model definition, algorithm hyperparameters and choice of gradient optimiser. The Bayesian LCA model was challenging to implement using VB, but we achieved reasonable results with very good computational performance compared to MCMC.
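For readers unfamiliar with the term, the evidence lower bound mentioned in the abstract can be written down with the standard identity below. The symbols are notation introduced here for illustration and are not taken from the paper: x is the observed data, θ collects all unknown quantities (class weights, item probabilities and class memberships in an LCA), and q(θ) is the approximating distribution.

```latex
\begin{aligned}
\log p(x) &= \mathcal{L}(q) + \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big), \\
\mathcal{L}(q) &= \mathbb{E}_{q}\big[\log p(x, \theta)\big] - \mathbb{E}_{q}\big[\log q(\theta)\big].
\end{aligned}
```

Because the KL term is non-negative and log p(x) does not depend on q, maximizing the bound over q is equivalent to minimizing the KL divergence from q to the exact posterior. This turns posterior inference into an optimization problem, which is why the choice of optimizer and starting point matters.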
Applications
What problem does this paper attempt to address?
The paper addresses the computational challenges of applying Bayesian Latent Class Analysis (LCA) to patient phenotyping on large real-world clinical datasets. Traditional Markov chain Monte Carlo (MCMC) methods become computationally expensive at this scale, whereas Variational Bayes (VB) methods approximate the posterior by optimizing the variational evidence lower bound, which can be considerably faster. The goal of the paper is to validate a general and extensible Bayesian LCA patient phenotyping model and to demonstrate that it can be applied to a realistically large clinical dataset. The study uses Optum electronic health records containing pediatric patients with risk indicators for type 2 diabetes, which is rare in that population. The research also examines how sensitive currently available automatic VB methods are to initial conditions, model definition, algorithm hyperparameters, and the choice of gradient optimizer. Overall, the paper investigates whether VB can fit the Bayesian LCA model to large-scale real-world data and how its results and computational performance compare with MCMC.
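To make the idea of variational Bayesian LCA concrete, here is a minimal sketch in plain Python. It is not the authors' method: the paper evaluates automatic, gradient-based VB software, whereas this sketch swaps in closed-form coordinate-ascent updates for a mean-field approximation. It assumes binary indicators, a Dirichlet prior on class weights and Beta priors on item probabilities. All names (`digamma`, `fit_lca_cavi`, `simulate`), the prior values and the synthetic data are hypothetical and chosen only for illustration.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def fit_lca_cavi(X, n_classes, n_iter=100, alpha0=1.0, a0=1.0, b0=1.0, seed=0):
    """Mean-field coordinate-ascent VB for a latent class model, binary items.

    Assumed priors: Dirichlet(alpha0) on class weights, Beta(a0, b0) on each
    item probability. Returns (resp, alpha, a, b).
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    K = range(n_classes)
    # Random soft initialisation; the fitted solution can depend on this seed.
    resp = []
    for _ in range(n):
        row = [rng.random() + 0.1 for _ in K]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # Update q(pi) = Dirichlet(alpha) and q(theta_kj) = Beta(a_kj, b_kj).
        nk = [sum(resp[i][k] for i in range(n)) for k in K]
        alpha = [alpha0 + nk[k] for k in K]
        pos = [[sum(resp[i][k] * X[i][j] for i in range(n)) for j in range(m)]
               for k in K]
        a = [[a0 + pos[k][j] for j in range(m)] for k in K]
        b = [[b0 + nk[k] - pos[k][j] for j in range(m)] for k in K]

        # Expected log parameters under the current q.
        psi_sum = digamma(sum(alpha))
        e_log_pi = [digamma(alpha[k]) - psi_sum for k in K]
        e_log_t = [[digamma(a[k][j]) - digamma(a[k][j] + b[k][j])
                    for j in range(m)] for k in K]
        e_log_1mt = [[digamma(b[k][j]) - digamma(a[k][j] + b[k][j])
                      for j in range(m)] for k in K]

        # Update q(z_i): softmax of expected log joint, computed stably.
        for i in range(n):
            logits = [e_log_pi[k] + sum(e_log_t[k][j] if X[i][j] else e_log_1mt[k][j]
                                        for j in range(m)) for k in K]
            top = max(logits)
            w = [math.exp(v - top) for v in logits]
            s = sum(w)
            resp[i] = [v / s for v in w]
    return resp, alpha, a, b


def simulate(n=200, n_items=5, rare_share=0.2, seed=1):
    """Hypothetical synthetic data: a rare class with mostly positive items."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < rare_share else 0 for _ in range(n)]
    X = [[1 if rng.random() < (0.9 if z else 0.1) else 0 for _ in range(n_items)]
         for z in labels]
    return X, labels


if __name__ == "__main__":
    X, labels = simulate()
    resp, alpha, a, b = fit_lca_cavi(X, n_classes=2)
    weights = [v / sum(alpha) for v in alpha]
    print("posterior mean class weights:", weights)
```

Class labels are arbitrary, so a fitted class should be matched to a clinical phenotype by inspecting its item probabilities rather than its index. Rerunning `fit_lca_cavi` with different `seed` values is a simple way to probe the sensitivity to starting conditions that the paper reports for automatic VB, although this closed-form variant may behave differently from gradient-based software.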