Abstract: Statistical modelling of covariate distributions makes it possible to generate virtual populations or to impute missing values in a covariate dataset. Covariate distributions typically have non-Gaussian margins and show nonlinear correlation structures, which simple multivariate Gaussian distributions fail to represent. Prominent non-Gaussian frameworks for covariate distribution modelling are copula-based models and models based on multiple imputation by chained equations (MICE). While both frameworks have already found applications in the life sciences, a systematic investigation of their goodness-of-fit to the underlying theoretical distribution, indicating strengths and weaknesses under different conditions, is still lacking. To bridge this gap, we thoroughly evaluated covariate distribution models in terms of Kullback-Leibler divergence (KL-D), a scale-invariant information-theoretic goodness-of-fit criterion for distributions. Methodologically, we proposed a new approach to construct confidence intervals for KL-D by combining nearest neighbour-based KL-D estimators with subsampling-based uncertainty quantification. In relevant datasets of different sizes and dimensionalities with both continuous and discrete covariates, non-Gaussian models showed consistent improvements in KL-D compared to simpler Gaussian or scale transform approximations. KL-D estimates were also robust to the inclusion of latent variables and large fractions of missing values. Copula-based models generalized well to new data, whereas MICE showed a tendency to overfit, so its performance should always be evaluated on separate test data. Parametric copula models and MICE were found to scale much better with the dataset dimension than nonparametric copula models. These findings corroborate the potential of non-Gaussian models for modelling realistic life science covariate distributions.
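
To make the methodological idea concrete, the following is a minimal sketch of how a nearest neighbour-based KL-D estimate can be combined with subsampling to obtain a confidence interval. It is not the authors' implementation. It uses a standard k-nearest-neighbour form of the estimator, brute-force distance search to stay dependency-free, and an assumed square-root convergence rate for the subsampling step. The function names (`kth_nn_dist`, `knn_kl_divergence`, `subsampling_ci`), the parameters, and the choice to subsample both samples in the same proportion are all hypothetical and chosen for illustration only.

```python
import heapq
import math
import random


def kth_nn_dist(point, others, k):
    """Euclidean distance from `point` to its k-th nearest neighbour in `others`."""
    return heapq.nsmallest(k, (math.dist(point, o) for o in others))[-1]


def knn_kl_divergence(x, y, k=1):
    """k-NN estimate of KL(P || Q) from samples x ~ P and y ~ Q (lists of tuples).

    Assumes continuous data without ties: a zero distance would break the log.
    """
    n, m, d = len(x), len(y), len(x[0])
    total = 0.0
    for i, xi in enumerate(x):
        # Distance to the k-th neighbour within x, excluding the point itself.
        rho = kth_nn_dist(xi, (xj for j, xj in enumerate(x) if j != i), k)
        # Distance to the k-th neighbour in the other sample.
        nu = kth_nn_dist(xi, y, k)
        total += math.log(nu / rho)
    return d / n * total + math.log(m / (n - 1))


def subsampling_ci(x, y, b, n_sub=200, alpha=0.05, k=1, seed=0):
    """Subsampling confidence interval for the k-NN KL-D estimate.

    ASSUMPTION: the estimator converges at rate sqrt(n); this is an
    illustrative choice, not a property claimed by the source.
    """
    rng = random.Random(seed)
    n, m = len(x), len(y)
    b_y = max(k, round(m * b / n))  # subsample y in the same proportion as x
    theta_n = knn_kl_divergence(x, y, k)
    # Rescaled deviations of subsample estimates (drawn without replacement)
    # from the full-sample estimate.
    roots = sorted(
        math.sqrt(b)
        * (knn_kl_divergence(rng.sample(x, b), rng.sample(y, b_y), k) - theta_n)
        for _ in range(n_sub)
    )
    lo_q = roots[int(math.floor(alpha / 2 * n_sub))]
    hi_q = roots[min(n_sub - 1, int(math.ceil((1 - alpha / 2) * n_sub)) - 1)]
    # Invert the empirical quantiles of the rescaled deviations.
    return theta_n - hi_q / math.sqrt(n), theta_n - lo_q / math.sqrt(n)
```

Calling `subsampling_ci(x, y, b=60)` on two lists of equal-length tuples returns a lower and an upper bound. The subsample size `b` must be larger than `k`, and for datasets of realistic size a spatial index such as a k-d tree would replace the brute-force search used here.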

Information-theoretic evaluation of covariate distributions models
