Abstract: Variable selection is of crucial significance in QSAR modeling since it increases the model's predictive ability and reduces noise. The selection of the right variables is far more complicated than the development of predictive models. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and the least absolute shrinkage and selection operator (LASSO). Variable selection was performed (1) by using recursive random forests to rule out a quarter of the least important descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty λ for each data set. Along with regular statistical parameters of model performance, we proposed the highest pairwise correlation rate, the average pairwise Pearson's correlation coefficient, and the Tanimoto coefficient to evaluate the optimal variables selected by RF and LASSO in an extensive way. Results showed that variable selection could allow a tremendous reduction of noisy descriptors (at most 96% with the RF method in this study) and apparently enhance the model's predictive performance as well. Furthermore, random forests showed the property of gathering important predictors without restricting their pairwise correlation, which is contrary to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are highly related to response endpoints and thus undermine the model's predictive performance. The optimal variables selected by RF share low similarity with those selected by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven out of eight data sets). We found that the differences between the predictive performances of RF and LASSO mainly resulted from the variables selected by the different strategies rather than from the learning algorithms. Our study showed that the right selection of variables is more important than the learning algorithm for modeling.
We hope that a standard procedure could be developed based on these proposed statistical metrics to select the truly important variables for model interpretation, as well as for further use to facilitate drug discovery and environmental toxicity assessment.
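
As a rough illustration of two of the metrics named in the abstract and of the "drop a quarter per iteration" schedule, the following is a minimal sketch and not the authors' implementation. All function names are hypothetical. It assumes that the Tanimoto coefficient is computed on the sets of selected variable names, that the average pairwise correlation uses absolute Pearson values, and that the number of descriptors dropped per iteration is rounded down with at least one removed. The abstract does not define the "highest pairwise correlation rate", so it is not sketched here.

```python
from itertools import combinations
from math import sqrt


def tanimoto(selected_a, selected_b):
    """Tanimoto (Jaccard) similarity between two sets of selected variable names."""
    a, b = set(selected_a), set(selected_b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty selections are treated as identical
    return len(a & b) / len(union)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant columns)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sd_x * sd_y)


def average_pairwise_correlation(columns):
    """Mean |r| over all pairs of descriptor columns.

    Taking the absolute value is an assumption; the abstract does not say
    whether signed or absolute correlations are averaged.
    """
    pairs = list(combinations(columns, 2))
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)


def elimination_schedule(n_descriptors, drop_fraction=0.25, min_keep=1):
    """Descriptor counts per iteration when a fixed fraction is removed each time.

    The rounding rule (round down, remove at least one) is an assumption.
    """
    sizes = [n_descriptors]
    while sizes[-1] > min_keep:
        n = sizes[-1]
        n_drop = max(1, int(n * drop_fraction))
        sizes.append(max(min_keep, n - n_drop))
    return sizes


if __name__ == "__main__":
    rf_selected = {"logP", "MW", "TPSA"}      # hypothetical descriptor names
    lasso_selected = {"MW", "TPSA", "nHBD"}   # hypothetical descriptor names
    print(tanimoto(rf_selected, lasso_selected))
    print(elimination_schedule(16))
```

In an actual workflow, each step of the schedule would refit a random forest on the remaining descriptors and discard those with the lowest importance; that refitting step is omitted here to keep the sketch free of external dependencies.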

Why significant variables aren’t automatically good predictors

Penalized Independence Rule for Testing High-Dimensional Hypotheses

Repeated Sieving for Prediction Model Building with High-Dimensional Data

Regression with Highly Correlated Predictors: Variable Omission Is Not the Solution

Instability of Variable-selection Algorithms Used to Identify True Predictors of an Outcome in Intermediate-dimension Epidemiologic Studies

Statistical significance of variables driving systematic variation

Variable importance analysis with interpretable machine learning for fair risk prediction

Generalized Permutation Framework for Testing Model Variable Significance

Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression

On the minimum strength of (unobserved) covariates to overturn an insignificant result

Residuals and Regression Diagnostics: Focusing on Logistic Regression

Variable-Selection Emerges on Top in Empirical Comparison of Whole-Genome Complex-Trait Prediction Methods

High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Is Seeing Believing? A Practitioner's Perspective on High-Dimensional Statistical Inference in Cancer Genomics Studies

A Computational Exploration of Emerging Methods of Variable Importance Estimation

Challenges in Variable Importance Ranking Under Correlation

Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models

Testing Predictor Significance with Ultra High Dimensional Multivariate Responses

Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO

Direct causal variable discovery leveraging the invariance principle: application in biomedical studies