Ensemble partial least squares regression for descriptor selection, outlier detection, applicability domain assessment, and ensemble modeling in QSAR/QSPR modeling

Dong-Sheng Cao, Zhen-Ke Deng, Min-Feng Zhu, Zhi-Jiang Yao, Jie Dong, Rui-Gang Zhao
DOI: https://doi.org/10.1002/cem.2922
IF: 2.5
2017-01-01
Journal of Chemometrics
Abstract: In QSAR/QSPR modeling, building an accurate partial least squares (PLS) model usually involves dealing with descriptor selection, outlier detection, applicability domain assessment, nonlinear relationships, and model stability. In the present study, we present an ensemble PLS (EnPLS) method that addresses these modeling tasks within a unified methodological framework. EnPLS aims to provide a consistent algorithmic framework built on ensemble learning and statistical distributions. The approach exploits the fact that the distribution of PLS model coefficients provides a mechanism for ranking and interpreting the effects of variables, whereas the distribution of prediction errors provides a mechanism for differentiating outliers from normal samples and for assessing the applicability domain of models. The statistics of these distributions, namely the mean/median value and the standard deviation, provide a feasible way to describe the information contained in the original samples. Furthermore, ensemble modeling and prediction based on several cross-predictive PLS models can improve prediction performance and increase model stability to a certain extent. Aqueous solubility data are used to demonstrate the ability of the proposed EnPLS method to handle modeling tasks such as descriptor selection, outlier detection, applicability domain assessment, performance improvement, and model stability. Finally, a freely available R package implementing EnPLS has been developed to facilitate its use by chemists and pharmacologists. The R package is freely available at .
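The resampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' R package or its API: the function names (fit_pls1, predict, enpls_sketch), the synthetic data, the single-component PLS1 fit, and the importance score |mean| / standard deviation of each coefficient are all assumptions made here for illustration. Many PLS models are fitted on random subsets; the spread of each coefficient is used to rank descriptors, and the errors on left-out samples are used to flag a planted outlier.

```python
import random
import statistics

def fit_pls1(X, y):
    """Fit a one-component PLS1 model (hypothetical helper, not the EnPLS API)."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: X^T y, scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores t = X w, then regress y on t to get the inner coefficient q.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    coef = [wj * q for wj in w]
    return coef, x_means, y_mean

def predict(model, row):
    coef, x_means, y_mean = model
    return y_mean + sum(c * (x - m) for c, x, m in zip(coef, row, x_means))

def enpls_sketch(X, y, n_models=200, train_frac=0.7, seed=0):
    """Collect coefficient and left-out prediction-error distributions."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    coefs = [[] for _ in range(p)]   # one distribution per descriptor
    errors = [[] for _ in range(n)]  # one distribution per sample
    k = int(train_frac * n)
    for _ in range(n_models):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:k], idx[k:]
        model = fit_pls1([X[i] for i in train], [y[i] for i in train])
        for j in range(p):
            coefs[j].append(model[0][j])
        for i in test:  # errors are recorded only for left-out samples
            errors[i].append(y[i] - predict(model, X[i]))
    # Assumed importance score: large and stable coefficients rank highest.
    importance = [abs(statistics.mean(c)) / statistics.stdev(c) for c in coefs]
    err_mean = [statistics.mean(e) for e in errors]
    err_sd = [statistics.stdev(e) for e in errors]
    return importance, err_mean, err_sd

# Synthetic data (an assumption): only descriptor 0 drives the response,
# and sample 5 is given an artificially shifted response.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(60)]
y = [3.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]
y[5] += 6.0

importance, err_mean, err_sd = enpls_sketch(X, y)
top_descriptor = importance.index(max(importance))
suspect_sample = max(range(len(y)), key=lambda i: abs(err_mean[i]))
```

Here top_descriptor and suspect_sample point at the descriptor with the most stable non-zero coefficient and the sample with the largest mean left-out error. A sample with a small mean error but a large err_sd would instead suggest a prediction that is sensitive to the training subset, which is the kind of signal the abstract associates with applicability domain assessment.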