Combination of machine learning and COSMO-RS thermodynamic model in predicting solubility parameters of coformers in production of cocrystals for enhanced drug solubility

Wael A. Mahdi, Ahmad J. Obaidullah
DOI: https://doi.org/10.1016/j.chemolab.2024.105219
IF: 4.175
2024-08-30
Chemometrics and Intelligent Laboratory Systems
Abstract: In this study, we develop predictive models for three target variables, denoted δd, δp, and δh, using a dataset with 86 features and 181 samples. The response variables, which are the Hansen solubility parameters, were correlated with the input parameters via several machine learning techniques. The input features are molecular descriptors of coformers, calculated with the COSMO-RS thermodynamic model and a group contribution approach. The analysis includes outlier detection via Cook's distance, normalization with a min-max scaler, and feature selection through L1-based methods. Three regression models are employed: Gaussian Process Regression (GPR), Passive Aggressive Regression (PAR), and Polynomial Regression (PR), with hyperparameters optimized using Transient Search Optimization (TSO). For δd, the PAR model outperforms the others with an R² of 0.885, RMSE of 0.607, MAE of 0.524, and a maximum error of 1.294. The GPR model performs slightly worse on δd, with an R² of 0.872, RMSE of 0.816, MAE of 0.579, and a maximum error of 2.755. The PR model reaches an R² of 0.814, RMSE of 0.923, MAE of 0.597, and a maximum error of 2.814 on δd. For δp, the GPR model performs best, achieving an R² of 0.821, RMSE of 1.693, MAE of 1.391, and a maximum error of 3.457. The PAR model predicts δp with an R² of 0.740, RMSE of 2.025, MAE of 1.980, and a maximum error of 6.609, and the PR model with an R² of 0.7, RMSE of 2.329, MAE of 2.02, and a maximum error of 6.366. For δh, the GPR model again performs best, with an R² of 0.983, RMSE of 1.243, MAE of 1.005, and a maximum error of 2.577. The PAR model predicts δh with an R² of 0.924, RMSE of 2.713, MAE of 2.416, and a maximum error of 6.307, and the PR model with an R² of 0.927, RMSE of 2.757, MAE of 2.334, and a maximum error of 8.064. These results indicate that the chosen models and the TSO-based tuning predict the three Hansen solubility parameters accurately, and they suggest potential for application in related predictive modeling tasks.
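The abstract reports four error measures (R², RMSE, MAE, maximum error) and mentions min-max normalization without defining them. The following is a minimal sketch of their standard textbook definitions, not the authors' code. The function names `min_max_scale` and `regression_metrics` and the sample values are hypothetical and chosen only for illustration. The abstract does not say which library or data split the authors used.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in column]

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and maximum absolute error for paired values.

    Assumes y_true is not constant (otherwise R^2 is undefined).
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "max_error": max(abs(r) for r in residuals),
    }

# Hypothetical values for illustration only (not data from the paper).
y_true = [16.0, 17.0, 18.0, 19.0]
y_pred = [16.5, 17.0, 17.5, 19.0]
metrics = regression_metrics(y_true, y_pred)
scaled = min_max_scale([2.0, 4.0, 6.0])
```

RMSE penalizes large deviations more than MAE does, and the maximum error reflects only the single worst prediction. Because the three measures respond differently to large residuals, two models can rank differently depending on which measure is compared.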
automation & control systems; computer science, artificial intelligence; instruments & instrumentation; statistics & probability; mathematics, interdisciplinary applications; chemistry, analytical