Will we ever be able to accurately predict solubility?

P. Llompart, C. Minoletti, S. Baybekov, D. Horvath, G. Marcou, A. Varnek
DOI: https://doi.org/10.1038/s41597-024-03105-6
2024-03-19
Scientific Data
Abstract: Accurate prediction of thermodynamic solubility by machine learning remains a challenge. Recent models often display good performance, but their reliability may be deceiving when used prospectively. This study investigates the origins of these discrepancies, following three directions: a historical perspective, an analysis of the aqueous solubility dataverse, and data quality. We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets. We benchmarked recently published models on a novel curated solubility dataset and report poor performance. We also propose a workflow to curate aqueous solubility data, aiming at producing useful models for bench chemists. Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources. We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute, and data sources. The models obtained herein and the quality-assessed datasets are publicly available.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses the problem of accurately predicting the thermodynamic solubility of compounds. Although recent machine learning models report good performance, they can prove unreliable when applied prospectively to new compounds. The research focuses on the following directions:

1. **Historical Perspective**: Reviewing more than 20 years of published solubility datasets and models, highlighting overlooked datasets, and examining the overlaps between popular datasets.
2. **Data Analysis**: Benchmarking recently published models on a newly curated solubility dataset, on which they perform poorly.
3. **Data Quality**: Proposing a workflow for curating aqueous solubility data, aiming to produce models that are useful to bench chemists.
4. **Model Applicability**: Pointing out that some state-of-the-art models are not yet ready for public use because they lack a well-defined applicability domain and overlook historical data sources.

The study also analyzes factors that affect the practical utility of the models: interlaboratory standard deviation, the ionic state of the solute, and data sources. The overall goal is more reliable solubility prediction, supported by quality-assessed data and models that the authors make publicly available.
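Two of the quantities mentioned above, the interlaboratory standard deviation and the overlap between datasets, can be illustrated with a minimal sketch. This is not the authors' workflow: the records, compound keys, source labels, and function names below are all hypothetical, and the Jaccard index is only one assumed way of measuring overlap.

```python
from collections import defaultdict
from statistics import stdev

# Hypothetical records: (compound key, source label, logS in log10 mol/L).
# In practice the key would be a standardized structure identifier.
records = [
    ("compound_A", "set_1", -3.0),
    ("compound_A", "set_2", -3.4),
    ("compound_A", "set_3", -3.2),
    ("compound_B", "set_1", -1.0),
    ("compound_B", "set_2", -2.0),
    ("compound_C", "set_1", -5.1),
]


def interlab_sd(records):
    """Sample standard deviation of logS per compound across sources.

    Compounds with a single measurement are skipped, since a spread
    cannot be estimated from one value.
    """
    by_compound = defaultdict(list)
    for key, _source, log_s in records:
        by_compound[key].append(log_s)
    return {k: stdev(v) for k, v in by_compound.items() if len(v) >= 2}


def source_overlap(records, a, b):
    """Jaccard overlap between the compound sets of two sources."""
    in_a = {key for key, source, _ in records if source == a}
    in_b = {key for key, source, _ in records if source == b}
    union = in_a | in_b
    return len(in_a & in_b) / len(union) if union else 0.0


if __name__ == "__main__":
    for compound, sd in interlab_sd(records).items():
        print(f"{compound}: SD = {sd:.2f} log units")
    print(f"set_1 vs set_2 overlap: {source_overlap(records, 'set_1', 'set_2'):.2f}")
```

A large per-compound spread suggests that the reference values themselves disagree, which limits how accurate any model trained or evaluated on them can appear. A high overlap between two sets means that testing on one after training on the other is not an independent evaluation.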