Dimensionally reduced machine learning model for predicting single component octanol-water partition coefficients

David H Kenney, Randy C Paffenroth, Michael T Timko, Andrew R Teixeira
DOI: https://doi.org/10.1186/s13321-022-00660-1
2023-01-19
Abstract: MF-LOGP, a new method for determining single-component octanol-water partition coefficients ([Formula: see text]), is presented; it uses the molecular formula as its only input. Octanol-water partition coefficients are useful in many applications, ranging from environmental fate to drug delivery. Currently, partition coefficients are either measured experimentally or predicted as a function of structural fragments, topological descriptors, or thermodynamic properties that are known or calculated from precise molecular structures. The MF-LOGP method presented here differs from classical methods in that it requires no structural information and uses the molecular formula as the sole model input. MF-LOGP is therefore useful in situations where the structure is unknown or where a low-dimensional, easily automatable, and computationally inexpensive calculation is required. MF-LOGP is a random forest algorithm that is trained and tested on 15,377 data points, using 10 features derived from the molecular formula to make [Formula: see text] predictions. On an independent validation set of 2,713 data points, MF-LOGP achieved an average [Formula: see text] = 0.77 ± 0.007, [Formula: see text] = 0.52 ± 0.003, and [Formula: see text] = 0.83 ± 0.003. This performance falls within the range reported in the published literature for conventional higher-dimensional models ([Formula: see text] = 0.42-1.54, [Formula: see text] = 0.09-1.07, and [Formula: see text] = 0.32-0.95). Compared with existing models, MF-LOGP requires a maximum of ten features and no structural information, thereby providing a practical yet predictive tool. The development of MF-LOGP lays the groundwork for more physical prediction models that leverage big data analytical methods or address complex multicomponent mixtures.
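To illustrate what "features derived from the molecular formula" could look like in practice, the following is a minimal sketch rather than the authors' implementation. The abstract does not list the ten features, so the element set `ELEMENTS` and the helper `formula_to_features` are hypothetical choices made here for illustration; the parser also assumes flat formulas without parentheses, hydrates, or charges.

```python
import re
from collections import Counter

# Hypothetical feature set: the abstract does not name the ten features,
# so this choice of ten elements is an assumption for illustration only.
ELEMENTS = ["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"]

# An element symbol (one capital, optional lowercase letter) and an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_to_features(formula):
    """Map a flat molecular formula such as 'C8H18O' to a vector of element counts."""
    counts = Counter()
    for symbol, number in _TOKEN.findall(formula):
        # A missing count means a single atom of that element.
        counts[symbol] += int(number) if number else 1
    # Elements outside ELEMENTS are ignored; absent elements contribute 0.
    return [counts.get(element, 0) for element in ELEMENTS]


if __name__ == "__main__":
    for formula in ["C8H18O", "CHCl3", "C6H6"]:
        print(formula, formula_to_features(formula))
```

Vectors of this kind could then be passed to a random forest regressor (for example, scikit-learn's `RandomForestRegressor`) together with measured partition coefficients. The training data, hyperparameters, and exact features used by MF-LOGP are not given in the abstract and are not reproduced here.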