Molecular Fingerprints Optimization for Enhanced Predictive Modeling

Viktoriia Turkina, Melanie R.W. Messih, Etienne Kant, Jelle Gringhuis, Annemieke Petrignani, Garry Corthals, Jake W. O'Brien, Saer Samanipour
DOI: https://doi.org/10.26434/chemrxiv-2024-zr2vr
2024-02-26
Abstract: The human exposome is represented by a vast number of chemicals, the fate and behavior of which remain largely unexplored. While modeling approaches are commonly employed to address this challenge, there is a recognized need for alternative molecular representations, such as molecular fingerprints. However, existing algorithms for computing molecular fingerprints may incorporate irrelevant or insufficient information for accurate activity prediction. In this study, we present an algorithm designed to optimize molecular fingerprints. The algorithm combines the relevant bits of information, aiming to enrich the final fingerprint for predicting specific behavioral properties. To achieve this, relevant variables (i.e., bits) for prediction were collected from six non-hashed fingerprints and fused into a master fingerprint. We used fish toxicity as a proof of concept. A random forest regression (RFR) model was developed based on the master fingerprint. It demonstrated results comparable to conventional descriptor-based models, with $R^2 \approx 0.9$ for the training set and $R^2 \approx 0.6$ for the test set. Molecular fingerprints have the advantage of being consistent and interpretable. Consequently, we were able to confirm the relevance of the variables to the toxicity prediction. The final model outperformed each of the models based on individual fingerprints in the number of chemicals with a prediction error that fell within $\pm$ one standard deviation of the residuals. The number of cases with the lower prediction error was on average four times higher for the master fingerprint-based model. The algorithm developed for optimizing molecular fingerprints is universal and can be applied to various case studies.
Chemistry
What problem does this paper attempt to address?
The paper addresses how to optimize molecular fingerprints to improve predictive modeling of the fate and behavior of the many chemicals that make up the human exposome. Existing molecular fingerprint algorithms may include irrelevant information or omit relevant information, which limits the accuracy of activity predictions. The study proposes an algorithm that collects the relevant bits from six non-hashed molecular fingerprints and fuses them into a master fingerprint, with fish toxicity prediction as a proof of concept. The model built on the master fingerprint reached a coefficient of determination (R$^2$) of approximately 0.9 for the training set and approximately 0.6 for the test set, comparable to conventional descriptor-based models. It also outperformed the models based on each individual fingerprint: on average, about four times as many chemicals had a prediction error within one standard deviation of the residuals. The authors describe the algorithm as universal and applicable to other case studies, offering a tool for assessing the impact of chemicals on the environment and health.
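
The fusion step described above can be illustrated with a minimal sketch. The abstract does not state how bit relevance was determined, so the sketch substitutes a simple stand-in: a bit is kept when the absolute Pearson correlation between that bit and the target is at least a chosen threshold. The function names, the fingerprint names `fp_a` and `fp_b`, the threshold, and the toy data are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_relevant_bits(fp_matrix, target, threshold):
    """Return indices of bits whose |correlation| with the target meets the threshold.

    Assumption: correlation is only a stand-in relevance criterion for this sketch.
    """
    keep = []
    for j in range(len(fp_matrix[0])):
        column = [row[j] for row in fp_matrix]
        if abs(pearson(column, target)) >= threshold:
            keep.append(j)
    return keep


def fuse_fingerprints(fingerprints, target, threshold=0.5):
    """Concatenate the relevant bits of every fingerprint into one master fingerprint.

    fingerprints maps a fingerprint name to a matrix (rows: chemicals, columns: bits).
    Returns (labels, master), where labels records the origin of each master bit.
    """
    labels = []
    master = [[] for _ in target]
    for name, matrix in fingerprints.items():
        for j in select_relevant_bits(matrix, target, threshold):
            labels.append((name, j))
            for i, row in enumerate(matrix):
                master[i].append(row[j])
    return labels, master


# Toy data: four chemicals, two hypothetical non-hashed fingerprints.
toxicity = [1.0, 2.0, 3.0, 4.0]
fingerprints = {
    "fp_a": [[0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1]],
    "fp_b": [[1, 0], [1, 1], [1, 1], [1, 1]],
}

labels, master = fuse_fingerprints(fingerprints, toxicity, threshold=0.5)
print(labels)
print(master)
```

Because the fingerprints are non-hashed, each retained bit keeps a fixed meaning, so the `labels` list is what makes the fused representation traceable back to its source fingerprints. The master matrix would then serve as the input to a regression model such as the RFR model mentioned in the abstract.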