Machine learning-assisted search for novel coagulants: when machine learning can be efficient even if data availability is low

Andrij Rovenchak, Maksym Druchok
DOI: https://doi.org/10.1002/jcc.27292
2024-01-04
Abstract: Designing new drugs is a challenging process: a candidate molecule must satisfy multiple conditions to act properly while causing minimal side effects. An ideal candidate selectively attaches to and influences only its targets, leaving off-targets intact. The amount of experimental data on various properties of molecules grows constantly, which promotes data-driven approaches. However, the applicability of typical predictive machine learning techniques can be substantially limited by a lack of experimental data about a particular target. For example, there are many known Thrombin inhibitors (acting as anticoagulants), but only a very limited number of known Protein C inhibitors (coagulants). In this study, we present our approach to suggesting new inhibitor candidates by building an effective representation of chemical space. To this end, we developed a deep learning model, an autoencoder, trained on a large set of molecules in the SMILES format to map the chemical space. We then applied different sampling strategies to generate novel coagulant candidates. Symmetrically, we tested our approach on anticoagulant candidates, for which we were able to predict inhibition of Thrombin. We also compare our approach with MegaMolBART, another deep learning generative model that exploits similar principles of navigation in chemical space.
Biomolecules
What problem does this paper attempt to address?
The paper addresses the problem of finding new coagulants with machine learning when very little data is available. Many anticoagulants are known (for example, thrombin inhibitors), but very few coagulants (for example, protein C inhibitors), so typical predictive machine learning techniques, which depend on experimental data for the specific target, are of limited use. To overcome this, the researchers developed a deep learning autoencoder that maps the chemical space to a low-dimensional embedding space, and then applied different sampling strategies in that embedding space to generate new coagulant candidate molecules. They also tested the method on anticoagulant candidates and compared it with another deep learning generative model, MegaMolBART. The broader aim is an effective way to generate new drug candidates in data-scarce situations, with a particular focus on coagulants.
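As a rough illustration of what sampling in an embedding space can look like, the sketch below shows two common strategies: Gaussian perturbation around a known molecule's embedding, and linear interpolation between two embeddings. This is an assumption-laden sketch, not the authors' method: the text above does not say which sampling strategies the paper uses. The function names `perturb` and `interpolate`, the 4-dimensional vectors, and the noise scale are all hypothetical. The trained autoencoder is omitted entirely; in practice each sampled vector would be passed to a decoder to obtain a SMILES string.

```python
import random


def perturb(seed_vec, sigma, n_samples, rng):
    """Draw n_samples points around seed_vec using isotropic Gaussian noise."""
    return [
        [x + rng.gauss(0.0, sigma) for x in seed_vec]
        for _ in range(n_samples)
    ]


def interpolate(vec_a, vec_b, n_steps):
    """Return n_steps evenly spaced points from vec_a to vec_b, endpoints included."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    points = []
    for i in range(n_steps):
        t = i / (n_steps - 1)
        points.append([(1.0 - t) * a + t * b for a, b in zip(vec_a, vec_b)])
    return points


# Hypothetical embeddings standing in for two known inhibitors
# (a real model would produce these from SMILES via its encoder).
rng = random.Random(0)
known_a = [0.2, -1.0, 0.5, 0.0]
known_b = [1.0, 0.0, -0.5, 2.0]

# Strategy 1: explore the neighbourhood of one known molecule.
neighbours = perturb(known_a, sigma=0.1, n_samples=5, rng=rng)

# Strategy 2: walk along the straight line between two known molecules.
path = interpolate(known_a, known_b, n_steps=5)
```

Perturbation stays close to a single known active and therefore suits the case with very few known examples, while interpolation needs at least two. Either way, decoded candidates would still have to be checked for chemical validity and novelty.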