Melting point prediction of organic molecules by deciphering the chemical structure into a natural language

Weiming Mi, Huijun Chen, Donghua (Alan) Zhu, Tao Zhang, Feng Qian
DOI: https://doi.org/10.1039/d0cc07384a
IF: 4.9
2021-01-01
Chemical Communications
Abstract: Establishing quantitative structure-property relationships for the rational design of small-molecule drugs at the early discovery stage is highly desirable. Using natural language processing (NLP), we propose a machine learning model that processes the line notation of small organic molecules to predict their melting points. Prediction accuracy benefits from training on different canonicalized SMILES forms of the same molecules, and it improves when a combination of two different canonicalized SMILES forms is used. Unlike previous fragment-based or descriptor-based models, the prediction accuracy of this NLP-based model does not decrease with increasing size, complexity, and structural flexibility of the molecules. By representing chemical structure as a natural language, this NLP-based model offers a potential tool for quantitative structure-property prediction in drug discovery and development.
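To make the idea of reading a line notation as a sentence more concrete, the sketch below splits a SMILES string into tokens and pairs two different SMILES strings for the same molecule with one melting-point label. It is only an illustration under stated assumptions: the abstract does not say how the authors tokenize SMILES or generate their canonicalized forms, so the regular expression, the helper names `tokenize` and `build_examples`, the two hand-written aspirin strings, and the label value are all hypothetical.

```python
import re

# Assumed tokenization scheme (not taken from the paper): bracket atoms,
# two-letter halogens, organic-subset atoms, aromatic atoms, ring-closure
# digits, and bond/branch symbols each become one token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#$/\\+\-().:~*@])"
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens ("words")."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Reject input containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_examples(smiles_forms, melting_point):
    """Pair every SMILES form of one molecule with the same label."""
    return [(tokenize(s), melting_point) for s in smiles_forms]


# Two equivalent, hand-written SMILES strings for aspirin. In practice the
# different forms would come from a cheminformatics toolkit.
aspirin_forms = ["CC(=O)Oc1ccccc1C(=O)O", "O=C(C)Oc1ccccc1C(=O)O"]

# Placeholder label for illustration only, not a value from the paper.
examples = build_examples(aspirin_forms, melting_point=135.0)

for tokens, label in examples:
    print(label, " ".join(tokens))
```

Each molecule then contributes one token sequence per SMILES form, all sharing a single target value, which is one simple way to read "training upon different canonicalized SMILES forms of the same molecules".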