LigEGFR: Spatial graph embedding and molecular descriptors assisted bioactivity prediction of ligand molecules for epidermal growth factor receptor on a cell line-based dataset

Puri Virakarin, Natthakan Saengnil, Bundit Boonyarit, Jiramet Kinchagawat, Rattasat Laotaew, Treephop Saeteng, Thanasan Nilsu, Naravut Suvannang, Thanyada Rungrotmongkol, Sarana Nutanong
DOI: https://doi.org/10.1101/2020.12.24.423424
2020-12-24
Abstract:

Motivation: Lung cancer is a chronic non-communicable disease and is the cancer with the world's highest incidence in the 21st century. One of the leading mechanisms underlying the development of lung cancer in nonsmokers is an amplification of the epidermal growth factor receptor (EGFR) gene. However, laboratories employing conventional processes of drug discovery and development for such targets encounter several costly and time-consuming pain points. Moreover, high failure rates are caused by efficacy and safety problems during research and development. Therefore, it is imperative to develop improved methods for drug discovery. Herein, we developed LigEGFR, a deep learning model based on spatial graph embedding and molecular descriptors, for predicting pIC50 potency estimates of small molecules and classifying hit compounds against the human epidermal growth factor receptor. The model was generated with a large-scale cell line-based dataset containing broad lists of chemical features.

Results: LigEGFR outperformed baseline machine learning models for predicting pIC50. Our model was notable for higher performance in hit compound classification, compared to molecular docking and machine learning approaches. The proposed predictive model provides a powerful strategy that potentially helps researchers overcome major challenges in drug discovery and development processes, reducing failures in discovering novel hit compounds.

Availability: We provide an online prediction platform and the source code, freely available at https://ligegfr.vistec.ist and https://github.com/scads-biochem/LigEGFR, respectively.

Key points:
LigEGFR is a regression model for predicting pIC50 that was developed for the human EGFR target. It can also be applied to hit compound classification (pIC50 ≥ 6) and has a higher performance than baseline machine learning algorithms and molecular docking approaches.
Our spatial graph embedding and molecular descriptors based approach notably exhibited a high performance in predicting pIC50 of small molecules against human EGFR.
Non-hashed and hashed molecular descriptors gave the highest predictive performance when used in convolutional layers and fully connected layers, respectively.
Our model used a large-scale and non-redundant dataset to enhance the diversity of the small molecules.
The model showed robustness and reliability, which were evaluated by y-randomization and applicability domain analysis (ADAN), respectively.
We developed a user-friendly online platform to predict pIC50 of small molecules and classify the hit compounds for the drug discovery process of the EGFR target.
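
To make the hit criterion concrete, the following minimal sketch shows how a measured or predicted potency maps onto the pIC50 ≥ 6 cutoff mentioned in the key points. It relies only on the standard definition pIC50 = -log10(IC50 in mol/L), so a cutoff of 6 corresponds to an IC50 of 1 µM (1000 nM). The function names and the choice of nanomolar input units are assumptions made for illustration; they are not taken from the LigEGFR source code.

```python
import math

# pIC50 cutoff for a "hit" compound, as stated in the key points.
HIT_THRESHOLD = 6.0


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50.

    pIC50 = -log10(IC50 in mol/L); with 1 nM = 1e-9 mol/L this
    becomes 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


def is_hit(pic50: float, threshold: float = HIT_THRESHOLD) -> bool:
    """Label a compound as a hit when its pIC50 meets the cutoff."""
    return pic50 >= threshold


if __name__ == "__main__":
    # Hypothetical IC50 values in nM, for illustration only.
    for ic50_nm in (10.0, 1000.0, 50000.0):
        p = pic50_from_ic50_nm(ic50_nm)
        print(f"IC50 = {ic50_nm:g} nM, pIC50 = {p:.2f}, hit = {is_hit(p)}")
```

Because the model is a regressor, the same thresholding can be applied to its predicted pIC50 values to obtain the hit/non-hit labels used in the classification comparison.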