TCMBank: Bridges Between the Largest Herbal Medicines, Chemical Ingredients, Target Proteins, and Associated Diseases with Intelligence Text Mining

Qiujie Lv, Guanxing Chen, Haohuai He, Ziduo Yang, Lu Zhao, Hsin-Yi Chen, Calvin Yu-Chian Chen
DOI: https://doi.org/10.1039/d3sc02139d
IF: 8.4
2023-08-10
Chemical Science
Abstract: Traditional Chinese medicine (TCM) has long been viewed as precious sources of modern drug discovery. AI-assisted drug discovery (AIDD) has been investigated extensively. However, there are still two challenges in applying AIDD to guide TCM drug discovery: the lack of a large amount of standardized TCM-related information and AIDD is prone to pathological failures in out-of-domain data. We have released TCM Database@Taiwan in 2011, and it has been widely disseminated and used. Now, we developed TCMBank, the largest systematic free TCM database, which is an extension of TCM Database@Taiwan. TCMBank contains 9192 herbs, 61,966 ingredients (unduplicated), 15,179 targets, 32,529 diseases, and their pairwise relationships. By integrating multiple data sources, TCMBank provides 3D structure information of ingredients, and provides standard list and detailed information of herbs, ingredients, targets and diseases. TCMBank has an intelligent document identification module that continuously adds TCM-related information retrieved from literature in PubChem. In addition, driven by TCMBank big data, we developed an ensemble learning-based drug discovery protocol for identifying potential lead and drug repurposing. We take colorectal cancer and Alzheimer's disease as examples to demonstrate how to accelerate drug discovery by artificial intelligence. Using TCMBank, researchers can view literature-driven relationship mapping between herbs/ingredients and genes/diseases, allowing understanding of molecular action mechanisms for ingredients and identification of new potentially effective treatments. TCMBank is available at https://TCMBank.CN/.
chemistry, multidisciplinary
What problem does this paper attempt to address?
This paper attempts to solve two main problems:

1. **Lack of standardized Traditional Chinese Medicine (TCM)-related information**: A major challenge in TCM research and modern drug development is the shortage of standardized TCM information, such as the active ingredients of herbs and the associations between ingredients and target proteins. This information is scattered across books and journals and is hard to collect comprehensively, making it difficult for researchers to obtain complete data on ingredients and their mechanisms of action.
2. **Pathological failures of AI-assisted drug discovery (AIDD) on out-of-domain data**: Existing AIDD methods are prone to systematic errors on out-of-domain data, and most lack wet-lab experimental validation. A single model may be overly sensitive to, or dependent on, certain data points, so it generalizes poorly to new data.

### Solutions

To address these problems, the research team developed **TCMBank**, a free, systematic TCM database that provides standardized information on herbs, ingredients, targets, diseases, and their interrelationships. Specifically:

- **Features of TCMBank**:
  - It contains 9,192 herbs, 61,966 unique (deduplicated) ingredients, 15,179 targets, 32,529 diseases, and their pairwise relationships.
  - It provides 3D structure information for ingredients, which facilitates virtual screening and molecular simulation.
  - The Intelligent Document Identification Module (IDIM) regularly downloads the latest literature from PubChem and extracts TCM-related information using techniques such as Natural Language Processing (NLP) and Optical Character Recognition (OCR), keeping the database continuously updated.
- **Drug discovery framework based on ensemble learning**:
  - An Ensemble Learning (EL) framework improves the efficiency of virtual screening and identifies potential lead compounds and drug repurposing candidates by finding consensus among prediction methods.
  - The steps are: molecular docking; a ligand-based EL model; a Hybrid Neural Network (HNN)-based EL model that predicts Drug-Target Affinity (DTA); and Molecular Dynamics (MD) simulations that evaluate the kinetic properties and interactions of protein-ligand complexes.

Through these measures, TCMBank provides rich, standardized TCM data and uses AI techniques to accelerate drug discovery, thereby promoting the modernization of traditional Chinese medicine.
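
The idea of "finding consensus among prediction methods" can be illustrated with a minimal sketch. It is not the authors' implementation: the compound names, the scores, and the two consensus rules (mean rank across methods, and membership in every method's top `top_k`) are assumptions chosen only to show how disagreeing predictors can be reconciled. The three method names stand in for the docking, ligand-based EL, and HNN-based DTA steps described above.

```python
from statistics import mean

# Hypothetical scores for four placeholder compounds; not taken from TCMBank.
EXAMPLE_SCORES = {
    "docking":   {"cpd_A": -9.1, "cpd_B": -7.4, "cpd_C": -8.3, "cpd_D": -6.0},
    "ligand_el": {"cpd_A": 0.91, "cpd_B": 0.40, "cpd_C": 0.75, "cpd_D": 0.70},
    "dta_el":    {"cpd_A": 7.8,  "cpd_B": 6.9,  "cpd_C": 7.2,  "cpd_D": 5.1},
}
# Assumption: docking energies improve as they decrease,
# while the two model scores improve as they increase.
LOWER_IS_BETTER = {"docking": True, "ligand_el": False, "dta_el": False}


def per_method_ranks(scores, lower_is_better):
    """Return {method: {compound: rank}}, where rank 1 is the best score."""
    ranks = {}
    for method, by_compound in scores.items():
        ordered = sorted(
            by_compound,
            key=by_compound.get,
            reverse=not lower_is_better[method],
        )
        ranks[method] = {c: i for i, c in enumerate(ordered, start=1)}
    return ranks


def consensus_order(scores, lower_is_better):
    """Order compounds by mean rank across all methods, best first."""
    ranks = per_method_ranks(scores, lower_is_better)
    compounds = next(iter(scores.values())).keys()
    mean_rank = {c: mean(r[c] for r in ranks.values()) for c in compounds}
    # Break ties on the compound name so the ordering is deterministic.
    return sorted(compounds, key=lambda c: (mean_rank[c], c))


def consensus_hits(scores, lower_is_better, top_k):
    """Keep only the compounds that every method places in its top_k."""
    ranks = per_method_ranks(scores, lower_is_better)
    compounds = next(iter(scores.values())).keys()
    return {c for c in compounds if all(r[c] <= top_k for r in ranks.values())}
```

Calling `consensus_order(EXAMPLE_SCORES, LOWER_IS_BETTER)` ranks the placeholder compounds by agreement across methods, and `consensus_hits(EXAMPLE_SCORES, LOWER_IS_BETTER, top_k=2)` keeps only those that no method ranks poorly. Working with ranks rather than raw scores sidesteps the fact that the methods report values on incomparable scales.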