HerbMet: Enhancing metabolomics data analysis for accurate identification of Chinese herbal medicines using deep learning

Yuyang Sha,Meiting Jiang,Gang Luo,Weiyu Meng,Xiaobing Zhai,Hongxin Pan,Junrong Li,Yan Yan,Yongkang Qiao,Wenzhi Yang,Kefeng Li
DOI: https://doi.org/10.1002/pca.3437
2024-08-21
Abstract:Introduction: Chinese herbal medicines have been utilized for thousands of years to prevent and treat diseases. Accurate identification is crucial since their medicinal effects vary between species and varieties. Metabolomics is a promising approach to distinguish herbs. However, current metabolomics data analysis and modeling in Chinese herbal medicines are limited by small sample sizes, high dimensionality, and overfitting. Objectives: This study aims to use metabolomics data to develop HerbMet, a high-performance artificial intelligence system for accurately identifying Chinese herbal medicines, particularly those from different species of the same genus. Methods: We propose HerbMet, an AI-based system for accurately identifying Chinese herbal medicines. HerbMet employs a 1D-ResNet architecture to extract discriminative features from input samples and uses a multilayer perceptron for classification. Additionally, we design the double dropout regularization module to alleviate overfitting and improve model's performance. Results: Compared to 10 commonly used machine learning and deep learning methods, HerbMet achieves superior accuracy and robustness, with an accuracy of 0.9571 and an F1-score of 0.9542 for distinguishing seven similar Panax ginseng species. After feature selection by 25 different feature ranking techniques in combination with prior knowledge, we obtained 100% accuracy and an F1-score for discriminating P. ginseng species. Furthermore, HerbMet exhibits acceptable inference speed and computational costs compared to existing approaches on both CPU and GPU. Conclusions: HerbMet surpasses existing solutions for identifying Chinese herbal medicines species. It is simple to use in real-world scenarios, eliminating the need for feature ranking and selection in classical machine learning-based methods.
What problem does this paper attempt to address?