Predicting the Pathway Involvement of Metabolites in Both Pathway Categories and Individual Pathways

Erik D Huckvale, Hunter N.B. Moseley
DOI: https://doi.org/10.1101/2024.08.07.607025
2024-08-09
Abstract: Metabolism is the network of chemical reactions that sustain cellular life. Parts of this metabolic network are defined as metabolic pathways containing specific biochemical reactions. Products and reactants of these reactions are called metabolites, which are associated with certain human-defined metabolic pathways. Metabolic knowledgebases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), contain metabolites, reactions, and pathway annotations; however, such resources are incomplete due to current limits of metabolic knowledge. To fill in missing metabolite pathway annotations, past machine learning models showed some success at predicting KEGG Level 2 pathway category involvement of metabolites based on their chemical structure. Here, we present the first machine learning model to predict metabolite association to more granular KEGG Level 3 metabolic pathways. We used a feature and dataset engineering approach to generate over one million metabolite-pathway entries in the dataset used to train a single binary classifier. This approach produced a mean Matthews correlation coefficient (MCC) of 0.806 +/- 0.017 SD across 100 cross-validation iterations. The 172 Level 3 pathways were predicted with an overall MCC of 0.726. Moreover, metabolite association with the 12 Level 2 pathway categories was predicted with an overall MCC of 0.891, representing significant transfer learning from the Level 3 pathway entries. These are the best metabolite-pathway prediction results published so far in the field.
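For readers unfamiliar with the metric, the sketch below shows how a Matthews correlation coefficient can be computed from binary confusion-matrix counts. It is an illustration only, not the authors' evaluation code: the function names, the toy labels, and the imbalanced counts are all hypothetical, and the convention of returning 0.0 when the denominator is zero is an assumption.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical toy labels: 1 = metabolite belongs to the pathway, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_mcc = matthews_corrcoef(*confusion_counts(y_true, y_pred))  # 7/15

# Hypothetical imbalanced counts: accuracy looks high while MCC stays low.
tp, tn, fp, fn = 10, 9890, 10, 90
imbalanced_accuracy = (tp + tn) / (tp + tn + fp + fn)
imbalanced_mcc = matthews_corrcoef(tp, tn, fp, fn)
```

The second example illustrates why MCC is often preferred over plain accuracy when negative entries far outnumber positive ones, since MCC uses all four cells of the confusion matrix.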
Systems Biology
What problem does this paper attempt to address?
The paper addresses the following issues:

1. **Predicting the association between metabolites and metabolic pathways**: The authors developed a machine learning model that predicts which metabolic pathways defined in the Kyoto Encyclopedia of Genes and Genomes (KEGG) a metabolite is associated with, including the more granular KEGG Level 3 pathways.
2. **Integrating predictions at different pathway levels**: Previous studies focused on predicting KEGG Level 2 pathway categories. This paper predicts both the Level 2 categories and the more specific Level 3 pathways, and reports that training on Level 2 and Level 3 entries together improves prediction performance.
3. **Dataset construction and feature engineering**: The authors adopted a novel dataset construction method that pairs metabolite features with pathway features through a "cross-linking" technique, generating a dataset of over 1 million metabolite-pathway entries. This greatly expands the data available for training and lets a single binary classifier handle every metabolite-pathway mapping.
4. **Model optimization and evaluation**: A multi-layer perceptron (MLP) was trained and evaluated through cross-validation. Overall performance is better when both Level 2 and Level 3 pathways are considered, with the improvement being particularly significant for Level 3 pathways.
5. **Model performance analysis**: Analysis of pathways of different sizes shows that the model performs better on larger pathways, and that filtering out smaller pathways further improves prediction performance on the larger ones.

In summary, the paper presents a new machine learning model that predicts the association between metabolites and metabolic pathways, with its main advance being the prediction of the more detailed KEGG Level 3 pathways. Its dataset construction, feature engineering, and model optimization improve prediction performance and provide a valuable tool for biomedical research.
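The dataset construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that every metabolite is paired with every pathway, that the two feature vectors are simply concatenated, and that the binary label records whether the pair is a known annotation. All identifiers and feature values (`met_A`, `path_X`, `build_pair_dataset`, and so on) are hypothetical.

```python
from itertools import product

# Hypothetical toy inputs; real features would be derived from chemical structure.
metabolite_features = {
    "met_A": [1.0, 0.0, 2.0],
    "met_B": [0.0, 3.0, 1.0],
}
pathway_features = {
    "path_X": [0.5, 0.5],
    "path_Y": [1.0, 0.0],
    "path_Z": [0.0, 1.0],
}
# Hypothetical known metabolite-pathway annotations (the positive entries).
known_annotations = {("met_A", "path_X"), ("met_B", "path_Z")}

def build_pair_dataset(met_feats, path_feats, annotations):
    """Pair every metabolite with every pathway (assumed Cartesian product).

    Each row is (metabolite_id, pathway_id, concatenated_features, label),
    where label is 1 for a known annotation and 0 otherwise.
    """
    rows = []
    for (met_id, met_vec), (path_id, path_vec) in product(
        met_feats.items(), path_feats.items()
    ):
        features = met_vec + path_vec  # list concatenation of the two vectors
        label = 1 if (met_id, path_id) in annotations else 0
        rows.append((met_id, path_id, features, label))
    return rows

dataset = build_pair_dataset(metabolite_features, pathway_features, known_annotations)
n_entries = len(dataset)                      # 2 metabolites x 3 pathways
n_positive = sum(row[3] for row in dataset)   # number of known annotations
```

Because the pathway is encoded in the input features rather than in the output layer, one binary classifier can be trained on all rows at once. Under the pairing assumed here, the number of entries grows as the product of the metabolite and pathway counts, which is consistent with a dataset of over one million entries, and most entries are negative.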