Predicting the Pathway Involvement of Metabolites Based on Combined Metabolite and Pathway Features

Huckvale, Moseley
DOI: https://doi.org/10.3390/metabo14050266
IF: 4.1
2024-05-07
Metabolites
Abstract: A major limitation of most metabolomics datasets is the sparsity of pathway annotations for detected metabolites. It is common for less than half of the identified metabolites in these datasets to have a known metabolic pathway involvement. Trying to address this limitation, machine learning models have been developed to predict the association of a metabolite with a "pathway category", as defined by a metabolic knowledge base like KEGG. Past models were implemented as a single binary classifier specific to a single pathway category, requiring a set of binary classifiers for generating the predictions for multiple pathway categories. This past approach multiplied the computational resources necessary for training while diluting the positive entries in the gold standard datasets needed for training. To address these limitations, we propose a generalization of the metabolic pathway prediction problem using a single binary classifier that accepts the features both representing a metabolite and representing a pathway category and then predicts whether the given metabolite is involved in the corresponding pathway category. We demonstrate that this metabolite–pathway features pair approach not only outperforms the combined performance of training separate binary classifiers but demonstrates an order of magnitude improvement in robustness: a Matthews correlation coefficient of 0.784 ± 0.013 versus 0.768 ± 0.154.
biochemistry & molecular biology
What problem does this paper attempt to address?
The paper addresses the sparsity of metabolic pathway annotations in metabolomics data. In a typical metabolomics dataset, fewer than half of the identified metabolites have a known pathway involvement. To work around this, researchers have built machine learning models that predict whether a metabolite is associated with a given "pathway category", as defined by a knowledge base such as KEGG. Earlier models were binary classifiers, each specific to one pathway category, so a separate classifier had to be trained for every category. That multiplied the computational resources needed for training and diluted the positive entries in the gold standard datasets, because each classifier saw only the positives for its own category.
To address these limitations, the paper proposes a single binary classifier that accepts features representing both a metabolite and a pathway category and predicts whether the given metabolite is involved in that pathway category. The key contributions of the paper include:
1. Proposing a generalized formulation of the metabolic pathway prediction problem in which one binary classifier handles pairs of metabolite features and pathway features.
2. Demonstrating that this metabolite-pathway feature pair approach outperforms the combined performance of separately trained binary classifiers and improves robustness by roughly an order of magnitude: the Matthews correlation coefficient rises from 0.768 ± 0.154 to 0.784 ± 0.013.
3. Using an improved benchmark dataset of 5683 metabolites with pathway annotations and chemical structure representations.
4. Exploring autoencoders for feature dimensionality reduction to further lower computational resource requirements.
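The pairing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the feature values, the metabolite and pathway names, the helper `build_pair_dataset`, and the choice to join the two feature vectors by simple concatenation are all hypothetical.

```python
from itertools import product

# Hypothetical toy features. In practice the metabolite features would be
# derived from chemical structure and the pathway features from pathway data.
metabolite_features = {
    "metabolite_A": [1.0, 0.0, 2.0],
    "metabolite_B": [0.0, 3.0, 1.0],
}
pathway_features = {
    "pathway_X": [0.5, 1.5],
    "pathway_Y": [2.0, 0.0],
}

# Hypothetical gold standard: known (metabolite, pathway category) associations.
known_pairs = {("metabolite_A", "pathway_X"), ("metabolite_B", "pathway_Y")}


def build_pair_dataset(metabolites, pathways, positives):
    """Build one row per (metabolite, pathway) pair for a single binary classifier.

    Assumption: the pair is represented by concatenating the two feature vectors.
    The label is 1 if the pair is a known association, otherwise 0.
    """
    keys, rows, labels = [], [], []
    for m, p in product(metabolites, pathways):
        keys.append((m, p))
        rows.append(metabolites[m] + pathways[p])  # list concatenation
        labels.append(1 if (m, p) in positives else 0)
    return keys, rows, labels


keys, X, y = build_pair_dataset(metabolite_features, pathway_features, known_pairs)
for key, row, label in zip(keys, X, y):
    print(key, row, label)
```

Because every pathway category contributes rows to the same table, the positive entries are pooled into one training set instead of being split across one dataset per category. Any binary classifier could then be trained on `X` and `y`.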
In summary, the study replaces a set of per-pathway classifiers with a single metabolite-pathway pair classifier, which improves both prediction performance and robustness.
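For readers unfamiliar with the Matthews correlation coefficient (MCC) used as the performance measure here, the following minimal sketch computes it from binary labels using the standard confusion-matrix formula. The function name `mcc` and the example labels are hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here, not something the paper specifies.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the formula is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels: one known association is missed by the classifier.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(round(mcc(y_true, y_pred), 3))
```

MCC uses all four cells of the confusion matrix, so it stays informative when positives are rare, which is the situation for sparse pathway annotations.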