Predicted meta-omics: a potential solution to multi-omics data scarcity in microbiome studies

Bianca Maria Cosma, Stephanie Pillay, David Calderon-Franco, Thomas Abeel
DOI: https://doi.org/10.1101/2024.11.04.621857
2024-11-04
Abstract: Imbalances in the gut microbiome have been linked to conditions such as inflammatory bowel disease, diabetes, and cancer. While metagenomics and amplicon sequencing are commonly used to study the microbiome, they do not capture all layers of microbial functions. Other meta-omics data can provide more insights, but these are more costly and laborious to procure. The growing availability of paired meta-omics data offers an opportunity to develop machine learning models that can infer connections between metagenomics data and other forms of meta-omics data, enabling the prediction of these other forms of meta-omics data from metagenomics. We evaluated several machine learning models for predicting meta-omics features from various meta-omics inputs. Simpler architectures such as elastic net regression and random forests generated reliable predictions of transcript and metabolite abundances, with correlations of up to 0.77 and 0.74, respectively, but predicting protein profiles was more challenging. We also identified a core set of well-predicted features for each meta-omics output type, and showed that multi-output regression neural networks performed similarly when trained using fewer output features. Lastly, our experiments demonstrated that predicted features can be used for the downstream task of inflammatory bowel disease classification, with performance comparable to that of experimental data.
Microbiology
What problem does this paper attempt to address?
This paper addresses the scarcity of multi-omics data in microbiome research. Specifically, the authors explore how existing metagenomics data can be used to predict other forms of meta-omics data, such as metatranscriptomics, metaproteomics and metabolomics. Addressing this scarcity is important for a more comprehensive understanding of the functional activities of microbial communities and their relationship with host health.

#### Main problems:

1. **Scarcity of multi-omics data**:
   - Although metagenomics data are easy to obtain and cost-effective, they reflect only the gene composition of microbial communities and cannot capture all information at the functional level.
   - Other meta-omics data (such as metatranscriptomics, metaproteomics and metabolomics) provide more in-depth functional information, but acquiring them is costly and experimentally complex.
2. **Possibility of predicting other meta-omics data**:
   - Predicting other meta-omics data from metagenomics data with machine-learning models can compensate for data scarcity and provide more functional information for subsequent research.
3. **Validating the practicality of predicted data**:
   - The authors evaluate not only the accuracy of the prediction models but also how the predicted data perform in practical applications, such as inflammatory bowel disease (IBD) classification.

#### Method overview:

- **Dataset selection**: Use multiple publicly available datasets, including IBDMDB (Inflammatory Bowel Disease Multi-omics Database), covering metagenomics, metatranscriptomics, metaproteomics and metabolomics data.
- **Feature filtering and transformation**: Filter sparse features and apply standard data transformations (such as the CLR transformation and the arcsine square-root transformation) to handle compositional data.
- **Model training and evaluation**: Benchmark multiple machine-learning models, including Elastic Net Regression, Random Forest, SparseNED and Deep NN, and evaluate their performance on different input-output combinations.
- **Downstream application**: Apply the predicted meta-omics data to IBD classification to verify their practical value.

#### Key findings:

- The Elastic Net Regression and Random Forest models perform well in predicting transcript and metabolite abundances, with correlations of up to 0.77 and 0.74, respectively.
- Predicting protein abundances is more difficult, with an average correlation coefficient of only about 0.4.
- The predicted meta-omics data achieve performance comparable to experimental data in IBD classification, indicating that the predicted data have practical value.

Through these studies, the authors demonstrate the potential of using machine-learning models to predict other meta-omics data from metagenomics data, providing new tools and methods for microbiome research.
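
The feature filtering and transformation step in the method overview can be illustrated with a small sketch. This is not the authors' code: the function names, the prevalence threshold of 0.5 and the pseudocount of 1.0 are assumptions chosen for illustration, and the toy count table is made up. It only shows what filtering sparse features, the centered log-ratio (CLR) transformation and the arcsine square-root transformation compute.

```python
import math


def filter_sparse(samples, min_prevalence):
    """Keep feature columns that are non-zero in at least `min_prevalence`
    (a fraction, hypothetical threshold) of the samples."""
    n_samples = len(samples)
    n_features = len(samples[0])
    keep = [
        j for j in range(n_features)
        if sum(1 for row in samples if row[j] > 0) / n_samples >= min_prevalence
    ]
    return [[row[j] for j in keep] for row in samples], keep


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample.

    Each log value is centered on the sample's mean log value, so the
    output sums to zero. The pseudocount (an assumption) avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def arcsine_sqrt(proportions):
    """Arcsine square-root transform of relative abundances in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in proportions]


# Toy count table: 3 samples (rows) x 3 features (columns); illustrative only.
samples = [[10, 0, 5], [20, 0, 1], [5, 0, 8]]
filtered, kept = filter_sparse(samples, min_prevalence=0.5)
transformed = [clr(row) for row in filtered]
```

In this toy table the second feature is zero in every sample, so it is dropped before the transformation. CLR is applied per sample to counts, whereas the arcsine square-root transformation expects proportions, so which one is appropriate depends on how each data type is represented.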