Microbial general model: Leveraging large language model for contextualized microbiome analysis
Haohong Zhang, Zixin Kang, Yuli Zhang, Lulu Song, Kang Ning, Ronghua Yang
DOI: https://doi.org/10.1101/2024.12.30.630825
2025-01-01
Abstract: Microbial communities significantly impact medicine, biotechnology, and agriculture. Advanced sequencing technologies have generated extensive microbiome data, enabling the discovery of substantial evolutionary and ecological patterns. However, traditional supervised learning methods struggle to capture universal patterns in microbial community data, largely because of the strong heterogeneity and profound batch effects among samples. This makes it difficult to classify samples and detect biomarkers across millions of samples, let alone to resolve the intricate but important dynamic patterns that arise in varied contexts. In this study, we propose MGM, a context-aware, attention-based foundation model pre-trained on a dataset of 263,302 microbiome samples (Microcorpus-260K) via language modeling. MGM demonstrated significant improvements in microbial community classification compared to traditional machine learning methods. Additionally, MGM enabled contextualized classification and effectively overcame cross-regional limitations, showing enhanced performance on intercontinental datasets through transfer learning. Furthermore, fine-tuning MGM on a longitudinal infant dataset revealed distinct keystone genera during development, with Bacteroides and Bifidobacterium exhibiting higher attention weights in vaginal deliveries, and Haemophilus in cesarean deliveries. Finally, through in silico modeling, the model also uncovered novel microbial dynamic patterns in a Crohn's disease cohort following antibiotic treatment. In conclusion, by leveraging self-attention and autoregressive pre-training, MGM serves as a versatile model for various downstream microbiome tasks and holds significant potential for achieving contextualized aims.
Biology