Deciphering enzymatic potential in metagenomic reads through DNA language model

Prabakaran R, Yana Bromberg
DOI: https://doi.org/10.1101/2024.12.10.627786
2024-12-11
Abstract: The microbial world plays a fundamental role in shaping Earth's biosphere, steering global processes such as carbon and nitrogen cycling, soil rejuvenation, and ecological fortification. An overwhelming majority of microbial entities, however, remain unstudied. Metagenomics stands to elucidate this microbial dark matter by directly sequencing the microbial community DNA from environmental samples. Yet, our ability to explore these metagenomic sequences is limited to establishing their similarity to curated datasets of organisms or genes/proteins. Aside from the difficulties in establishing such similarity, reference-based approaches, by definition, forgo discovery of any entities sufficiently unlike the reference collection. Presenting a paradigm shift, language model-based methods offer promising avenues for reference-free analysis of metagenomic reads. Here, we introduce two language models: a pretrained foundation model, REMME, aimed at understanding the DNA context of metagenomic reads, and the fine-tuned REBEAN model for predicting the enzymatic potential encoded within the read-corresponding genes. By emphasizing function over gene identification, REBEAN is able to label known functions carried both by previously explored genes and by new (orphan) sequences. Furthermore, even though it is not explicitly trained to do so, REBEAN identifies the functionally relevant parts of a gene. Our comprehensive analysis highlights our models' potential for metagenomic read annotation and unearthing of novel enzymes, thus enriching our understanding of microbial communities.
Bioinformatics