Abstract: Comprehensive protein function annotation is essential for understanding microbiome-related disease mechanisms in host organisms. However, a large portion of human gut microbial proteins lack functional annotation. Here, we have developed a new metagenome analysis workflow integrating de novo genome reconstruction, taxonomic profiling, and deep learning-based functional annotations from DeepFRI. This is the first approach to apply deep learning-based functional annotations in metagenomics. We validate DeepFRI functional annotations by comparing them to orthology-based annotations from eggNOG on a set of 1,070 infant metagenomes from the DIABIMMUNE cohort. Using this workflow, we generated a sequence catalogue of 1.9 million nonredundant microbial genes. The functional annotations revealed 70% concordance between Gene Ontology annotations predicted by DeepFRI and eggNOG. DeepFRI improved the annotation coverage, with 99% of the gene catalogue obtaining Gene Ontology molecular function annotations, although they are less specific than those from eggNOG. Additionally, we constructed pangenomes in a reference-free manner using high-quality metagenome-assembled genomes (MAGs) and analyzed the associated annotations. eggNOG annotated more genes in well-studied organisms, such as Escherichia coli, while DeepFRI was less sensitive to taxa. Further, we show that DeepFRI provides additional annotations in comparison to the previous DIABIMMUNE studies. This workflow will contribute to a new understanding of the functional signature of the human gut microbiome in health and disease and will guide future metagenomics studies.

IMPORTANCE The past decade has seen advances in high-throughput sequencing technologies, resulting in the rapid accumulation of genomic data from microbial communities. While this growth in sequence data and gene discovery is impressive, the majority of microbial gene functions remain uncharacterized. The coverage of functional information coming from either experimental sources or inferences is low. To address these challenges, we have developed a new workflow to computationally assemble microbial genomes and annotate the genes using a deep learning-based model, DeepFRI. This improved microbial gene annotation coverage to 1.9 million metagenome-assembled genes, representing 99% of the assembled genes, a significant improvement over the 12% Gene Ontology term annotation coverage achieved by commonly used orthology-based approaches. Importantly, the workflow supports pangenome reconstruction in a reference-free manner, allowing us to analyze the functional potential of individual bacterial species. We therefore propose this alternative approach, which combines deep learning-based functional predictions with the commonly used orthology-based annotations, as one that could help uncover novel functions observed in metagenomic microbiome studies.
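The abstract reports two summary statistics: annotation coverage over the gene catalogue and concordance between the two annotation sources. The sketch below illustrates one way such numbers could be computed. It is not the authors' pipeline. The abstract does not say how concordance was measured, so the definition used here (the share of genes annotated by both methods that have at least one Gene Ontology term in common) is an assumption. The function names and the toy gene identifiers are hypothetical, and the Gene Ontology hierarchy is ignored for simplicity.

```python
from typing import Dict, Set

# Gene identifier -> set of Gene Ontology term IDs assigned by one method.
Annotations = Dict[str, Set[str]]


def coverage(annotations: Annotations, catalogue: Set[str]) -> float:
    """Fraction of catalogue genes that received at least one GO term."""
    if not catalogue:
        return 0.0
    annotated = sum(1 for gene in catalogue if annotations.get(gene))
    return annotated / len(catalogue)


def concordance(first: Annotations, second: Annotations) -> float:
    """Assumed definition: among genes annotated by both methods,
    the fraction that share at least one identical GO term."""
    shared = [gene for gene in first if first[gene] and second.get(gene)]
    if not shared:
        return 0.0
    agreeing = sum(1 for gene in shared if first[gene] & second[gene])
    return agreeing / len(shared)


# Toy data, invented for illustration only (not taken from the study).
catalogue = {"gene1", "gene2", "gene3", "gene4"}
deep_learning_calls: Annotations = {
    "gene1": {"GO:0003677"},
    "gene2": {"GO:0016787"},
    "gene3": {"GO:0005524"},
    "gene4": {"GO:0003824"},
}
orthology_calls: Annotations = {
    "gene1": {"GO:0003677", "GO:0003700"},
    "gene2": {"GO:0016301"},
}

dl_coverage = coverage(deep_learning_calls, catalogue)
orth_coverage = coverage(orthology_calls, catalogue)
agreement = concordance(deep_learning_calls, orthology_calls)
print(f"coverage: {dl_coverage:.2f} vs {orth_coverage:.2f}, concordance: {agreement:.2f}")
```

A stricter comparison would propagate each term to its ancestors in the ontology before intersecting, so that a general term and a more specific descendant count as agreeing. That refinement is left out here to keep the sketch short.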

Human-in-the-loop approach to identify functionally important residues of proteins from literature

Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

FuncFetch: An LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts

Machine learning for discovering missing or wrong protein function annotations

Functional Site Discovery from Incomplete Training Data: A Case Study with Nucleic Acid–Binding Proteins

Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families

ProtNote: a multimodal method for protein-function annotation

Decoding proteome functional information in model organisms using protein language models

Piecing together the structure-function puzzle: experiences in structure-based functional annotation of hypothetical proteins

Evaluating large language models for annotating proteins

Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

Comprehensive Functional Annotation of Metagenomes and Microbial Genomes Using a Deep Learning-Based Method

Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge

Integrating curation into scientific publishing to train AI models

Functional profiling of the sequence stockpile: a review and assessment of prediction tools

Automated assembly of molecular mechanisms at scale from text mining and curated databases

Explainable protein function annotation using local structure embeddings

AnnoPRO: a strategy for protein function annotation based on multi-scale protein representation and a hybrid deep learning of dual-path encoding

Recursive Cleaning for Large-scale Protein Data via Multimodal Learning

An Expanded Evaluation of Protein Function Prediction Methods Shows an Improvement in Accuracy

Comprehensive assessment of protein loop modeling programs on large-scale datasets: prediction accuracy and efficiency