Gaia: A Context-Aware Sequence Search and Discovery Tool for Microbial Proteins

Nishant Jha, Joshua Kravitz, Jacob West-Roberts, Antonio Camargo, Simon Roux, Andre Cornman, Yunha Hwang
DOI: https://doi.org/10.1101/2024.11.19.624387
2024-11-21
Abstract: Protein sequence similarity search is fundamental to genomics research, but current methods typically cannot consider crucial genomic context information that can be indicative of protein function, especially in microbial systems. Here we present Gaia (Genomic AI Annotator), a sequence annotation platform that enables rapid, context-aware protein sequence search across genomic datasets. Gaia leverages gLM2, a mixed-modality genomic language model trained on both amino acid sequences and their genomic neighborhoods, to generate embeddings that integrate sequence-structure-context information. This approach allows for the identification of functionally related genes that are found in conserved genomic contexts, which may be missed by traditional sequence- or structure-based search alone. Gaia enables real-time search of a curated database comprising over 85M protein clusters (defined at 90% sequence identity) from 131,744 microbial genomes. We compare the sequence, structure and context sensitivity of gLM2 embedding-based search against existing tools such as MMseqs2 and Foldseek. We showcase Gaia-enabled discoveries of phage tail proteins and siderophore synthesis loci that were previously difficult to annotate with traditional tools. Gaia search is freely available at https://gaia.tatta.bio.
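To make the idea of embedding-based search concrete, the following is a minimal sketch of ranking database entries by similarity to a query embedding. It is not Gaia's implementation: the abstract does not state which similarity measure or index Gaia uses, so cosine similarity and brute-force ranking are assumptions here, and the names `cosine_similarity`, `top_k_hits`, `toy_database`, and `toy_query` are hypothetical. The three-dimensional vectors are made-up stand-ins for real gLM2 embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_hits(query, database, k=3):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data (assumption): invented vectors standing in for protein-cluster embeddings.
toy_database = {
    "cluster_A": [0.9, 0.1, 0.0],
    "cluster_B": [0.0, 1.0, 0.0],
    "cluster_C": [0.7, 0.2, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

hits = top_k_hits(toy_query, toy_database, k=2)
for name, score in hits:
    print(f"{name}\t{score:.3f}")
```

A brute-force scan like this would not scale to a database of tens of millions of clusters; real-time search at that size would presumably rely on an approximate nearest-neighbor index, which the abstract does not describe.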
Biology