Improvements in viral gene annotation using large language models and soft alignments

William L. Harrigan, Barbra D. Ferrell, K. Eric Wommack, Shawn W. Polson, Zachary D. Schreiber, Mahdi Belcaid
DOI: https://doi.org/10.1186/s12859-024-05779-6
IF: 3.307
2024-04-28
BMC Bioinformatics
Abstract: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology