LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language

Yong He, Pan Fang, Yongtao Shan, Yuanfei Pan, Yanhong Wei, Yichang Chen, Yihao Chen, Yi Liu, Zhenyu Zeng, Zhan Zhou, Feng Zhu, Edward C. Holmes, Jieping Ye, Jun Li, Yuelong Shu, Mang Shi, Zhaorong Li
DOI: https://doi.org/10.1101/2024.05.10.592927
2024-05-14
Abstract: In recent years, significant advancements have been observed in the domain of NLP with the introduction of pre-trained foundational models, paving the way for utilizing similar AI technologies to interpret the language of biology. In this research, we introduce "LucaOne", a novel pre-trained foundational model designed to integratively learn from the genetic and proteomic languages, encapsulating data from 169,861 species encompassing DNA, RNA, and proteins. This work illuminates the potential for creating a biological language model aimed at universal bioinformatics application. Remarkably, through few-shot learning, this model efficiently learns the central dogma of molecular biology and demonstrably outperforms competing models. Furthermore, in tasks requiring inputs of DNA, RNA, proteins, or a combination thereof, LucaOne exceeds the state-of-the-art performance using a streamlined downstream architecture, thereby providing empirical evidence and innovative perspectives on the potential of foundational models to comprehend complex biological systems.
Bioinformatics
What problem does this paper attempt to address?
This paper introduces "LucaOne", a pre-trained foundation model that learns nucleic acid (DNA and RNA) and protein sequences within a single unified biological language model, trained on data from 169,861 species. According to the authors, LucaOne learns the central dogma of molecular biology from only a small number of examples (few-shot learning) and exceeds state-of-the-art performance on tasks whose inputs are DNA, RNA, proteins, or a combination of these, while using a streamlined downstream architecture. The paper also explores how a Transformer architecture and multi-faceted computation strategies can process nucleic acid and protein data simultaneously to extract complex patterns and relationships. The authors present these experimental results as evidence that foundation models can offer new perspectives and tools for understanding complex biological systems.
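For readers unfamiliar with the central dogma mentioned above, the following minimal Python sketch shows the DNA → RNA → protein mapping that a unified nucleic acid and protein model would need to capture. It is not code from the paper: the function names `transcribe` and `translate` are hypothetical, the input is assumed to be a coding-strand DNA sequence read in frame from its first base, and the lookup uses the standard genetic code only.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Hypothetical helper: coding-strand DNA -> mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Hypothetical helper: mRNA -> protein, reading in frame from position 0
    and stopping at the first stop codon or at the last complete codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCCTAA"           # Met, Ala, stop
    mrna = transcribe(dna)
    print(mrna, translate(mrna))
```

The rule written out here is deterministic. The abstract's claim is that LucaOne picks up this relationship between nucleic acid and protein sequences from a small number of examples rather than having it supplied explicitly.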