LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language
Yong He, Pan Fang, Yongtao Shan, Yuanfei Pan, Yanhong Wei, Yichang Chen, Yihao Chen, Yi Liu, Zhenyu Zeng, Zhan Zhou, Feng Zhu, Edward C. Holmes, Jieping Ye, Jun Li, Yuelong Shu, Mang Shi, Zhaorong Li
DOI: https://doi.org/10.1101/2024.05.10.592927
2024-05-14
Abstract: In recent years, pre-trained foundation models have driven significant advances in natural language processing (NLP), paving the way for similar AI technologies to interpret the language of biology. Here we introduce "LucaOne", a novel pre-trained foundation model designed to learn jointly from the genetic and proteomic languages, drawing on DNA, RNA, and protein data from 169,861 species. This work illustrates the potential of a biological language model aimed at universal bioinformatics application. Remarkably, through few-shot learning, the model efficiently learns the central dogma of molecular biology and demonstrably outperforms competing models. Furthermore, in tasks that take DNA, RNA, or proteins as input, alone or in combination, LucaOne exceeds state-of-the-art performance using a streamlined downstream architecture, providing empirical evidence and new perspectives on the potential of foundation models to comprehend complex biological systems.
Bioinformatics