OmniNA: A foundation model for nucleotide sequences

Xilin Shen, Xiangchun Li
DOI: https://doi.org/10.1101/2024.01.14.575543
2024-01-15
Abstract: Foundation models have demonstrated exceptional efficacy across diverse downstream tasks. However, in genomics and transcriptomics, a notable gap persists in the availability of models that afford a comprehensive understanding of nucleotide sequence principles across species. Here, we present OmniNA, a generative foundation model designed for comprehensive nucleotide sequence learning. The model was pre-trained on 91.7 million nucleotide sequences and their corresponding annotations, encompassing 1076.2 billion bases and 197 million words and spanning a multitude of species. By analyzing the learned representations of the pre-trained model, we demonstrate that OmniNA gains the capacity to understand the semantics of both nucleotide sequences and textual annotations. OmniNA can be fine-tuned to align multiple nucleotide learning tasks with natural language paradigms. We demonstrate that OmniNA-1.7B surpasses or rivals state-of-the-art methods in 17 nucleotide tasks, encompassing nucleotide sequence detection and species classification. The model’s understanding of nucleotide grammars enhances its capability to reveal the effects of nucleotide sequence mutations on DNA and RNA processing. We hereby release the OmniNA-1.7B model as an open-source contribution to the research community. This foundation model signifies a step toward advancing our comprehension of nucleotide sequences across diverse species and holds substantial promise for facilitating genomics and transcriptomics research.
Bioinformatics
What problem does this paper attempt to address?
This paper introduces OmniNA, a generative foundation model for nucleotide sequences. It targets a gap in genomics and transcriptomics: the lack of models that capture the principles of nucleotide sequences across many species. OmniNA is pre-trained on 91.7 million nucleotide sequences and their textual annotations, totaling 1076.2 billion (about 1.08 trillion) bases and 197 million words. Analysis of the learned representations indicates that the model captures the semantics of both the sequences and their annotations. After fine-tuning, which frames nucleotide learning tasks in a natural-language paradigm, OmniNA-1.7B matches or exceeds state-of-the-art methods on 17 nucleotide tasks covering nucleotide sequence detection and species classification. The model can also reveal the effects of sequence mutations on DNA and RNA processing. The authors have released OmniNA-1.7B as an open-source model to support genomics and transcriptomics research.