GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences

Veniamin Fishman, Yuri Kuratov, Aleksei Shmelev, Maxim Petrov, Dmitry Penzar, Denis Shepelin, Nikolay Chekanov, Olga Kardymon, Mikhail Burtsev
DOI: https://doi.org/10.1101/2023.06.12.544594
2024-08-23
Abstract: Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA-LM, a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36,000 base pairs. Notably, integration of the newly-developed Recurrent Memory mechanism allows these models to process even larger DNA segments. We provide pre-trained versions of GENA-LM, demonstrating their capability for fine-tuning and addressing a spectrum of complex biological tasks with modest computational demands. While language models have already achieved significant breakthroughs in protein biology, GENA-LM showcases a similarly promising potential for reshaping the landscape of genomics and multi-omics data analysis. All models are publicly available on GitHub (https://github.com/AIRI-Institute/GENA_LM) and HuggingFace (https://huggingface.co/AIRI-Institute).
Bioinformatics
What problem does this paper attempt to address?
The paper addresses how to accurately decode genomic information in long DNA sequences, where the relevant context is scattered across thousands of nucleotides. It introduces GENA-LM, an open-source family of Transformer-based DNA language models that accept input sequences up to 36,000 base pairs long. With the Recurrent Memory mechanism, the models can process even longer DNA segments. The paper applies GENA-LM to a variety of complex biological tasks that require handling long-range dependencies, such as predicting promoter activity, splice sites, polyadenylation sites, enhancer annotation, and chromatin profiles.

### Main Contributions:

1. **Handling Long DNA Sequences**: GENA-LM accepts input sequences up to 36,000 base pairs long, a significant improvement over existing DNA language models.
2. **Recurrent Memory Mechanism**: Integrating the Recurrent Memory mechanism lets the models process DNA segments longer than the base input limit.
3. **Multi-species Models**: Pre-trained models are provided for multiple species and for specific taxa, demonstrating generalization across species.
4. **Performance Evaluation**: The models are evaluated on multiple genomic tasks, including predicting promoter activity, splice sites, polyadenylation sites, enhancer annotation, and chromatin profiles, and show superior performance on these tasks.
5. **Open-source Code**: All models are publicly released on GitHub and HuggingFace, and a user-friendly web service is provided for convenient DNA annotation.
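
The general idea behind processing inputs longer than a model's window, as in the second contribution, can be illustrated with a toy sketch: cut the sequence into segments and carry a small state from one segment to the next. This is only an illustration under stated assumptions, not the authors' Recurrent Memory implementation. The names `split_into_segments`, `process_with_memory`, and `gc_step` are hypothetical, and the "memory" here is a simple running GC count standing in for learned memory tokens.

```python
from typing import Callable, List, Tuple, TypeVar

M = TypeVar("M")  # type of the state carried between segments
O = TypeVar("O")  # type of the per-segment output


def split_into_segments(seq: str, segment_len: int) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [seq[i:i + segment_len] for i in range(0, len(seq), segment_len)]


def process_with_memory(
    seq: str,
    segment_len: int,
    step: Callable[[str, M], Tuple[O, M]],
    initial_memory: M,
) -> Tuple[List[O], M]:
    """Run `step` on each segment in order, feeding its memory forward."""
    memory = initial_memory
    outputs: List[O] = []
    for segment in split_into_segments(seq, segment_len):
        output, memory = step(segment, memory)
        outputs.append(output)
    return outputs, memory


def gc_step(segment: str, memory: Tuple[int, int]) -> Tuple[float, Tuple[int, int]]:
    """Toy stand-in for a model call (hypothetical).

    The memory is (G/C bases seen so far, total bases seen so far), so each
    output reflects every segment processed up to that point.
    """
    gc_seen, total_seen = memory
    gc_seen += sum(base in "GC" for base in segment)
    total_seen += len(segment)
    return gc_seen / total_seen, (gc_seen, total_seen)


if __name__ == "__main__":
    outputs, memory = process_with_memory("ATGCGCGTATATGC", 5, gc_step, (0, 0))
    print(outputs, memory)
```

In this sketch each segment's output depends on everything seen earlier only through the carried state, which is why the total input length is not bounded by the segment length.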

### Problems Solved:

- **Long-range Dependencies**: Traditional machine-learning methods are limited in how well they capture long-range dependencies. GENA-LM addresses this through the Transformer architecture and the Recurrent Memory mechanism.
- **Multi-species Applicability**: GENA-LM applies not only to the human genome but also to other species, such as yeast, Arabidopsis thaliana, and Drosophila.
- **Efficiency**: Through pre-training and fine-tuning, GENA-LM can handle complex biological tasks with limited computational resources.

### Conclusion:

GENA-LM performs well on long DNA sequences and complex biological tasks, and is expected to reshape the landscape of genomics and multi-omics data analysis.