dnaGrinder: a lightweight and high-capacity genomic foundation model

Qihang Zhao, Chi Zhang, Weixiong Zhang
2024-09-24
Abstract: The task of understanding and interpreting the complex information encoded within genomic sequences remains a grand challenge in biological research and clinical applications. In this context, recent advancements in large language model research have led to the development of both encoder-only and decoder-only foundation models designed to decode intricate information in DNA sequences. However, several issues persist, particularly regarding the efficient management of long-range dependencies inherent in genomic sequences, the effective representation of nucleotide variations, and the considerable computational costs associated with large model architectures and extensive pretraining datasets. Current genomic foundation models often face a critical tradeoff: smaller models with mediocre performance versus large models with improved performance. To address these challenges, we introduce dnaGrinder, a unique and efficient genomic foundation model. dnaGrinder excels at managing long-range dependencies within genomic sequences while minimizing computational costs without compromising performance. It achieves results that are not just comparable but often superior to leading DNA models such as Nucleotide Transformer and DNABERT-2. Furthermore, dnaGrinder is designed for easy fine-tuning on workstation-grade GPUs, accommodating input lengths exceeding 17,000 tokens. On a single high-performance GPU, it supports sequences longer than 140,000 tokens, making it a highly efficient and accessible tool for both basic biological research and clinical applications.
Genomics; Artificial Intelligence; Computational Engineering, Finance, and Science; Computation and Language
What problem does this paper attempt to address?
The main problem this paper attempts to address is the limitations of existing genomic foundation models: their restricted ability to handle long sequences, their large parameter counts, and their high computational cost during pre-training and fine-tuning. Specifically:

1. **Insufficient long sequence processing capability**: Existing genomic foundation models like DNABERT and DNABERT-2 have limitations in handling long sequences, typically only being able to process sequences of up to 512 or 128 tokens. This limits their ability to analyze longer genomic sequences in downstream tasks.
2. **Large number of model parameters**: Existing genomic foundation models (such as Nucleotide Transformer) have a large number of parameters, leading to high computational costs during pre-training and fine-tuning, requiring more computational resources and time.
3. **High computational cost**: The need to process large amounts of data with complex model structures during pre-training and fine-tuning results in high computational costs, which is especially limiting when fine-tuning on workstation-level GPUs.

To address these challenges, the paper introduces a new genomic foundation model, **dnaGrinder**. This model improves upon the shortcomings of existing models through the following methods:

- **Efficient long sequence processing**: dnaGrinder can be fine-tuned on workstation-grade GPUs with input sequences of over 17,000 tokens and supports sequences of over 140,000 tokens on a single high-performance GPU, significantly enhancing the ability to process long sequences.
- **Reduced computational cost**: Through optimized pre-training strategies and model architecture, dnaGrinder reduces the number of parameters and computational cost while maintaining high performance.
- **Improved pre-training dataset**: By increasing genomic diversity rather than simply repeating sequences, the model's generalization ability and learning effectiveness are enhanced.

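
To give a rough sense of why sequence length drives computational cost, the sketch below estimates the memory needed to store the attention score matrix of a single layer when attention is computed densely, which grows with the square of the sequence length. This is only an illustration under stated assumptions: the function name `attention_scores_gib`, the head count of 12, and the 2-byte (half-precision) values are hypothetical choices, not figures from the paper, and the paper's abstract does not say that dnaGrinder materializes this matrix at all.

```python
def attention_scores_gib(seq_len: int, num_heads: int = 12, bytes_per_value: int = 2) -> float:
    """Estimate memory (GiB) for one layer's dense attention score matrix, batch size 1.

    Assumptions (not taken from the paper): every head stores a full
    seq_len x seq_len matrix, 12 heads, 2 bytes per value (half precision).
    """
    return num_heads * seq_len * seq_len * bytes_per_value / 2**30


if __name__ == "__main__":
    # Sequence lengths mentioned in the text above.
    for n in (128, 512, 17_000, 140_000):
        print(f"{n:>7} tokens: {attention_scores_gib(n):.3f} GiB")
```

Under these assumptions, doubling the sequence length quadruples the estimate, which is why models limited to dense attention tend to cap inputs at a few hundred tokens. Memory-efficient attention implementations can avoid storing the full matrix, so the numbers printed here should be read as an upper-bound illustration rather than a measurement of any specific model.
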
In summary, dnaGrinder aims to provide an efficient, lightweight, and high-performance genomic foundation model suitable for fundamental biological research and clinical applications.