Abstract: The task of understanding and interpreting the complex information encoded within genomic sequences remains a grand challenge in biological research and clinical applications. In this context, recent advancements in large language model research have led to the development of both encoder-only and decoder-only foundation models designed to decode intricate information in DNA sequences. However, several issues persist, particularly regarding the efficient management of long-range dependencies inherent in genomic sequences, the effective representation of nucleotide variations, and the considerable computational costs associated with large model architectures and extensive pretraining datasets. Current genomic foundation models often face a critical tradeoff: smaller models with mediocre performance versus large models with improved performance. To address these challenges, we introduce dnaGrinder, a unique and efficient genomic foundation model. dnaGrinder excels at managing long-range dependencies within genomic sequences while minimizing computational costs without compromising performance. It achieves results that are not just comparable but often superior to leading DNA models such as Nucleotide Transformer and DNABERT-2. Furthermore, dnaGrinder is designed for easy fine-tuning on workstation-grade GPUs, accommodating input lengths exceeding 17,000 tokens. On a single high-performance GPU, it supports sequences longer than 140,000 tokens, making it a highly efficient and accessible tool for both basic biological research and clinical applications.

What problem does this paper attempt to address?

The main problem this paper attempts to address is the limitations of existing genomic foundation models in handling long sequences, the large number of model parameters, and the high computational cost during pre-training and fine-tuning. Specifically:

1. **Insufficient long sequence processing capability**: Existing genomic foundation models like DNABERT and DNABERT-2 have limitations in handling long sequences, typically only being able to process sequences of up to 512 or 128 tokens. This limits their ability to analyze longer genomic sequences in downstream tasks.
2. **Large number of model parameters**: Existing genomic foundation models (such as Nucleotide Transformer) have a large number of parameters, leading to high computational costs during pre-training and fine-tuning, requiring more computational resources and time.
3. **High computational cost**: The need to process large amounts of data and complex model structures during pre-training and fine-tuning results in high computational costs, especially when fine-tuning on workstation-level GPUs.

To address these challenges, the paper introduces a new genomic foundation model, **dnaGrinder**. This model improves upon the shortcomings of existing models through the following methods:

- **Efficient long sequence processing**: dnaGrinder can handle input sequences of over 17,000 tokens and supports sequences of over 140,000 tokens on a single high-performance GPU, significantly enhancing the ability to process long sequences.
- **Reduced computational cost**: Through optimized pre-training strategies and model architecture, dnaGrinder reduces the number of parameters and computational cost while maintaining high performance.
- **Improved pre-training dataset**: By increasing genomic diversity rather than simply repeating sequences, the model's generalization ability and learning effectiveness are enhanced.

In summary, dnaGrinder aims to provide an efficient, lightweight, and high-performance genomic foundation model suitable for fundamental biological research and clinical applications.
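
To give a rough sense of why the sequence lengths quoted above are demanding, the sketch below estimates the memory needed to store one dense attention score matrix at each length. It is an illustration only: it assumes standard dense self-attention with 2-byte (half-precision) values, and the helper name `attention_scores_bytes` is hypothetical. It says nothing about how dnaGrinder itself is implemented, since the summary does not describe the mechanism that makes these lengths feasible.

```python
def attention_scores_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed for one dense seq_len x seq_len attention score matrix.

    Assumption: a single head in a single layer, with scores stored in
    half precision (2 bytes per value). Real models multiply this by the
    number of heads and layers, unless they avoid materializing the matrix.
    """
    return seq_len * seq_len * bytes_per_value


# Sequence lengths mentioned in the summary: the 512-token limit cited for
# earlier models, and the 17,000 and 140,000 token figures for dnaGrinder.
for n in (512, 17_000, 140_000):
    gib = attention_scores_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:10.4f} GiB per head per layer")
```

Because the cost grows with the square of the sequence length, going from 512 to 140,000 tokens multiplies this naive estimate by a factor of roughly 75,000, which is why long-context genomic models generally cannot rely on a fully materialized attention matrix.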

dnaGrinder: a lightweight and high-capacity genomic foundation model
