Abstract: This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.

What problem does this paper attempt to address?

This paper aims to improve the analysis of DNA sequences by developing a foundation model named Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED). Specifically, ENBED aims to surpass existing approaches in the following respects:

1. **Enhanced sequence-to-sequence transformation ability**: ENBED adopts an encoder-decoder architecture, so it can perform sequence-to-sequence tasks that encoder-only or decoder-only models handle less naturally. This is especially relevant to processes such as the transcription of DNA into RNA and the subsequent translation of RNA into protein sequences.
2. **Improved precision and efficiency**: By using a sub-quadratic implementation of the attention mechanism (such as sliding-window attention and global attention), ENBED reduces computational complexity while maintaining model performance, and can therefore handle longer input and output sequences.
3. **Improved performance in downstream tasks**: ENBED is applied to multiple downstream tasks: identifying enhancers, promoters and splice sites; recognizing sequences containing base-call mismatches and insertion/deletion errors; identifying biological function annotations of genomic sequences; and generating mutations of the Influenza virus and validating them against real-world observations. In these tasks, ENBED shows significant improvement over the existing state-of-the-art results.
4. **Byte-level precision**: ENBED uses a byte-level tokenization scheme in which each token is a single nucleotide. Although this increases the computational cost, it improves the robustness of the model to DNA sequence changes and noise, especially when dealing with long repetitive sequences (such as telomeres).
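The sliding-window plus global attention pattern mentioned above can be sketched as a boolean mask. This is a minimal illustration, not ENBED's implementation: the function names, the window size, and the choice of position 0 as the global token are all assumptions made for the example. The dense mask is built only to make the pattern visible; an efficient implementation would never materialize it.

```python
def local_global_mask(seq_len, window, global_positions=()):
    """Build a seq_len x seq_len mask; True means query i may attend to key j.

    Hypothetical helper for illustration: `window` is the one-sided width of
    the sliding window, `global_positions` are tokens that attend everywhere
    and are attended to by every token.
    """
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_window = abs(i - j) <= window               # local band
            is_global = i in global_set or j in global_set  # global row/column
            row.append(in_window or is_global)
        mask.append(row)
    return mask


def count_allowed(mask):
    """Number of (query, key) pairs that attention actually computes."""
    return sum(sum(row) for row in mask)


# Assumed toy settings: 16 nucleotides, window of 2, one global token.
mask = local_global_mask(seq_len=16, window=2, global_positions=[0])
allowed = count_allowed(mask)
full = 16 * 16
print(f"attended pairs: {allowed} of {full}")
```

With a fixed window and a fixed number of global tokens, the count of attended pairs grows roughly linearly in the sequence length instead of quadratically, which is the source of the efficiency gain the summary describes.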
In conclusion, ENBED aims to improve the precision and efficiency of DNA sequence analysis by introducing advanced deep-learning techniques and innovative model architectures, thereby promoting research and development in the field of bioinformatics.
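The robustness argument for single-nucleotide tokens can be illustrated with a toy comparison. The sketch below is an assumption-laden example, not the paper's evaluation: the sequences are invented, the k-mer scheme is a simple non-overlapping 3-mer split, and token-level edit distance is used only as a rough proxy for how much an insertion disturbs the model's input.

```python
def nucleotide_tokens(seq):
    """One token per base (single-nucleotide tokenization)."""
    return list(seq)


def kmer_tokens(seq, k=3):
    """Non-overlapping k-mers; the final token may be shorter than k."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


# Invented example: a reference and a read with one inserted base.
reference = "ACGTTGCAAGCTTAGGCT"
read = reference[:6] + "A" + reference[6:]

nt_dist = edit_distance(nucleotide_tokens(reference), nucleotide_tokens(read))
kmer_dist = edit_distance(kmer_tokens(reference), kmer_tokens(read))
print("single-nucleotide token edits:", nt_dist)
print("3-mer token edits:", kmer_dist)
```

A single inserted base changes one single-nucleotide token, whereas it shifts the reading frame of every non-overlapping k-mer downstream of the insertion, so the k-mer token sequence differs in many more places.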

Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling

Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

Benchmarking DNA Foundation Models for Genomic Sequence Classification

dnaGrinder: a lightweight and high-capacity genomic foundation model

SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models

DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer

GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences

Embed-Search-Align: DNA Sequence Alignment using Transformer Models

ntEmbd: Deep learning embedding for nucleotide sequences

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Scalable DNA Feature Generation and Transcription Factor Binding Prediction via Deep Surrogate Models

Beyond DNA: ML-Empowered Nanopore Base-Calling of 12-Letter Genetic Alphabets

GENA-Web - GENomic Annotations Web Inference using DNA language models

Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Enhancing Personalized Gene Expression Prediction From DNA Sequences Using Genomic Foundation Models

Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA