Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision

Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A. Lanman, Vaneet Aggarwal
2024-02-14
Abstract: This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
Machine Learning, Genomics
What problem does this paper attempt to address?
This paper aims to improve the understanding and analysis of DNA sequences by developing a foundation model named Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED). Specifically, ENBED aims to surpass existing approaches in the following respects:
1. **Enhanced sequence-to-sequence transformation ability**: ENBED adopts an encoder-decoder architecture, so it can perform sequence-to-sequence tasks that encoder-only or decoder-only genomic models do not cover. This is especially relevant to processes such as the transcription of DNA into RNA and the subsequent translation into protein sequences.
2. **Improved efficiency**: By using a sub-quadratic implementation of the attention mechanism (such as sliding-window attention and global attention), ENBED reduces computational complexity while maintaining model performance, and can therefore handle longer input and output sequences.
3. **Improved performance in downstream tasks**: ENBED is applied to four downstream tasks: identifying enhancers, promoters and splice sites; recognizing sequences that contain base call mismatches and insertion/deletion errors; identifying biological function annotations of genomic sequences; and generating mutations of the Influenza virus and validating them against real-world observations. In each task, the paper reports significant improvement over existing state-of-the-art results.
4. **Byte-level precision**: ENBED uses a byte-level tokenization scheme in which each token is a single nucleotide. Although this increases the computational cost, it improves the robustness of the model to DNA sequence changes and noise, especially when dealing with long repetitive sequences (such as telomeres).
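To make the byte-level point concrete, the following is a minimal sketch contrasting single-nucleotide tokens with non-overlapping k-mer tokens when one base is inserted. The function names, the toy sequence, the choice of k = 3, and the use of ASCII byte values as token ids are all assumptions made for illustration; they are not taken from the paper.

```python
def byte_tokenize(seq: str) -> list[int]:
    """One token per nucleotide (assumed here: the ASCII byte value of each base)."""
    return list(seq.encode("ascii"))


def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mers; a trailing fragment shorter than k is kept."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


reference = "ACGTACGTACGT"
# Hypothetical read with a single inserted base after the fourth nucleotide.
with_insertion = reference[:4] + "G" + reference[4:]

ref_bytes = byte_tokenize(reference)
ins_bytes = byte_tokenize(with_insertion)
# Byte-level view: dropping the one inserted token recovers the reference exactly.
recovered = ins_bytes[:4] + ins_bytes[5:]

ref_kmers = kmer_tokenize(reference)
ins_kmers = kmer_tokenize(with_insertion)
# k-mer view: count reference positions whose token identity changed.
changed_kmers = sum(a != b for a, b in zip(ref_kmers, ins_kmers))

print("byte tokens recovered after removing insertion:", recovered == ref_bytes)
print("k-mer tokens changed:", changed_kmers, "of", len(ref_kmers))
```

In this toy case the insertion adds one byte-level token and leaves every other token unchanged, whereas the frame shift alters most of the downstream k-mer tokens. This is the kind of localized error that a multi-base tokenization can obscure.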
In conclusion, ENBED aims to improve the precision and efficiency of DNA sequence analysis by combining byte-level tokenization with an encoder-decoder architecture and sub-quadratic attention, thereby supporting further research in bioinformatics.
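The efficiency gain from sliding-window and global attention, mentioned in point 2 above, can be illustrated by counting which token pairs are allowed to attend to each other. This is a sketch under stated assumptions: the helper names, window size, and choice of global position are hypothetical, and the paper's exact attention pattern is not specified here. An efficient implementation would also avoid materializing the full n x n mask, which is built below only for inspection.

```python
def local_global_mask(n: int, window: int, global_positions=()) -> list[list[bool]]:
    """Return an n x n mask where mask[i][j] is True if token i may attend to token j.

    A pair is allowed if the tokens are within `window` positions of each other,
    or if either token is one of the (assumed) global positions.
    """
    global_set = set(global_positions)
    return [
        [abs(i - j) <= window or i in global_set or j in global_set for j in range(n)]
        for i in range(n)
    ]


def allowed_pairs(mask: list[list[bool]]) -> int:
    """Count the attended (query, key) pairs in the mask."""
    return sum(sum(row) for row in mask)


# Toy example: 8 tokens, a window of 1 on each side, token 0 treated as global.
mask = local_global_mask(n=8, window=1, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(allowed_pairs(mask), "of", 8 * 8, "pairs attended")
```

With a fixed window and a fixed number of global tokens, the number of attended pairs grows linearly with sequence length rather than quadratically, which is what makes longer input and output sequences tractable.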