Fine-Tuning Transformers For Genomic Tasks

Vlastimil Martinek, David Cechak, Katarina Gresova, Panagiotis Alexiou, Petr Simecek
DOI: https://doi.org/10.1101/2022.02.07.479412
2022-02-10
Abstract: Transformers are a type of neural network architecture that has been successfully used to achieve state-of-the-art performance in numerous natural language processing tasks. However, what about DNA, the language of life written in the four-letter alphabet? In this paper, we review the current state of Transformer usage in genomics and molecular biology in general, introduce a collection of benchmark datasets for the classification of genomic sequences, and compare the performance of several model architectures on those benchmarks, including DNABERT, a BERT-like model for DNA sequences, as implemented in HuggingFace (the armheb/DNA_bert_6 model). In particular, we explore the effect of pre-training on a large DNA corpus vs. training from scratch (with randomized weights). The results presented here can be used for the identification of functional elements in human and other genomes.
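To make the DNABERT input format concrete: DNABERT-style models read a DNA sequence as a series of overlapping k-mers rather than single nucleotides, and the armheb/DNA_bert_6 checkpoint named in the abstract is understood to use k = 6. The sketch below is a minimal illustration under that assumption; the helper name seq_to_kmers is hypothetical and does not come from the paper.

```python
def seq_to_kmers(sequence: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers joined by single spaces.

    Assumption: the model expects space-separated overlapping k-mers
    (stride 1), as in DNABERT-style preprocessing. This helper is
    illustrative and not taken from the paper.
    """
    sequence = sequence.upper()
    # A sequence shorter than k contains no complete k-mer.
    if len(sequence) < k:
        return ""
    # Slide a window of width k over the sequence, one base at a time.
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


print(seq_to_kmers("ATGCGTAC"))  # prints "ATGCGT TGCGTA GCGTAC"
```

The resulting string could then be passed to a tokenizer loaded with AutoTokenizer.from_pretrained("armheb/DNA_bert_6") from the HuggingFace transformers library; that step is omitted here because it requires downloading the checkpoint.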