DNAGPT: A Generalized Pre-trained Tool for Multiple DNA Sequence Analysis Tasks

Daoan Zhang, Weitong Zhang, Yu Zhao, Jianguo Zhang, Bing He, Chenchen Qin, Jianhua Yao
DOI: https://doi.org/10.1101/2023.07.11.548628
2024-01-04
Abstract: Pre-trained large language models demonstrate potential in extracting information from DNA sequences, yet adapting to a variety of tasks and data modalities remains a challenge. To address this, we propose DNAGPT, a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals. By enhancing the classic GPT model with a binary classification task (DNA sequence order), a numerical regression task (guanine-cytosine content prediction), and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks while processing both sequence and numerical data. Our evaluation of genomic signal and region recognition, mRNA abundance regression, and artificial genome generation tasks demonstrates DNAGPT’s superior performance compared to existing models designed for specific downstream tasks, benefiting from pre-training using the newly designed model structure.
Genomics
What problem does this paper attempt to address?
This paper introduces DNAGPT, a generalized pre-training model aimed at diverse tasks in DNA sequence analysis. Existing models struggle to adapt to different tasks and data types. DNAGPT is pre-trained on over 200 billion base pairs from all mammals and extends the classic GPT model with a binary classification task (DNA sequence order prediction), a numerical regression task (guanine-cytosine content prediction), and a comprehensive token language, which lets it process both DNA sequences and numerical data. The paper notes that previous models such as DNABERT and Nucleotide Transformer perform well on specific downstream tasks but cannot handle tasks involving numerical input or output, such as mRNA abundance regression. Evaluated on genomic signal and region recognition, mRNA abundance regression, and artificial genome generation, DNAGPT outperforms existing models designed for those specific tasks. With this generalized pre-training model, the researchers aim to make DNA information extraction and use more efficient, adapt to a range of DNA-related downstream tasks, speed up research, improve accuracy, and reduce the resources wasted on duplicated work.
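To make the two auxiliary pre-training targets more concrete, the following is a minimal Python sketch of how training labels for them might be derived from a raw sequence. It is not the authors' implementation. The function names `gc_content` and `order_example` are hypothetical, and the assumption that the sequence order task distinguishes an original sequence from its reversed copy is an illustration only; the summary above says no more than that the task concerns "DNA sequence order".

```python
import random
from typing import Tuple


def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C (0.0 if seq is empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def order_example(seq: str, rng: random.Random) -> Tuple[str, int]:
    """Build one hypothetical sequence-order training example.

    Assumption for illustration: with probability 0.5 the sequence is kept
    as-is (label 1), otherwise it is reversed (label 0).
    """
    if rng.random() < 0.5:
        return seq, 1
    return seq[::-1], 0


if __name__ == "__main__":
    rng = random.Random(0)
    sequence = "ATGC"
    print(gc_content(sequence))          # regression target: 0.5
    print(order_example(sequence, rng))  # (possibly reversed sequence, binary label)
```

The GC content value is a plain number in the range 0 to 1, which is the kind of numerical target that a model limited to sequence tokens cannot represent directly; the binary label is an ordinary classification target.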