DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks

Daoan Zhang, Weitong Zhang, Yu Zhao, Jianguo Zhang, Bing He, Chenchen Qin, Jianhua Yao

2023-08-31

Abstract: Pre-trained large language models demonstrate potential in extracting information from DNA sequences, yet adapting to a variety of tasks and data modalities remains a challenge. To address this, we propose DNAGPT, a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals. By enhancing the classic GPT model with a binary classification task (DNA sequence order), a numerical regression task (guanine-cytosine content prediction), and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks while processing both sequence and numerical data. Our evaluation of genomic signal and region recognition, mRNA abundance regression, and artificial genomes generation tasks demonstrates DNAGPT's superior performance compared to existing models designed for specific downstream tasks, benefiting from pre-training using the newly designed model structure.

Genomics, Machine Learning

What problem does this paper attempt to address?

This paper asks how to build a general pre-trained model for DNA sequence analysis that can process multiple types of data (such as sequence data and numerical data) and apply to a variety of downstream tasks. Existing pre-trained models perform well on specific tasks, but they usually process only a single type of data and cannot integrate cross-species information, which limits their generality and adaptability. To address this, the authors propose DNAGPT, a general DNA pre-training model based on the Transformer architecture. DNAGPT tackles these limitations in three ways:

1. **Multi-task pre-training**: In addition to the classic autoregressive prediction task, DNAGPT introduces two new pre-training tasks:
   - **DNA sequence order prediction**: The input sequence is randomly flipped, and the model predicts whether the flip was applied. This binary classification task strengthens the model's understanding of sequence order.
   - **Guanine-cytosine (GC) content prediction**: A segment is randomly extracted from the input sequence, its GC content is calculated, and the model predicts this value. This regression task improves the model's ability to handle numerical data.
2. **Integrated markup language**: A hierarchical token language (the "comprehensive token language" of the abstract) encodes DNA sequences, numerical attributes, and task-related information, so that different types of input can be processed within the same framework.
3. **Cross-species pre-training**: The model is pre-trained on reference genomes from all mammals, more than 200 billion base pairs in total, to improve generalization and the ability to integrate cross-species information.

With these methods, DNAGPT performs well on a range of DNA analysis tasks, including genomic signal and region recognition, mRNA abundance prediction, and artificial genome generation. The reported experiments show that DNAGPT outperforms existing models designed specifically for these tasks.
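
To make the two auxiliary pre-training tasks more concrete, the following is a minimal sketch of how training examples for them might be constructed. It is not the authors' implementation. It assumes that "flipping" means reversing the string (not taking the reverse complement), that flips occur with probability 0.5, and that the GC segment has a fixed length; the function names `make_order_example`, `gc_content`, and `make_gc_example` are hypothetical.

```python
import random


def make_order_example(seq: str, rng: random.Random) -> tuple[str, int]:
    """Build one sequence-order example: (possibly reversed sequence, label).

    Assumption: "flipping" is plain string reversal, applied with
    probability 0.5. Label 1 means the sequence was flipped.
    """
    flipped = rng.random() < 0.5
    return (seq[::-1] if flipped else seq), int(flipped)


def gc_content(segment: str) -> float:
    """Return the fraction of bases in `segment` that are G or C."""
    if not segment:
        raise ValueError("segment must be non-empty")
    gc_count = sum(1 for base in segment.upper() if base in "GC")
    return gc_count / len(segment)


def make_gc_example(seq: str, segment_len: int, rng: random.Random) -> tuple[str, float]:
    """Build one GC-content example: (random segment, regression target).

    Assumption: the segment has a fixed, caller-chosen length.
    """
    if not 0 < segment_len <= len(seq):
        raise ValueError("segment_len must be between 1 and len(seq)")
    start = rng.randrange(len(seq) - segment_len + 1)
    segment = seq[start:start + segment_len]
    return segment, gc_content(segment)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    sequence = "ATGCGCGTTAACCGGA"
    print(make_order_example(sequence, rng))
    print(make_gc_example(sequence, 8, rng))
```

In this sketch the order task yields a binary label and the GC task yields a value between 0 and 1, which matches the summary's description of one classification target and one numerical regression target.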

DNAGPT: A Generalized Pre-trained Tool for Multiple DNA Sequence Analysis Tasks

DNAHLM -- DNA sequence and Human Language mixed large language Model

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma

RNA-GPT: Multimodal Generative System for RNA Sequence Understanding

GeneGPT: augmenting large language models with domain tools for improved access to biomedical information

Generative Language Models on Nucleotide Sequences of Human Genes

Toward Understanding BERT-Like Pre-Training for DNA Foundation Models

dnaGrinder: a lightweight and high-capacity genomic foundation model

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

Long-range gene expression prediction with token alignment of large language model

DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer

Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale

GP-GPT: Large Language Model for Gene-Phenotype Mapping

GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences

Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models