Abstract: Following the milestone success of the Human Genome Project, the Encyclopedia of DNA Elements (ENCODE) initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies, accompanied by vast amounts of whole-genome sequence data and high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from these massive datasets has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still require extensive effort for functional verification, and early bioinformatics algorithms and software primarily employed shallow learning techniques, so their capacity for data representation and feature learning was limited. With the widespread adoption of RNA-Seq technology, scientists in the biological community began to harness machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.

### What problems does this paper attempt to solve?

This paper explores and summarizes the application of traditional methods and deep-learning frameworks to genome annotation, and in particular how deep-learning techniques can improve annotation accuracy and efficiency. It focuses on the following aspects:

1. **Challenges in genome annotation**
   - **Massive data processing**: High-throughput sequencing technologies (such as RNA-Seq and ChIP-Seq) have generated large amounts of whole-genome sequence data. Extracting biologically meaningful information from these data has become crucial.
   - **Limitations of traditional methods**: Traditional experimental methods require extensive verification effort, and early bioinformatics algorithms relied mainly on shallow-learning techniques, which limited their ability to learn features and represent data, especially given the cost and technical-accessibility challenges of processing high-throughput data.
2. **Applications of deep learning**
   - **Feature extraction and prediction**: Deep-learning models abstract data through multiple layers of non-linear functions, extract representative features from large-scale datasets, and make more accurate predictions of the functions of DNA fragments.
   - **Automation and flexibility**: Given appropriate data and model training, deep learning can learn features and rules automatically, reducing the need for manual intervention.
3. **Specific application scenarios**
   - **Identification and classification of transposable elements (TEs)**: Traditional bioinformatics software is prone to a high false-positive rate when identifying transposable elements, while deep-learning methods such as DeepTE and Inpactor2 significantly improve identification accuracy and efficiency.
   - **Prediction of protein-coding genes**: The coding regions of eukaryotic genes are usually discontinuous, consisting of exons interrupted by introns. Deep-learning methods can better capture these complex structures and improve prediction accuracy.
4. **Future development directions**
   - **Integration of new technologies and methods**: The paper emphasizes that genome annotation is a continuously evolving process that must keep incorporating the latest technologies and methods to improve the accuracy and reliability of annotation.
   - **Meeting the challenges of genomic variation and regulatory elements**: Deep learning has distinct advantages in handling genomic variation (such as structural variants, SVs) and regulatory elements (such as promoters and enhancers), and can help clarify gene-regulatory mechanisms.

In summary, this paper aims to show the potential of deep learning in genome annotation by comparing traditional methods with deep-learning frameworks, and to provide new perspectives and directions for future research.
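The "multiple layers of non-linear functions" idea in the feature-extraction point above can be illustrated with a minimal sketch. This is not the method of any tool named in the summary. The `TATA_KERNEL` weights, the function names, and the output-layer weight and bias are all hypothetical and hand-set for illustration, whereas a trained model would learn them from data. The sketch one-hot encodes a DNA string, applies one convolutional filter with a ReLU non-linearity, pools the result, and passes it through a sigmoid output unit.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T); other symbols such as N become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_relu(encoded, kernel):
    """Slide a (width x 4) kernel along the encoded sequence and apply ReLU to each window score."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        activations.append(max(0.0, score))
    return activations

def global_max_pool(activations):
    """Keep only the strongest activation, regardless of where it occurred."""
    return max(activations) if activations else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical hand-set filter that rewards the 4-mer "TATA" (+1 for the expected
# base at each position, -1 otherwise). A real model would learn these weights.
TATA_KERNEL = [
    [-1, -1, -1, 1],  # position 1 expects T
    [1, -1, -1, -1],  # position 2 expects A
    [-1, -1, -1, 1],  # position 3 expects T
    [1, -1, -1, -1],  # position 4 expects A
]

def motif_score(seq):
    """Two stacked non-linear steps: conv + ReLU + max-pool, then a sigmoid output unit."""
    feature = global_max_pool(conv1d_relu(one_hot(seq), TATA_KERNEL))
    # Output-layer weight 2.0 and bias -6.0 are assumed values chosen for illustration.
    return sigmoid(2.0 * feature - 6.0)
```

Calling `motif_score("GGCTATAAGG")` gives a score well above 0.5 because the sequence contains the motif, while `motif_score("GGCCGGCC")` gives a score near zero. Real annotation models stack many such filters and layers and fit the weights by gradient descent rather than setting them by hand.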

From tradition to innovation: conventional and deep learning frameworks in genome annotation
