From tradition to innovation: conventional and deep learning frameworks in genome annotation

Zhaojia Chen, Noor ul Ain, Qian Zhao, Xingtan Zhang
DOI: https://doi.org/10.1093/bib/bbae138
IF: 9.5
2024-04-08
Briefings in Bioinformatics
Abstract: Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies, accompanied by the provision of vast amounts of whole-genome sequences, high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from this massive dataset has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. However, early bioinformatics algorithms and software primarily employed shallow learning techniques; thus, the ability to characterize data and features learning was limited. With the widespread adoption of RNA-Seq technology, scientists from the biological community began to harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation underscoring the dynamic nature of this evolving scientific landscape.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
This paper surveys how conventional methods and deep-learning frameworks are applied to genome annotation, with particular attention to how deep learning can improve annotation accuracy and efficiency. It focuses on the following aspects:

1. **Challenges in genome annotation**:
   - **Massive data processing**: High-throughput sequencing technologies such as RNA-Seq and ChIP-Seq have produced large volumes of whole-genome sequence data, and extracting biologically meaningful information from these data has become crucial.
   - **Limitations of traditional methods**: Traditional wet-lab methods require extensive effort for functional verification, while early bioinformatics algorithms and software relied mainly on shallow-learning techniques, which limited their ability to learn features and represent data. Cost and technical accessibility add further constraints when processing high-throughput data.
2. **Applications of deep learning**:
   - **Feature extraction and prediction**: Deep-learning models abstract data through multiple layers of non-linear functions, extract representative features from large-scale datasets, and can predict the functions of DNA fragments more accurately.
   - **Automation and flexibility**: Given appropriate data and model training, deep learning can learn features and rules automatically, reducing the need for manual intervention.
3. **Specific application scenarios**:
   - **Identification and classification of transposable elements (TEs)**: Traditional bioinformatics software is prone to a high false-positive rate when identifying transposable elements, while deep-learning methods such as DeepTE and Inpactor2 improve identification accuracy and efficiency.
   - **Prediction of protein-coding genes**: The coding regions of eukaryotic genes are usually discontinuous, with exons interrupted by introns. Deep-learning methods can better capture these complex structures and improve prediction accuracy.
4. **Future development directions**:
   - **Integration of new technologies and methods**: The paper emphasizes that genome annotation is a continuously evolving process that must keep incorporating the latest technologies and methods to improve the accuracy and reliability of annotation.
   - **Meeting the challenges of genomic variation and regulatory elements**: Deep learning is well suited to handling genomic variation (such as structural variants, SVs) and regulatory elements (such as promoters and enhancers), which can help clarify gene-regulatory mechanisms.

In summary, by comparing traditional methods with deep-learning frameworks, the paper shows the potential of deep learning in genome annotation and offers new perspectives and directions for future research.
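The feature-extraction idea in item 2 (layers of non-linear functions applied to sequence data) can be made concrete with a small sketch. The Python below is illustrative only: it is not the architecture of DeepTE, Inpactor2, or any tool discussed in the paper. It assumes a one-hot encoding of DNA and a single hand-set convolution filter for the motif `TATA`; the names `one_hot`, `conv1d`, `relu`, `motif_score`, and `tata_kernel` are hypothetical. A real model would learn many such filters from data and stack several layers.

```python
# Illustrative sketch: one convolution filter + ReLU + global max pooling.
# The filter weights are set by hand (an assumption); real models learn them.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors (order A, C, G, T).

    Any other character (for example N) becomes an all-zero vector.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, kernel, bias=0.0):
    """Slide a (width x 4) kernel along the sequence; return one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores


def relu(values):
    """Non-linearity: negative scores are clipped to zero."""
    return [max(0.0, v) for v in values]


# Hypothetical filter: weight 1 for each base that matches the motif TATA.
tata_kernel = one_hot("TATA")


def motif_score(seq):
    """Conv -> ReLU -> max pool. With bias -3, only a full 4/4 match is positive."""
    activations = relu(conv1d(one_hot(seq), tata_kernel, bias=-3.0))
    return max(activations) if activations else 0.0


if __name__ == "__main__":
    print(motif_score("GGCTATAAGG"))  # contains TATA
    print(motif_score("GGCCGGCC"))    # no TATA
```

The bias of -3 followed by the ReLU is what makes the filter selective: partial matches score at most 3 and are clipped to zero, so only an exact occurrence of the motif survives the pooling step.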