Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA

Aimin Yang, Wei Zhang, Jiahao Wang, Ke Yang, Yang Han, Limin Zhang
DOI: https://doi.org/10.3389/fbioe.2020.01032
IF: 5.7
2020-09-04
Frontiers in Bioengineering and Biotechnology
Abstract: Deoxyribonucleic acid (DNA) is a biological macromolecule whose main function is information storage. Advances in sequencing technology have caused DNA sequence data to grow at an explosive rate, which has pushed the study of DNA sequences into the era of big data. Machine learning is a powerful technique for analyzing large-scale data and learning from it automatically. It has been widely used in DNA sequence data analysis and has produced many research results. First, the review introduces the development of sequencing technology and expounds on the concepts of DNA sequence data structure and sequence similarity. We then analyze the basic process of data mining, summarize several major machine learning algorithms, and put forward the challenges faced by machine learning algorithms in the mining of biological sequence data, along with possible future solutions. Next, we review four typical applications of machine learning in DNA sequence data: DNA sequence alignment, DNA sequence classification, DNA sequence clustering, and DNA pattern mining. We analyze their corresponding biological application background and significance, and systematically summarize the development and potential problems in the field of DNA sequence data mining in recent years. Finally, we summarize the content of the review and look ahead to some research directions for the next step.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper reviews the application of machine learning algorithms to DNA sequence data analysis. With the development of sequencing technology, DNA sequence data is growing explosively, which drives research on biological big data. Machine learning, as a powerful data analysis tool, is widely used to extract knowledge from large-scale biological data. The paper first introduces the history of sequencing technology and explains the concepts of DNA sequence data structure and sequence similarity. It then analyzes the basic process of data mining and outlines several major machine learning algorithms, together with the challenges these algorithms face in biological sequence data mining and possible future solutions. Next, the paper reviews four typical applications of machine learning in DNA sequence data: sequence alignment, sequence classification, sequence clustering, and pattern mining. For each, it discusses the biological background and significance, as well as recent developments and potential issues. The authors point out that distributed sequence alignment and parallel computing may be the future focus of DNA sequence alignment research. In sequence classification, the key challenge is how to represent sequence features effectively. In sequence clustering, the key is how to extract feature subsequences from DNA sequences. DNA pattern mining can generate a large number of candidate sequence patterns, so it requires appropriate search strategies and the elimination of redundant patterns. The paper also discusses encoding methods for DNA sequences, such as ordinal encoding, one-hot encoding, and k-mer encoding, and it emphasizes the importance of sequence similarity as the basis for DNA sequence data mining. Finally, the paper summarizes its content and looks ahead to future research directions, such as building a bridge between machine learning and bioinformatics to better analyze biomedical data.
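The three encodings named above can be illustrated with a minimal Python sketch. This is not code from the paper: the alphabet order `ACGT`, the integer values used for ordinal encoding, the default `k=3`, and the function names are all assumptions chosen for illustration, and ambiguous bases such as `N` are not handled.

```python
# Illustrative sketch only; alphabet order and numeric values are assumptions.
ALPHABET = "ACGT"


def ordinal_encode(seq):
    """Ordinal encoding: map each base to one number (here A=1, C=2, G=3, T=4)."""
    return [ALPHABET.index(base) + 1 for base in seq.upper()]


def one_hot_encode(seq):
    """One-hot encoding: map each base to a 4-element indicator vector."""
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq.upper()]


def kmer_encode(seq, k=3):
    """k-mer encoding: split the sequence into overlapping substrings of length k."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    example = "ATGCA"
    print(ordinal_encode(example))
    print(one_hot_encode(example))
    print(kmer_encode(example, k=3))
```

Ordinal encoding keeps the representation compact but imposes an artificial ordering on the bases, one-hot encoding avoids that ordering at the cost of four values per base, and k-mer encoding turns a sequence into word-like tokens that can be counted or embedded.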