Simulation and Analysis of Bionanopore DNA Sequencing Signals for Genetic Mutations Detection
Iryna M. Ievdoshchenko, Kateryna Olehivna Ivanko, Nataliia Heorhiivna Ivanushkina, Vishwesh Kulkarni
DOI: https://doi.org/10.20535/2523-4455.mea.217265
2021-04-29
Abstract: This paper applies genomic signal processing methods to the modeling and analysis of nanopore DNA sequencing signals. Based on normal and mutated nucleotide sequences, 1200 signals are simulated, representing 4 classes: norm, missense mutation, insertion mutation and deletion mutation. Correlation analysis is used to assess the similarity of nanopore DNA sequencing signals by computing the cross-correlation function between two current signals in the protein nanopore, specifically the signal in the norm and the signal in the presence of a mutation. The location of the correlation peak indicates the type of mutation (insertion or deletion) and gives the signal shift needed to align the identical nucleotide sequences. The results of applying machine learning methods to the classification of nanopore DNA sequencing signals depend strongly on the noise level of the recorded current signals through the protein nanopore and on the type of mutation. At a relatively low noise level, when the ion current values through the protein nanopore for different nucleotides do not overlap, the classification accuracy reaches 100%. As the standard deviation of the noise distribution increases, the current levels in the nanopore overlap when it is blocked by nucleotides of similar size. As a result, errors often occur in distinguishing the norm from single-nucleotide mutations (missense or nonsense), especially if the current step levels in the nanopore for two nucleotides are similar (for example, guanine and thymine, thymine and adenine, adenine and cytosine) and noise masks their contribution to the current reduction in the nanopore.
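The cross-correlation step described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the per-nucleotide current levels in `LEVELS`, the number of samples per base, the Gaussian noise model, and the helper names `simulate_signal` and `estimate_shift`. Under this sign convention, a positive lag suggests an insertion in the query signal and a negative lag suggests a deletion.

```python
import numpy as np

# Hypothetical blockade current levels in pA (illustrative, not from the paper).
# The ordering G < T < A < C only mirrors the "similar neighbours" named in the abstract.
LEVELS = {"G": 40.0, "T": 50.0, "A": 60.0, "C": 70.0}


def simulate_signal(seq, samples_per_base=10, noise_sd=1.0, rng=None):
    """Return a step-like current trace: one noisy plateau per nucleotide."""
    rng = np.random.default_rng() if rng is None else rng
    steps = np.repeat([LEVELS[b] for b in seq], samples_per_base)
    return steps + rng.normal(0.0, noise_sd, steps.size)


def estimate_shift(reference, query):
    """Return the lag (in samples) of the cross-correlation peak.

    A lag k > 0 means the query looks like the reference delayed by k samples.
    """
    ref = reference - reference.mean()  # remove the DC offset before correlating
    qry = query - query.mean()
    xcorr = np.correlate(qry, ref, mode="full")
    lags = np.arange(-(ref.size - 1), qry.size)  # lag axis of the "full" output
    return int(lags[np.argmax(xcorr)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal = "".join(rng.choice(list("ACGT"), size=100))
    inserted = normal[:5] + "GAT" + normal[5:]   # 3-base insertion near the start
    deleted = normal[:5] + normal[8:]            # 3-base deletion near the start
    ref = simulate_signal(normal, rng=rng)
    print("insertion lag:", estimate_shift(ref, simulate_signal(inserted, rng=rng)))
    print("deletion lag:", estimate_shift(ref, simulate_signal(deleted, rng=rng)))
```

Dividing the estimated lag by the number of samples per base gives the shift in nucleotides, which is the offset one would apply to align the matching parts of the two sequences. The mutation is placed near the start of the sequence so that the shifted segment dominates the correlation peak.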
Insertion and deletion mutations of a given nucleotide sequence are often classified without errors, because these mutations produce a shift of several nucleotides between the normal and pathological signals, which increases the distance between these signals. The machine learning methods that demonstrated high accuracy in classifying nanopore-based DNA sequencing signals include linear discriminant analysis, the k-nearest neighbors classifier (with Euclidean distance and a sufficient number of nearest neighbors), and support vector machines. The best results were obtained with support vector machines: linear, quadratic and cubic kernel functions yield classification accuracy from 93 to 100%.
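The dependence of classification accuracy on noise level can be sketched with a small k-nearest neighbors classifier using Euclidean distance, one of the methods named above. This is not the authors' pipeline and does not reproduce their 1200-signal data set or their support vector machine results. The sequence `NORMAL`, the current levels in `LEVELS`, the mutation positions, and the helper names `make_dataset`, `knn_predict` and `evaluate` are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical blockade current levels in pA (illustrative, not from the paper).
LEVELS = {"G": 40.0, "T": 50.0, "A": 60.0, "C": 70.0}

# Hypothetical reference sequence and one example of each mutation class.
NORMAL = "ATGGCCTTAGCAGGTCATCGATTGCAGCTA"
VARIANTS = [
    NORMAL,                               # class 0: norm
    NORMAL[:10] + "A" + NORMAL[11:],      # class 1: missense (C -> A at index 10)
    NORMAL[:10] + "TGA" + NORMAL[10:],    # class 2: insertion of 3 bases
    NORMAL[:10] + NORMAL[13:],            # class 3: deletion of 3 bases
]
# Truncate every sequence to a common length so Euclidean distance is defined.
N_BASES = min(len(s) for s in VARIANTS)


def make_dataset(noise_sd, n_per_class, rng, samples_per_base=10):
    """Simulate n_per_class noisy step signals for each of the four classes."""
    x, y = [], []
    for label, seq in enumerate(VARIANTS):
        steps = np.repeat([LEVELS[b] for b in seq[:N_BASES]], samples_per_base)
        x.append(steps + rng.normal(0.0, noise_sd, (n_per_class, steps.size)))
        y.append(np.full(n_per_class, label))
    return np.vstack(x), np.concatenate(y)


def knn_predict(train_x, train_y, test_x, k=3):
    """Majority vote among the k training signals closest in Euclidean distance."""
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in train_y[nearest]])


def evaluate(noise_sd, rng, n_train=30, n_test=10, k=3):
    """Return the fraction of correctly classified held-out signals."""
    train_x, train_y = make_dataset(noise_sd, n_train, rng)
    test_x, test_y = make_dataset(noise_sd, n_test, rng)
    return float((knn_predict(train_x, train_y, test_x, k) == test_y).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for sd in (0.5, 5.0, 30.0):
        print(f"noise sd = {sd:5.1f} pA  accuracy = {evaluate(sd, rng):.2f}")
```

In this toy setup the norm and missense classes differ in only one plateau, so they are the first to be confused as the noise standard deviation grows, while the shifted insertion and deletion signals stay farther apart for longer.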