Feature extraction from complex networks: A case of study in genomic sequences classification

Bruno Mendes Moro Conque, André Yoshiaki Kashiwabara, Fabrício Martins Lopes
DOI: https://doi.org/10.48550/arXiv.1412.5627
2014-12-18
Abstract: This work presents a new approach to classifying genomic sequences using measurements from complex networks and information theory. The approach considers the nucleotides, dinucleotides and trinucleotides of a genomic sequence. For each of them, the entropy, sum entropy and maximum entropy values are calculated. A network is also generated for each of them, in which the nodes are the nucleotides, dinucleotides or trinucleotides and the edges are estimated by observing the respective adjacency among them in the genomic sequence. This yields three networks, from which complex network measures are extracted. These measures, together with the information theory measures, comprise a feature vector representing a genomic sequence. The feature vector is then used for classification by methods such as SVM, MultiLayer Perceptron, J48, IBK, Naive Bayes and Random Forest in order to evaluate the proposed approach. Coding sequences, intergenic sequences and TSS (Transcription Start Sites) were adopted as datasets, for which the best results were obtained by Random Forest with 91.2% accuracy, followed by J48 with 89.1% and SVM with 84.8%. These results indicate that the new feature extraction approach has value, reaching good levels of classification using only the genomic sequences, i.e., with no other a priori knowledge about them.
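The feature extraction outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: k-mers are taken with an overlapping sliding window, "adjacency" means consecutive windows, the maximum entropy is taken as the uniform bound over a four-letter alphabet, and the average number of distinct edges per node stands in for the unspecified complex network measures. The sum entropy is omitted because its definition is not given here. All function names are hypothetical.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Overlapping k-mers of seq (assumption: sliding window, step 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def adjacency_network(seq, k):
    """Directed weighted edges: (word, next word) -> times seen adjacent."""
    words = kmers(seq, k)
    return Counter(zip(words, words[1:]))


def features(seq):
    """Feature vector: three values for each of k = 1, 2, 3."""
    vec = []
    for k in (1, 2, 3):
        edges = adjacency_network(seq, k)
        nodes = {word for edge in edges for word in edge}
        vec.append(shannon_entropy(kmers(seq, k)))
        # Assumed maximum entropy: uniform distribution over 4**k words.
        vec.append(math.log2(4 ** k))
        # Illustrative network measure (assumption): distinct edges per node.
        vec.append(len(edges) / len(nodes) if nodes else 0.0)
    return vec


if __name__ == "__main__":
    print(features("ATGCGATACGCTTGA"))
```

A vector built this way for each sequence could then be passed to any standard classifier; the classifiers named in the abstract (J48, IBK) suggest Weka, although the abstract does not state which toolkit was used.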
Computational Engineering, Finance, and Science; Machine Learning; Quantitative Methods