Genetic Clustering Algorithm-Based Feature Selection and Divergent Random Forest for Multiclass Cancer Classification Using Gene Expression Data

Senbagamalar, L.
DOI: https://doi.org/10.1007/s44196-024-00416-9
IF: 2.259
2024-02-06
International Journal of Computational Intelligence Systems
Abstract: Computational identification and classification of clinical disorders have gained major importance as machine learning methodologies have improved. Cancer identification and classification are essential clinical areas to address, and accurate classification of multiple cancer types is still a work in progress. In this article, we propose a multiclass cancer classification model that categorizes five different types of cancer using gene expression data. To analyze the available clinical data efficiently, we propose feature selection and classification methods. We propose a genetic clustering algorithm (GCA) for optimal feature selection from RNA gene expression data consisting of 801 samples belonging to five major classes of cancer. The proposed feature selection method reduces the 1621 gene expressions to a cluster of 21 features. The optimal feature set serves as input to the proposed divergent random forest. Based on the computed features, the proposed classifier categorizes the data samples into 5 different classes of cancer: breast cancer, colon cancer, kidney cancer, lung cancer, and prostate cancer. The proposed divergent random forest improved performance, achieving an accuracy of 95.21%, a specificity of 93%, and a sensitivity of 94.29%, and outperformed all the other existing multiclass classification algorithms.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?
This paper addresses accurate classification across multiple cancer types, which remains an open research area. Existing binary classification methods (for example, distinguishing cancer samples from normal samples) serve as effective diagnostic tools during continuous monitoring, but they are limited when applied to large gene expression datasets, because such data may contain samples from several cancer types rather than just two categories. A multi-class classification model that selects features efficiently and classifies effectively is therefore needed. Specifically, the paper proposes a feature selection method based on the Genetic Clustering Algorithm (GCA) and a new Divergent Random Forest (DF) classifier, aiming to solve the following problems:

1. **Dimensionality reduction of high-dimensional gene expression data**: Select the optimal feature set from a large amount of gene expression data through the genetic clustering algorithm, reducing the dimension and complexity of the data.
2. **Multi-class cancer classification**: Use the divergent random forest classifier to classify five different types of cancer (breast cancer, colon cancer, kidney cancer, lung cancer, and prostate cancer).
3. **Improved classification performance**: By optimizing feature selection and classification, improve the accuracy, specificity, and sensitivity of classification so that the model outperforms existing algorithms on multi-class cancer classification tasks.
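To make the feature selection problem concrete, the following is a minimal sketch of a genetic search over fixed-size feature subsets. It is not the paper's GCA, whose clustering step is not described in this summary; it is a plain genetic algorithm with truncation selection, union-based crossover, and single-gene mutation. The synthetic data, the nearest-centroid fitness function, and all names and parameter values (`make_toy_data`, `centroid_accuracy`, `genetic_select`, `k`, `pop_size`, and so on) are assumptions made for illustration only.

```python
import random


def make_toy_data(n_samples=60, n_features=30, n_classes=3, n_informative=4, seed=0):
    """Synthetic stand-in for an expression matrix (hypothetical data).

    Only the first n_informative columns carry class signal.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % n_classes
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 3.0 * label  # shift informative features by class
        X.append(row)
        y.append(label)
    return X, y


def centroid_accuracy(X, y, features):
    """Fitness: training accuracy of a nearest-centroid classifier
    restricted to the chosen feature indices (an assumed stand-in,
    not the paper's fitness function)."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for x, label in zip(X, y):
        dist = {
            c: sum((x[f] - m) ** 2 for f, m in zip(features, centroids[c]))
            for c in classes
        }
        if min(dist, key=dist.get) == label:
            correct += 1
    return correct / len(y)


def genetic_select(X, y, k=4, pop_size=20, generations=15, mutation_rate=0.2, seed=1):
    """Search for a subset of k feature indices with high fitness."""
    rng = random.Random(seed)
    n_features = len(X[0])
    population = [sorted(rng.sample(range(n_features), k)) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda s: centroid_accuracy(X, y, s), reverse=True)
        parents = scored[: pop_size // 2]  # truncation selection; parents survive (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))
            child = rng.sample(pool, k)  # crossover: draw k genes from the parents' union
            if rng.random() < mutation_rate:
                unused = [f for f in range(n_features) if f not in child]
                child[rng.randrange(k)] = rng.choice(unused)  # swap in one unused feature
            children.append(sorted(child))
        population = parents + children
    best = max(population, key=lambda s: centroid_accuracy(X, y, s))
    return best, centroid_accuracy(X, y, best)


if __name__ == "__main__":
    X, y = make_toy_data()
    subset, score = genetic_select(X, y)
    print("selected feature indices:", subset, "fitness:", score)
```

In a realistic setting the fitness would be a cross-validated score of the downstream classifier rather than training accuracy, and the selected columns would then be passed to the random forest stage.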
The paper's main contribution is a new genetic clustering algorithm and a divergent random forest classifier that together reach an accuracy of 95.21%, a specificity of 93%, and a sensitivity of 94.29% on the multi-class cancer classification task, which the authors report as outperforming the other existing multi-class classification algorithms.
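For readers unfamiliar with how accuracy, sensitivity, and specificity extend beyond two classes, here is a small sketch that derives them from a confusion matrix using a one-vs-rest view of each class. The macro (unweighted) averaging across classes is an assumption; the summary does not say how the paper aggregates per-class values. The function name `multiclass_metrics` and the example matrix are hypothetical and are not the paper's results.

```python
def multiclass_metrics(cm):
    """Return (accuracy, macro sensitivity, macro specificity).

    cm[i][j] is the number of samples of true class i predicted as class j.
    Assumes every class has at least one sample and at least one negative.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    sensitivities, specificities = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c samples predicted as other classes
        fp = sum(cm[r][c] for r in range(n)) - tp   # other classes predicted as class c
        tn = total - tp - fn - fp
        sensitivities.append(tp / (tp + fn))
        specificities.append(tn / (tn + fp))
    return accuracy, sum(sensitivities) / n, sum(specificities) / n


if __name__ == "__main__":
    # Made-up 3-class confusion matrix, for illustration only.
    example = [
        [8, 2, 0],
        [1, 9, 0],
        [0, 0, 10],
    ]
    acc, sens, spec = multiclass_metrics(example)
    print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Because specificity counts every sample outside the class as a negative, it tends to run high when there are many classes, so it is best read alongside sensitivity rather than on its own.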