Abstract: Microarray datasets frequently contain a large number of insignificant and irrelevant genes, which can lead to the loss of valuable information. Gene sets with both high importance and high significance are commonly preferred for gene selection, because they determine how samples are classified into their respective classes. This property has attracted considerable attention among specialists in microarray dataset classification. The trained classifier models are tested on four cancer datasets and one Huntington's disease (HD) dataset. The prostate cancer (Singh) dataset comprises 102 samples, 52 of which are tumors and 50 are normal, with 12625 genes. The lung cancer (Gordon) dataset comprises 181 samples, 150 of which are normal and 31 are tumors, with 12533 genes. The breast cancer (Chin) dataset comprises 118 samples, 43 of which are normal and 75 are tumors, with 22215 genes. The breast cancer (Chowdary) dataset comprises 104 samples, 62 of which are normal and 42 are tumors, with 22283 genes. Finally, the Huntington's disease (Borovecki) dataset comprises 31 samples, 14 of which are normal and 17 have Huntington's disease, with 22283 genes. This paper uses the Multilayer Perceptron (MLP), Random Forest (RF) and Linear Support Vector Classifier (LSVC) classification algorithms with six feature selection methods: Principal Component Analysis (PCA), Extra Tree Classifier (ETC), Analysis of Variance (ANOVA), Least Absolute Shrinkage and Selection Operator (LASSO), Chi-Square and Random Forest Regressor (RFR). Further, the paper presents a comparative analysis of the classification accuracy and time consumed by the models in a Spark environment and in a conventional system, using these two performance measures to analyze the behavior of the classifiers in both environments.
The results indicate that the models in the Spark environment were extremely effective for processing high-dimensional data, which for some algorithms cannot be processed with a conventional implementation. A proposed hybrid model combining an embedded approach (LASSO) and a filter approach (ANOVA) was then used to select the optimized features from the high-dimensional dataset. Classification is then performed on the reduced dataset to label the samples as normal or abnormal, running on Spark in a Hadoop cluster (in a distributed manner). The proposed model achieved an accuracy of 100% on the Borovecki dataset with all classifiers, and 100% on the Singh, Chowdary and Gordon datasets with the RF and LSVC classifiers. On the Chin dataset, accuracy was 96% with the RF classifier, using the optimal genes with respect to accuracy and time consumed.
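
The hybrid selection described above (an ANOVA filter combined with LASSO as an embedded selector) can be illustrated with a minimal single-machine sketch. This is not the paper's Spark implementation. It assumes the filter is applied first and LASSO second, uses plain Python instead of a distributed library, and runs on synthetic data in which only genes 0 and 1 carry signal. The function names (`anova_f`, `lasso_coefficients`, `hybrid_select`) and the values of `k` and `alpha` are hypothetical choices made for illustration.

```python
import random
from statistics import mean


def anova_f(column, labels):
    """One-way ANOVA F-statistic of a single gene across the class labels."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(column)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    within = 0.0
    for g in groups.values():
        m = mean(g)
        within += sum((v - m) ** 2 for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(column) - len(groups)
    return (between / df_between) / (within / df_within)


def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values inside [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coefficients(columns, y, alpha, n_iter=200):
    """Cyclic coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1.

    `columns` holds centred feature columns and `y` the centred target.
    """
    n = len(y)
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            scale = sum(c * c for c in col) / n
            if scale == 0:
                continue  # a constant gene carries no information
            # Correlation of gene j with the residual, adding back its own term.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / scale
            if new_w != w[j]:
                residual = [r - (new_w - w[j]) * c for c, r in zip(col, residual)]
                w[j] = new_w
    return w


def hybrid_select(X, y, k, alpha):
    """ANOVA filter (top k genes), then LASSO; returns the kept gene indices."""
    columns = [list(col) for col in zip(*X)]
    scores = [anova_f(col, y) for col in columns]
    top_k = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]
    y_mean = mean(y)
    y_centred = [v - y_mean for v in y]
    centred = []
    for j in top_k:
        m = mean(columns[j])
        centred.append([v - m for v in columns[j]])
    w = lasso_coefficients(centred, y_centred, alpha)
    return sorted(j for j, coef in zip(top_k, w) if coef != 0.0)


def synthetic_sample(label, n_genes=50):
    """Hypothetical expression profile: genes 0 and 1 shift with the class label."""
    return [(3.0 * label if g < 2 else 0.0) + random.gauss(0, 1) for g in range(n_genes)]


random.seed(0)
labels = [i % 2 for i in range(40)]          # 20 normal, 20 abnormal samples
X = [synthetic_sample(label) for label in labels]
selected = hybrid_select(X, labels, k=10, alpha=0.1)
print("genes kept after ANOVA + LASSO:", selected)
```

The filter step cheaply discards most genes before the more expensive LASSO fit, which is one plausible reason for pairing the two. A classifier such as RF or LSVC would then be trained on the columns listed in `selected`.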

A novel parallel feature rank aggregation algorithm for gene selection applied to microarray data classification

Parameters Selection in Gene Selection Using Gaussian Kernel Support Vector Machines by Genetic Algorithm

A hybrid feature selection approach for Microarray datasets using graph theoretic-based method

Gene Selection Algorithm Based on Correlation Analysis

Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets

A Novel Hybrid Gene Selection Based on Random Forest Approach and Binary Dragonfly Algorithm

Multilevel Feature Selection Method for Improving Classification of Microarray Gene Expression Data

A novel hybrid algorithm based on Harris Hawks for tumor feature gene selection

Gene selection for cancer classification using a hybrid of univariate and multivariate feature selection methods

A Novel Approach for Single Gene Selection Using Clustering and Dimensionality Reduction

Feature selection for classification of microarray gene expression cancers using Bacterial Colony Optimization with multi-dimensional population

Feature selection of microarray data using multidimensional graph neural network and supernode hierarchical clustering

A comparative study of nature-inspired metaheuristic algorithms using a three-phase hybrid approach for gene selection and classification in high-dimensional cancer datasets

Median Selection Subset Aggregation for Parallel Inference

Gene Features Selection for Three-Class Disease Classification via Multiple Orthogonal Partial Least Square Discriminant Analysis and S-Plot Using Microarray Data

Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data

C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods

Optimized feature selection method using particle swarm intelligence with ensemble learning for cancer classification based on microarray datasets

Hybrid ANOVA and LASSO Methods for Feature Selection and Linear Support Vector, Multilayer Perceptron and Random Forest Classifiers Based on Spark Environment for Microarray Data Classification

A novel and innovative cancer classification framework through a consecutive utilization of hybrid feature selection

Gene selection and classification for cancer microarray data based on machine learning and similarity measures