Abstract: Predicting the stage of cancer plays an important role in planning the course of treatment, but it has largely relied on imaging tools, which do not capture the molecular events that drive cancer progression. Gene-expression-based analyses can identify these events, allowing RNA-sequencing and microarray data to be used for cancer analysis. Breast cancer is the most common cancer worldwide and is classified into four stages: stages 1, 2, 3, and 4 [2]. Machine learning models have previously been explored for stage classification with limited success, and multi-class stage classification in particular has seen little progress. There is a need for improved multi-class classification models, for example through deep learning. Gene-expression-based cancer data is characterised by small dataset sizes, class imbalance, and high dimensionality, so class balancing methods must be applied to the dataset. Since not all genes are necessary for stage prediction, retaining only the necessary genes can improve classification accuracy. In this work, breast cancer samples are classified into four classes, stages 1 to 4. Invasive ductal carcinoma breast cancer samples are obtained from The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets and combined. Two class balancing techniques are explored: synthetic minority oversampling technique (SMOTE), and SMOTE followed by random undersampling. A hybrid feature selection approach is proposed, with three pipelines explored that combine filter and embedded feature selection methods: Pipeline 1, minimum-redundancy maximum-relevancy (mRMR) and correlation feature selection (CFS); Pipeline 2, mRMR, mutual information (MI), and CFS; and Pipeline 3, mRMR and support vector machine-recursive feature elimination (SVM-RFE).
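The second class balancing option, SMOTE followed by random undersampling, can be illustrated with a minimal sketch. This is a simplified, standard-library-only version of the idea and not the authors' implementation: the function names `smote` and `balance`, the common `target` class size, the neighbour count `k`, and the toy stage counts are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def smote(samples, n_new, k=5, rng=None):
    """Simplified SMOTE: build n_new synthetic rows by interpolating between
    a minority sample and one of its k nearest same-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Candidate neighbours are the other rows of the same class.
        others = [s for s in samples if s is not base]
        neighbours = sorted(others, key=lambda s: math.dist(base, s))[:k]
        other = rng.choice(neighbours) if neighbours else base
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, target, k=5, seed=0):
    """Oversample classes below `target` with SMOTE, then randomly
    undersample classes above `target` (assumed common class size)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) < target:
            rows = rows + smote(rows, target - len(rows), k, rng)
        elif len(rows) > target:
            rows = rng.sample(rows, target)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out


# Toy data: hypothetical, imbalanced stage counts with 4 made-up features.
rng = random.Random(42)
counts = {1: 30, 2: 60, 3: 20, 4: 5}
X, y = [], []
for stage, n in counts.items():
    for _ in range(n):
        X.append([rng.gauss(stage, 1.0) for _ in range(4)])
        y.append(stage)

X_bal, y_bal = balance(X, y, target=25)
print(Counter(y_bal))
```

Because each synthetic row is a convex combination of two real rows of the same class, oversampling adds no values outside the range already observed for that class. In practice, balancing would be applied to the training split only, so that synthetic samples do not leak into evaluation.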
Classification is performed using deep learning models, namely a deep neural network (DNN), a convolutional neural network, a recurrent neural network, a modified DNN, and an AutoKeras-generated model. Classification performance after class balancing and the various feature selection techniques shows marked improvement over classification prior to feature selection. The best multi-class classification was achieved by a DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving a Cohen-Kappa score (CKS) of 0.303 and a classification accuracy of 53.1%. For binary classification into early- and late-stage cancer, the best performance was obtained by the modified DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving an accuracy of 81.0% and a CKS of 0.280. This pipeline also improved multi-class classification performance on neuroblastoma data, with a best area under the receiver operating characteristic curve (auROC) of 0.872, compared to 0.71 obtained in previous work, an improvement of 22.81%. The results and analysis reveal that feature selection techniques play a vital role in gene-expression-based classification, and that the proposed hybrid feature selection pipeline improves classification performance. Multi-class classification is possible using deep learning models, though further improvement, particularly in late-stage classification, is necessary and should be explored.
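The gap between the reported binary accuracy (81.0%) and Cohen-Kappa score (0.280) is typical of imbalanced data, since kappa discounts the agreement expected by chance. The sketch below computes kappa as (p_o - p_e) / (1 - p_e) on a hypothetical confusion pattern; the labels and counts are invented for illustration and are not the paper's data. It also reproduces the relative auROC gain from the two values quoted above.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of samples labelled identically.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal label frequencies.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary predictions (NOT the paper's data):
# 80 early-stage and 20 late-stage samples.
y_true = ["early"] * 80 + ["late"] * 20
y_pred = ["early"] * 75 + ["late"] * 5 + ["early"] * 14 + ["late"] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")

# Relative auROC gain from the two values quoted in the abstract.
relative_gain = (0.872 - 0.71) / 0.71 * 100
print(f"relative auROC gain: {relative_gain:.1f}%")
```

In this toy case most of the correct predictions come from the majority (early-stage) class, so accuracy is high while kappa stays low, which is consistent with the abstract's remark that late-stage classification needs further improvement.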

Application of Feature Selection and Deep Learning for Cancer Prediction Using DNA Methylation Markers

Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture

A hybrid metaheuristic-deep learning technique for the pan-classification of cancer based on DNA methylation

Abstract 6697: Deep learning algorithm for cancer detection using multimodal characteristics of whole methylome sequencing of cf-DNA

A new parsimonious method for classifying Cancer Tissue-of-Origin Based on DNA Methylation 450K data

A deep embedded refined clustering approach for breast cancer distinction based on DNA methylation

Accurate Prediction of Pan-Cancer Types Using Machine Learning with Minimal Number of DNA Methylation Sites.

Prediction of epigenetically regulated genes in breast cancer cell lines

Application of deep learning in cancer epigenetics through DNA methylation analysis

A machine learning-based method for feature reduction of methylation data for the classification of cancer tissue origin

Integrative analysis of DNA methylation and gene expression profiles identified potential breast cancer-specific diagnostic markers

Deep-Learning-Based Cancer Profiles Classification Using Gene Expression Data Profile

Deep Neural Network for Analysis of DNA Methylation Data

Enhancing Breast Cancer Prediction through Deep Learning and Comparative Analysis of Gene Expression and DNA Methylation Data using Convolutional Neural Networks

Diagnostic classification based on DNA methylation profiles using sequential machine learning approaches.

Deep Learning for Cancer Type Classification and Driver Gene Identification

Identifying Epigenetic Signature of Breast Cancer with Machine Learning

DNA Methylation Markers for Pan-Cancer Prediction by Deep Learning

A Self-attention Graph Convolutional Network for Precision Multi-tumor Early Diagnostics with DNA Methylation Data

Advancing Breast Cancer Subtype Prediction and Mutation Analysis: Integrating Deep Learning and Machine Learning Techniques in Genomic Research

An Intelligent Classification System for Cancer Detection Based on DNA Methylation Using ML and Semantic Knowledge in Healthcare