Abstract: Predicting the stage of cancer plays an important role in planning the course of treatment, but it has relied largely on imaging tools, which do not capture the molecular events that drive cancer progression. Gene-expression-based analyses can identify these events, allowing RNA-sequencing and microarray data to be used for cancer analysis. Breast cancer is the most common cancer worldwide and is classified into four stages: stages 1, 2, 3, and 4 [2]. Machine learning models have previously been explored for stage classification with limited success, and multi-class stage classification in particular has seen little progress. Improved multi-class classification models are therefore needed, and deep learning models are one avenue to investigate. Gene-expression-based cancer data is characterised by small datasets, class imbalance, and high dimensionality, so class balancing methods must be applied to the dataset. Since not all genes are necessary for stage prediction, retaining only the relevant genes can improve classification accuracy. In this work, breast cancer samples are classified into four classes corresponding to stages 1 to 4. Invasive ductal carcinoma breast cancer samples are obtained from The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets and combined. Two class balancing techniques are explored: synthetic minority oversampling technique (SMOTE), and SMOTE followed by random undersampling. A hybrid feature selection approach is proposed, with three pipelines explored that combine filter and embedded feature selection methods: Pipeline 1, minimum-redundancy maximum-relevancy (mRMR) and correlation feature selection (CFS); Pipeline 2, mRMR, mutual information (MI), and CFS; and Pipeline 3, mRMR and support vector machine-recursive feature elimination (SVM-RFE).
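The second balancing strategy named above, SMOTE followed by random undersampling, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`smote`, `random_undersample`, `balance`), the per-class `target` size, the neighbour count `k`, and the toy three-feature data are all assumptions made for illustration. Classes smaller than `target` receive synthetic samples interpolated between a sample and one of its nearest same-class neighbours, and classes larger than `target` are randomly cut down.

```python
import random
from collections import Counter


def smote(minority, n_new, k, rng):
    """Return n_new synthetic rows interpolated between rows of one class."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two samples in the class")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other rows of the same class by squared Euclidean distance.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def random_undersample(rows, n_keep, rng):
    """Keep a random subset of n_keep rows, sampled without replacement."""
    return rng.sample(rows, n_keep)


def balance(X, y, target, k=5, seed=0):
    """Bring every class to `target` rows (hypothetical balancing scheme)."""
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        if len(rows) < target:
            # Oversample: synthetic rows are built from the original rows only.
            rows = rows + smote(rows, target - len(rows), k, rng)
        elif len(rows) > target:
            rows = random_undersample(rows, target, rng)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out


# Toy, made-up data: four "stages" with imbalanced counts and 3 features each.
data_rng = random.Random(42)
counts = {1: 6, 2: 12, 3: 4, 4: 2}
X, y = [], []
for stage, n in counts.items():
    for _ in range(n):
        X.append([data_rng.gauss(stage, 0.5) for _ in range(3)])
        y.append(stage)

X_bal, y_bal = balance(X, y, target=6)
print(sorted(Counter(y_bal).items()))  # [(1, 6), (2, 6), (3, 6), (4, 6)]
```

In practice, resampling of this kind is normally applied to the training split only, so that synthetic samples do not leak into the evaluation data.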
Classification is performed using deep learning models, namely a deep neural network (DNN), a convolutional neural network, a recurrent neural network, a modified DNN, and an AutoKeras-generated model. Classification performance after class balancing and the various feature selection techniques shows marked improvement over classification without feature selection. The best multi-class classification was achieved by a DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving a Cohen-Kappa score (CKS) of 0.303 and a classification accuracy of 53.1%. For binary classification into early- and late-stage cancer, the best performance was obtained by a modified DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving an accuracy of 81.0% and a CKS of 0.280. This pipeline also improved multi-class classification performance on neuroblastoma data, with a best area under the receiver operating characteristic curve (auROC) of 0.872, compared with 0.71 obtained in previous work, an improvement of 22.81%. The results and analysis show that feature selection plays a vital role in gene-expression-based classification, and that the proposed hybrid feature selection pipeline improves classification performance. Multi-class classification is possible using deep learning models, though further improvement is needed, particularly in late-stage classification.
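The abstract reports both accuracy and a Cohen-Kappa score, and the two can differ widely: kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance from the label frequencies. The sketch below computes it from scratch. The toy labels are hypothetical and are not the paper's confusion matrix; they were chosen only to show how an accuracy of 0.81 can coexist with a much lower kappa when one class dominates.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance.

    Assumes the chance agreement p_e is below 1, i.e. the labels are not
    all one identical class in both lists.
    """
    n = len(y_true)
    # Observed agreement p_o is plain accuracy.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement p_e from the marginal label frequencies.
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    p_e = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical imbalanced test set: 80 "early" and 20 "late" samples.
y_true = ["early"] * 80 + ["late"] * 20
# Hypothetical predictions: 75 of 80 early correct, 6 of 20 late correct.
y_pred = ["early"] * 75 + ["late"] * 5 + ["late"] * 6 + ["early"] * 14

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={accuracy:.3f} kappa={kappa:.3f}")  # accuracy=0.810 kappa=0.286
```

Here a classifier that mostly predicts the majority class scores well on accuracy, while kappa stays low because much of that agreement would be expected by chance.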

Deep Learning Based Model for Breast Cancer Subtype Classification

Advancing Breast Cancer Subtype Prediction and Mutation Analysis: Integrating Deep Learning and Machine Learning Techniques in Genomic Research

Predicting Breast Cancer Gene Expression Signature by Applying Deep Convolutional Neural Networks From Unannotated Pathological Images

Automated Molecular Subtyping of Breast Carcinoma Using Deep Learning Techniques

A Deep Learning Model for Predicting Molecular Subtype of Breast Cancer by Fusing Multiple Sequences of DCE-MRI From Two Institutes

Deep Learning for identifying radiogenomic associations in breast cancer

Leveraging Deep Learning Techniques and Integrated Omics Data for Tailored Treatment of Breast Cancer

Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture

Deep learning-based classification of breast cancer molecular subtypes from H&E whole-slide images

Diagnosis of breast cancer molecular subtypes using machine learning models on unimodal and multimodal datasets

Deep Learning and Transfer Learning Identify Breast Cancer Survival Subtypes from Single-Cell Imaging Data

Deep Learning Techniques for Subtype Classification and Prognosis in Breast Cancer Genomics: A Systematic Review and Meta-Analysis

Performance Comparison of Deep Learning Autoencoders for Cancer Subtype Detection Using Multi-Omics Data

Comparative Evaluation of Machine Learning Models for Subtyping Triple-Negative Breast Cancer: A Deep Learning-Based Multi-Omics Data Integration Approach

Biomarker Gene Identification for Breast Cancer Classification

L. monocytogenes oligonucleotide probe

Enhancing Breast Cancer Prediction through Deep Learning and Comparative Analysis of Gene Expression and DNA Methylation Data using Convolutional Neural Networks

Dual-path convolutional neural network using micro-FTIR imaging to predict breast cancer subtypes and biomarkers levels: estrogen receptor, progesterone receptor, HER2 and Ki67

Breast Cancer Prediction Using Deep Learning and Machine Learning Techniques

Intelligent Deep Learning Framework for Breast Cancer Prediction using Feature Ensemble Learning

Identification of Luminal A breast cancer by using deep learning analysis based on multi-modal images