Abstract: Prediction of cancer stage plays an important role in planning the course of treatment, but it has relied largely on imaging tools that do not capture the molecular events driving cancer progression. Analyses based on gene-expression data can identify these events, which allows RNA-sequencing and microarray cancer data to be used for cancer analyses. Breast cancer is the most common cancer worldwide and is classified into four stages: stages 1, 2, 3, and 4 [2]. Machine learning models have previously been applied to stage classification with limited success, and multi-class stage classification in particular has seen little progress. Improved multi-class classification models are therefore needed, and deep learning models are one avenue to investigate. Gene-expression-based cancer data is characterised by small dataset sizes, class imbalance, and high dimensionality, so class balancing methods must be applied to the dataset. Since not all genes are necessary for stage prediction, retaining only the necessary genes can improve classification accuracy. In this work, breast cancer samples are classified into four classes corresponding to stages 1 to 4. Invasive ductal carcinoma (IDC) breast cancer samples are obtained from The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets and combined. Two class balancing techniques are explored: the synthetic minority oversampling technique (SMOTE), and SMOTE followed by random undersampling. A hybrid feature selection approach is proposed, and three pipelines combining filter and embedded feature selection methods are explored: Pipeline 1, minimum-redundancy maximum-relevancy (mRMR) and correlation feature selection (CFS); Pipeline 2, mRMR, mutual information (MI), and CFS; and Pipeline 3, mRMR and support vector machine-recursive feature elimination (SVM-RFE).
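The second class-balancing option described above, SMOTE followed by random undersampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the helper names `smote` and `balance`, the `target` class size, the choice of `k`, and the two-feature toy data are all assumptions made for illustration. In practice a library such as imbalanced-learn, which provides `SMOTE` and `RandomUnderSampler`, would normally be used.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the same class (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, target, k=3, seed=0):
    """Oversample classes smaller than `target` with SMOTE, then randomly
    undersample classes larger than `target`."""
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        if len(rows) < target:
            rows = rows + smote(rows, target - len(rows), k, rng)
        elif len(rows) > target:
            rows = rng.sample(rows, target)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out


# Hypothetical, imbalanced stage counts with two made-up "gene" features.
data_rng = random.Random(42)
counts = {1: 12, 2: 30, 3: 8, 4: 3}
X, y = [], []
for stage, n in counts.items():
    for _ in range(n):
        X.append([data_rng.gauss(stage, 0.5), data_rng.gauss(-stage, 0.5)])
        y.append(stage)

X_bal, y_bal = balance(X, y, target=12)
print(Counter(y_bal))
```

Because every synthetic point lies on a line segment between two real samples of the same class, oversampling adds no values outside the range already observed for that class. Undersampling then discards randomly chosen samples from the larger classes.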
The classification is performed using deep learning models, namely a deep neural network (DNN), a convolutional neural network, a recurrent neural network, a modified DNN, and an AutoKeras-generated model. Classification performance after class balancing and the various feature selection techniques shows marked improvement over classification prior to feature selection. The best multi-class classification was achieved by a DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving a Cohen's kappa score (CKS) of 0.303 and a classification accuracy of 53.1%. For binary classification into early- and late-stage cancer, the best performance was obtained by a modified DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving an accuracy of 81.0% and a CKS of 0.280. This pipeline also showed improved multi-class classification performance on neuroblastoma cancer data, with a best area under the receiver operating characteristic curve (auROC) score of 0.872, compared with 0.71 obtained in previous work, a relative improvement of 22.81%. The results and analysis reveal that feature selection techniques play a vital role in gene-expression data-based classification and that the proposed hybrid feature selection pipeline improves classification performance. Multi-class classification is possible using deep learning models, though further improvement is needed, particularly in late-stage classification.
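The results above report Cohen's kappa alongside accuracy. The sketch below shows how the two can diverge on imbalanced data. The function `cohen_kappa` and the early/late label counts are hypothetical and chosen only for illustration; they are not the study's data and do not reproduce its reported scores.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement (accuracy) and p_e is the agreement expected by chance."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical imbalanced binary task: 85 early-stage and 15 late-stage samples.
y_true = ["early"] * 85 + ["late"] * 15
# 78 early samples predicted correctly and 7 missed, then 5 late samples
# predicted correctly and 10 missed.
y_pred = ["early"] * 78 + ["late"] * 7 + ["late"] * 5 + ["early"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, kappa={kappa:.3f}")
```

In this toy case the accuracy is high mainly because the majority class is predicted well, while kappa stays low because most of that agreement would be expected by chance. This is one reason to report both metrics when classes are imbalanced.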

A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning

Pan-Cancer Drug Sensitivity Prediction from Gene Expression using Deep Learning

A systematic evaluation of deep learning methods for the prediction of drug synergy in cancer

Developing Anticancer Drug Response System Using Deep Learning System with Hybrid Genomic and Chemical Features

A comprehensive benchmarking of machine learning algorithms and dimensionality reduction methods for drug sensitivity prediction

Prediction of drug sensitivity based on multi-omics data using deep learning and similarity network fusion approaches

Predicting drug response of tumors from integrated genomic profiles by deep neural networks

Precision Anti-Cancer Drug Selection via Neural Ranking

Prediction of anti-cancer drug synergy based on cross-matching network and cancer molecular subtypes

Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture

Deep Learning for Cancer Type Classification and Driver Gene Identification

Deep learning-based multi-drug synergy prediction model for individually tailored anti-cancer therapies

Leveraging a Joint of Phenotypic and Genetic Features on Cancer Patient Subgrouping

Anti-cancer Drug Synergy Prediction in Understudied Tissues using Transfer Learning

Impact of Molecular Representations on Deep Learning Model Comparisons in Drug Response Predictions

Systematic Assessment of Analytical Methods for Drug Sensitivity Prediction from Cancer Cell Line Data

Pathway-Guided Deep Neural Network toward Interpretable and Predictive Modeling of Drug Sensitivity

A Deep Neural Network for Predicting Synergistic Drug Combinations on Cancer

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data

Learning Curves for Drug Response Prediction in Cancer Cell Lines