Deep Learning Based Model for Breast Cancer Subtype Classification

Sheetal Rajpal, Virendra Kumar, Manoj Agarwal, Naveen Kumar
DOI: https://doi.org/10.48550/arXiv.2111.03923
2021-11-10
Abstract: Breast cancer has long been a prominent cause of mortality among women. Diagnosis, therapy, and prognosis are now possible, thanks to the availability of RNA sequencing tools capable of recording gene expression data. Molecular subtyping being closely related to devising clinical strategy and prognosis, this paper focuses on the use of gene expression data for the classification of breast cancer into four subtypes, namely, Basal, Her2, LumA, and LumB. In stage 1, we suggested a deep learning-based model that uses an autoencoder to reduce dimensionality. The size of the feature set is reduced from 20,530 gene expression values to 500 by using an autoencoder. This encoded representation is passed to the deep neural network of the second stage for the classification of patients into four molecular subtypes of breast cancer. By deploying the combined network of stages 1 and 2, we have been able to attain a mean 10-fold test accuracy of 0.907 on the TCGA breast cancer dataset. The proposed framework is fairly robust throughout 10 different runs, as shown by the boxplot for classification accuracy. Compared to related work reported in the literature, we have achieved a competitive outcome. In conclusion, the proposed two-stage deep learning-based model is able to accurately classify four breast cancer subtypes, highlighting the autoencoder's capacity to deduce the compact representation and the neural network classifier's ability to correctly label breast cancer patients.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
This paper addresses the molecular subtyping of breast cancer from gene expression data. Specifically, the study develops a deep learning-based model that classifies breast cancer into four subtypes: Basal, Her2, LumA, and LumB. This classification matters for clinical strategy formulation and prognosis evaluation.

### Main Research Content:

1. **Background**:
   - Breast cancer is one of the leading causes of death among women, with the number of diagnosed cases surpassing that of lung cancer in 2020.
   - Gene expression data can capture the maximum variation among tumors, but its high dimensionality makes analysis challenging.
   - Molecular subtyping, which defines four subtypes (Basal, Her2, LumA, and LumB), has shown superior clinical and prognostic value.
2. **Methods**:
   - **First Stage**: An autoencoder performs dimensionality reduction, compressing 20,530 gene expression values into 500 features.
   - **Second Stage**: The reduced features are fed into a deep neural network that classifies the breast cancer subtype.
   - **Dataset**: The TCGA breast cancer dataset, which includes gene expression data from 1218 patients; 837 samples are ultimately used in the experiments.
   - **Preprocessing**: Z-score normalization is applied to the gene expression data, and the SMOTE technique is used to address class imbalance.
3. **Experimental Results**:
   - Under 10-fold cross-validation, the model's mean test accuracy is 0.907.
   - Classification performance is stable across different runs, with accuracy ranging from 0.879 to 0.939.
   - Compared to related work, the model's classification performance is competitive.
4. **Conclusion**:
   - The proposed two-stage deep learning model can effectively use gene expression data for breast cancer subtyping.
   - The autoencoder can extract compact feature representations, while the deep neural network can accurately classify breast cancer patients.
   - Future work plans to apply this model to the classification of other cancer types and to combine genomic and epigenomic data.
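
The preprocessing and evaluation steps summarized above (z-score normalization, SMOTE, and k-fold cross-validation) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names (`zscore_columns`, `smote_like`, `oversample`, `stratified_folds`) are hypothetical, the data is random toy data, and `smote_like` is a simplified version of the SMOTE idea (interpolating between a sample and one of its nearest same-class neighbours). The summary does not say whether the folds were stratified or whether oversampling was applied before or after splitting; the sketch assumes stratified folds and oversamples only the training portion, which is a common way to keep synthetic points out of the test fold.

```python
import math
import random


def zscore_columns(rows):
    """Standardize each column (gene) to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def smote_like(samples, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a sample and
    one of its k nearest same-class neighbours (needs at least 2 samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def oversample(X, y, k=3, seed=0):
    """Top up every class to the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        extra = smote_like(rows, target - len(rows), k, rng)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def stratified_folds(y, n_folds=10, seed=0):
    """Assign sample indices to folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds


# Toy data: 26 "patients" x 4 "genes" (the paper uses 837 samples x 20,530 genes).
rng = random.Random(42)
labels = ["Basal"] * 6 + ["Her2"] * 3 + ["LumA"] * 12 + ["LumB"] * 5
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in labels]

X_std = zscore_columns(X)
folds = stratified_folds(labels, n_folds=3)  # 3 folds only because the toy set is tiny
test_idx = set(folds[0])
train_idx = [i for i in range(len(labels)) if i not in test_idx]

# Balance only the training portion; the held-out fold stays untouched.
X_bal, y_bal = oversample([X_std[i] for i in train_idx],
                          [labels[i] for i in train_idx])
```

In a real pipeline, established implementations such as `SMOTE` from imbalanced-learn and `StratifiedKFold` from scikit-learn would normally replace these hand-written helpers, and the normalization statistics would typically be computed on the training folds only. The balanced training set would then be passed to the autoencoder and classifier of the two-stage model.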