Feature Selection Using Lasso Regression Enhances Deep Learning Model Performance For Diagnosis Of Lung Cancer from Transcriptomic Data

Souvik Guha
DOI: https://doi.org/10.1101/2024.05.01.592076
2024-05-04
Abstract: Cancer is a genetic disease in which gene mutations are pivotal to disease initiation and pathophysiology. Each cancer has a characteristic gene expression profile that can be used for early and accurate diagnosis. Microarray techniques have emerged as powerful tools capable of simultaneously capturing the expression profiles of thousands of genes. However, the high dimensionality of the resulting transcriptome data makes these datasets challenging to analyze. Recent advances in Artificial Intelligence (AI) techniques such as Machine Learning (ML) and Deep Learning can be instrumental in processing such high-dimensional datasets efficiently. LASSO regression is an ML technique that can rank features, which supports feature selection and thereby dimensionality reduction. Deep Learning is one of the most sophisticated ML techniques and can process high-dimensional data owing to the larger number of hidden layers in its neural networks. We designed a Deep Neural Network (DNN) classifier fused with a LASSO-based significant-feature extractor to classify a gene expression dataset of 51 samples, 24 from lung cancer patients and 27 from normal individuals. A LASSO regression model was implemented to identify the genes that played a significant role in the classification. The expression values of these significant genes were then fed into a convergent deep neural architecture. The classifier was trained on 70% of the data, and the remaining 30% was used for validation. The proposed classifier classified better than LASSO regression or the DNN used individually. Over thirty independent assessments, the two classes were classified with an average accuracy of 96.25%, average precision of 99.67%, average specificity of 99.45%, and average sensitivity of 91.73%. In some cases, the model reached a classification accuracy of 100%. This could open a path to earlier and better diagnosis of cancers from transcriptome data.
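The abstract reports accuracy, precision, specificity, and sensitivity but does not spell out how they were computed. The sketch below assumes the standard confusion-matrix definitions, with lung cancer taken as the positive class; the function name `binary_metrics` and the label encoding (1 = cancer, 0 = normal) are illustrative choices, not taken from the paper.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Compute the four reported metrics from true and predicted labels.

    Assumes the standard definitions; the paper's exact computation
    is not stated in the abstract.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # cancer called cancer
    tn = sum(t != positive and p != positive for t, p in pairs)  # normal called normal
    fp = sum(t != positive and p == positive for t, p in pairs)  # normal called cancer
    fn = sum(t == positive and p != positive for t, p in pairs)  # cancer called normal

    def ratio(num, den):
        # Return NaN instead of raising when a class is absent.
        return num / den if den else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the paper.
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1]))
```

Under these definitions, high precision and specificity combined with lower sensitivity (as reported) would correspond to few normal samples being called cancer and somewhat more cancer samples being missed.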
Bioinformatics
What problem does this paper attempt to address?
The paper aims to develop a method that combines Lasso regression feature selection with a deep learning classifier to improve the accuracy of diagnosing lung cancer from transcriptome data. Its main points are:

1. **Problem Background**: Cancer is a genetic disease in which gene mutations play a key role in the onset and development of the disease. Microarray technology can capture the expression profiles of thousands of genes simultaneously, but the high dimensionality of the resulting transcriptome data makes these datasets challenging to analyze.
2. **Research Method**: The authors designed a model that integrates Lasso regression feature selection with a deep neural network (DNN) classifier. Lasso regression identifies the genes that contribute significantly to classification, and the expression values of those genes are then fed into the deep neural network for final classification.
3. **Experimental Results**: The dataset contains 51 samples, 24 from lung cancer patients and 27 from healthy individuals, with 70% used for training and 30% for validation. Averaged over thirty independent assessments, the model achieved a classification accuracy of 96.25%, a precision of 99.67%, a specificity of 99.45%, and a sensitivity of 91.73%. In some runs it reached 100% classification accuracy.
4. **Significance and Application**: The combined model classified better than Lasso regression or the DNN used individually, which suggests that feature selection can improve deep learning performance on high-dimensional biomedical data and may offer a new approach to early diagnosis of lung cancer from transcriptome data.

In short, the paper uses Lasso-based feature selection to improve a deep learning model's accuracy in diagnosing lung cancer from transcriptome data.
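The two-stage method summarized above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (random values shaped like the 24 cancer / 27 normal split), and the penalty `alpha`, the hidden layer sizes, and the helper names `make_synthetic_expression` and `run_sketch` are hypothetical. The shrinking layer widths are one reading of "convergent" architecture. Fitting Lasso on the training split only is a design choice made here to avoid leaking validation data into feature selection; the abstract does not say how the paper handled this.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


def make_synthetic_expression(n_cancer=24, n_normal=27, n_genes=200,
                              n_informative=5, seed=0):
    """Synthetic stand-in for a microarray matrix (NOT the paper's data)."""
    rng = np.random.default_rng(seed)
    y = np.array([1] * n_cancer + [0] * n_normal)  # 1 = cancer, 0 = normal
    X = rng.normal(size=(y.size, n_genes))
    X[y == 1, :n_informative] += 2.0  # shift a few genes in cancer samples
    return X, y


def run_sketch(alpha=0.05, seed=0):
    X, y = make_synthetic_expression(seed=seed)

    # 70/30 split, as described in the abstract; stratification is an assumption.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # Standardize using training statistics only.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # Stage 1: Lasso drives many coefficients to exactly zero;
    # genes with nonzero coefficients are kept.
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    selected = np.flatnonzero(lasso.coef_)

    # Stage 2: a small feed-forward network on the selected genes.
    # Layer sizes are hypothetical.
    dnn = MLPClassifier(hidden_layer_sizes=(32, 16, 8), max_iter=2000,
                        random_state=seed)
    dnn.fit(X_train[:, selected], y_train)
    accuracy = dnn.score(X_test[:, selected], y_test)
    return selected, accuracy


if __name__ == "__main__":
    genes, acc = run_sketch()
    print(f"{genes.size} genes selected, validation accuracy {acc:.2f}")
```

Repeating `run_sketch` with different `seed` values and averaging the results would mirror the thirty independent assessments reported in the abstract, though numbers from synthetic data say nothing about the paper's results.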