Classifying Lung Adenocarcinoma and Squamous Cell Carcinoma Using RNA-Seq Data

Zhengyan Huang, Li Chen, Chi Wang
DOI: https://doi.org/10.17140/csmmoj-3-120
2017-01-01
Abstract: Background: Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the two primary subtypes of non-small cell lung carcinoma (NSCLC). Currently, the most widely used method to discriminate between LUAD and LUSC is hematoxylin-eosin (HE) staining. However, this method is sometimes unable to distinguish LUAD from LUSC precisely, so more accurate diagnostic approaches are highly desirable. Methods: We propose using gene expression profiles to discriminate between NSCLC subtypes. We leveraged RNA-Seq data from The Cancer Genome Atlas (TCGA) and randomly split the data into training and testing subsets. To construct classifiers from the training data, we considered three methods: logistic regression on principal components (PCR), logistic regression with LASSO shrinkage (LASSO), and k-nearest neighbors (KNN). Classifier performance was evaluated and compared on the testing data. Results: All gene expression-based classifiers showed high accuracy in discriminating LUSC from LUAD. The LASSO classifier had the smallest overall misclassification rate, 3.42% (95% CI: 3.25%-3.60%), when using 0.5 as the cutoff value for the predicted probability of belonging to a subtype, followed by the PCR classifier (4.36%, 95% CI: 4.23%-4.49%) and the KNN classifier (8.70%, 95% CI: 8.57%-8.83%). The LASSO classifier also had the highest average area under the receiver operating characteristic curve (AUC), 0.993, compared with PCR (0.987) and KNN (0.965). Conclusions: Our results suggest that mRNA expression levels are highly informative for classifying NSCLC subtypes and may potentially be used to assist clinical diagnosis.
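The two evaluation metrics reported in the abstract, the misclassification rate at a 0.5 probability cutoff and the AUC, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the label coding (1 = LUSC, 0 = LUAD), the choice to treat a probability exactly equal to the cutoff as the positive class, and the toy probabilities are all assumptions made for illustration. The AUC is computed here by pairwise comparison (the fraction of positive/negative pairs ranked correctly, ties counted as one half), which is one standard way to compute it.

```python
from typing import Sequence


def misclassification_rate(
    y_true: Sequence[int], prob: Sequence[float], cutoff: float = 0.5
) -> float:
    """Fraction of samples whose thresholded prediction differs from the label.

    Assumption: a predicted probability >= cutoff is assigned to class 1.
    """
    preds = [1 if p >= cutoff else 0 for p in prob]
    errors = sum(1 for y, y_hat in zip(y_true, preds) if y != y_hat)
    return errors / len(y_true)


def auc(y_true: Sequence[int], prob: Sequence[float]) -> float:
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample (ties count as one half)."""
    pos = [p for y, p in zip(y_true, prob) if y == 1]
    neg = [p for y, p in zip(y_true, prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy test-set labels (hypothetical coding: 1 = LUSC, 0 = LUAD) and
# made-up predicted probabilities of class 1 from some fitted classifier.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

print(misclassification_rate(y_true, prob))  # 0.25
print(auc(y_true, prob))                     # 0.9375
```

The misclassification rate depends on the chosen cutoff, whereas the AUC summarizes ranking quality across all cutoffs, which is presumably why the abstract reports both.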