Tumor origin identification through machine learning and gene expression profiling.
Xin Wang,Jun Li,Yanqing Zhou,Yaru Chen,Tianliang Liu,Shifu Chen
DOI: https://doi.org/10.1200/jco.2024.42.16_suppl.e13597
IF: 45.3
2024-05-31
Journal of Clinical Oncology
Abstract: e13597
Background: Accurate identification of tumor origin is crucial for effective diagnosis and treatment, particularly for metastatic tumors. Interpretable machine learning models have shown considerable potential for this problem when imaging and immunohistochemistry (IHC) examinations are ineffective. In this study, we developed a gene panel from the expression profiles of various tumors and built a machine learning model to identify tumor origin.
Methods: RNA sequencing (RNA-seq) data from 9462 tumor samples originating from 21 different organs were collected from The Cancer Genome Atlas (TCGA). Feature engineering by unsupervised clustering and differential gene expression analysis reduced a pool of over 60,000 identifiers to a refined panel of 164 genes. A logistic regression (LR) classifier was then trained on the 9462 samples using this 164-gene panel, with 10-fold cross-validation applied in multi-class mode. Two independent test sets were assembled from Gene Expression Omnibus (GEO) and other published studies: the Primary tumor set (PT, n=3420, covering 19 tumor types) and the Metastatic tumor set (MTP, n=100, all originating from the prostate and spread to bone, liver, etc.). All samples were FPKM-normalized, and there was no overlap between training and testing samples. Model performance was assessed by accuracy, specificity, and sensitivity.
Results: The 164-gene expression panel achieved a cross-validation accuracy of 96.73%. On the PT test set, the model achieved an overall accuracy of 92.31%, specificity of 99.62%, and sensitivity of 89.79%, comparable to similar models in other studies. On the MTP test set, the model correctly traced 91% of metastatic tumors to the prostate, substantially outperforming previous work.
Conclusions: Gene expression patterns reveal organ-specific characteristics that can be used to identify tumor origin. Combining a condensed yet comprehensive gene expression panel with a robust machine learning model is a promising tool for tumor diagnosis. Ongoing correlative studies aim to extend predictions from organ of origin to cancer subtypes.
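The abstract reports overall accuracy, specificity, and sensitivity for a multi-class classifier but does not say how the per-class values were aggregated. The sketch below shows one common convention, assumed here rather than taken from the study: each tumor origin is scored one-vs-rest and the per-class values are macro-averaged. The function name `one_vs_rest_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def one_vs_rest_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged one-vs-rest sensitivity and specificity.

    Assumption: macro averaging over classes; the abstract does not state
    which aggregation the authors used.
    """
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    tp_counts = Counter(t for t, p in pairs if t == p)

    sensitivities, specificities = [], []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = tp_counts[label]
        fn = true_counts[label] - tp   # true `label`, predicted elsewhere
        fp = pred_counts[label] - tp   # predicted `label`, truly elsewhere
        tn = n - tp - fn - fp
        if tp + fn > 0:                # skip classes with no true samples
            sensitivities.append(tp / (tp + fn))
        if tn + fp > 0:                # skip classes with no negatives
            specificities.append(tn / (tn + fp))

    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        "sensitivity": sum(sensitivities) / len(sensitivities),
        "specificity": sum(specificities) / len(specificities),
    }


# Toy labels, not data from the study.
y_true = ["lung", "lung", "liver", "liver", "prostate", "prostate"]
y_pred = ["lung", "liver", "liver", "liver", "prostate", "prostate"]
metrics = one_vs_rest_metrics(y_true, y_pred)
print(metrics)
```

Under this convention, specificity tends to sit well above sensitivity when there are many classes, because each class has far more negatives than positives. For a test set with a single true origin, such as a prostate-only metastatic set, sensitivity reduces to the fraction of samples assigned to that origin.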
oncology