Abstract: The NCI's The Cancer Genome Atlas (TCGA) project profiled more than 10,000 tumor samples over 10 years. As tissue-specific working groups reviewed the available data, these patient samples were separated into distinct molecular subtypes, and the resulting clusters were reported in various marker papers. While these assignments provided invaluable information about common patterns of molecular characteristics in different types of cancer, there was no consistent methodology for assigning new samples to the defined molecular subtypes. The NCI's Tumor Molecular Pathology group was formed to create machine learning-based models that could be applied to non-TCGA samples to determine their TCGA-mapped subtypes. Five modeling systems (JADBio; SKGrid by the Oregon Health and Science University; CloudForest by the Institute for Systems Biology; AKLIMATE by the University of California Santa Cruz; and subSCOPE by BC Cancer's Genome Sciences Centre) were trained to recognize TCGA subtypes using multi-omic measurements from gene expression, DNA methylation, miRNA expression, copy number, and somatic mutation calls. Although the TCGA samples were profiled using multi-omic technologies, single-platform and/or compact feature set models were also assessed for their ability to assign these classifications. Each machine learning system created predictive models for 106 subtypes from 26 cancer types using as few features as possible, with a maximum of 100 features allowed for scored models. In total, 411,706 models were developed, covering each of the learning methods across the various omic platforms. Top models, both multi-omic and single-platform, were selected for each cancer type. On average, models achieved an overall weighted F1 score of 0.895 with 42 features.
The top models for each cancer type had a mean overall weighted F1 score of 0.936 with a mean of 29 features; in 20 of the 26 cancer types, models using only gene expression provided the best performance. Analysis of the features selected by the models showed that some known onco-drivers were selected by many models, but different models often used features from different genes and reached similar levels of performance. Network-level analysis revealed that many of the genes behind these selected features operate within the same pathways. Transferability of the models to external datasets was tested by applying models trained on TCGA breast cancer data to the AURORA and METABRIC datasets. Despite the data platform difference between TCGA (RNAseq) and METABRIC (microarray), F1 values degraded only minimally in transfer. This set of models and the training dataset will provide new opportunities for researchers and translational scientists to connect new tumors to the subtypes seen in the TCGA cohorts.

Citation Format: Kyle Ellrott, Chris K. Wong, Christina Yau, Mauro A. Castro, Jordan Lee, Brian Karlberg, Jasleen K. Grewal, Vincenzo Lagani, Bahar Tercan, Verena Friedl, Toshinori Hinoue, Vladislav Uzunangelov, Lindsay Westlake, Xavier Loinaz, Ina Felau, Peggy Wang, Anab Kemal, Samantha J. Caesar-Johnson, Ilya Shmulevich, Alexander J. Lazar, Ioannis Tsamardinos, Katherine A. Hoadley, The Cancer Genome Atlas Analysis Network, Gordon A. Robertson, Theo A. Knijnenburg, Christopher C. Benz, Joshua M. Stuart, Jean C. Zenklusen, Andrew D. Cherniack, Peter W. Laird. Leveraging compact feature sets for TCGA-based molecular subtype classification on new samples [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 6548.
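
The abstract reports performance as an overall weighted F1 score but does not define it. The sketch below assumes the common support-weighted definition, in which each subtype's F1 is weighted by its share of the true labels; the actual scoring code used by the authors may differ. The function name and the toy subtype labels are illustrative only and are not taken from the study.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    support = Counter(y_true)  # number of true samples per subtype
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: samples of this subtype that were predicted as such.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each subtype's F1 by its fraction of the true labels.
        score += (n_true / total) * f1
    return score


# Made-up labels for six samples; not data from the abstract.
y_true = ["LumA", "LumA", "LumA", "LumB", "Basal", "Basal"]
y_pred = ["LumA", "LumA", "LumB", "LumB", "Basal", "Basal"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the per-subtype F1 values are 0.8 (LumA), 2/3 (LumB), and 1.0 (Basal), so a single misassigned sample in the largest class pulls the weighted score down more than the same error in a small class would.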

CAsubtype: an R Package to Identify Gene Sets Predictive of Cancer Subtypes and Clinical Outcomes

CancerSubtypes: an R/Bioconductor Package for Molecular Cancer Subtype Identification, Validation and Visualization

SubtypeDrug: a software package for prioritization of candidate cancer subtype-specific drugs

CBioProfiler: a web and standalone pipeline for cancer biomarker and subtype characterization

iSubGen generates integrative disease subtypes by pairwise similarity assessment

Kernel Fusion Method for Detecting Cancer Subtypes via Selecting Relevant Expression Data

GSCA: an integrated platform for gene set cancer analysis at genomic, pharmacogenomic and immunogenomic levels

DrGaP: a powerful tool for identifying driver genes and pathways in cancer sequencing studies

An Integrated Approach for Identifying Molecular Subtypes in Human Colon Cancer Using Gene Expression Data

GCclassifier: an R package for the prediction of molecular subtypes of gastric cancer

scCancer: a Package for Automated Processing of Single-Cell RNA-seq Data in Cancer

psSubpathway: a software package for flexible identification of phenotype-specific subpathways in cancer progression

scCancer2: data-driven in-depth annotations of the tumor microenvironment at single-level resolution

Abstract 6548: Leveraging compact feature sets for TCGA-based molecular subtype classification on new samples

Molecular Subtyping of Cancer Based on Distinguishing Co-Expression Modules and Machine Learning

Multi-Omics Data Fusion via a Joint Kernel Learning Model for Cancer Subtype Discovery and Essential Gene Identification

Subtype-GAN: a deep learning approach for integrative cancer subtyping of multi-omics data

Identification of Breast Cancer Subtypes by Integrating Genomic Analysis with the Immune Microenvironment

Identifying and Analyzing Different Cancer Subtypes Using RNA-seq Data of Blood Platelets

A Contrastive-Learning-Based Deep Neural Network for Cancer Subtyping by Integrating Multi-Omics Data

CEPICS: A Comparison and Evaluation Platform for Integration Methods in Cancer Subtyping