Abstract: The NCI's The Cancer Genome Atlas (TCGA) project profiled more than 10,000 tumor samples over the course of 10 years. As tissue-specific working groups reviewed the available data, these patient samples were separated into distinct molecular subtypes, and these clusters were reported in various marker papers. While these assignments provided invaluable information about the common patterns of molecular characteristics in different types of cancer, there was no consistent methodology for assigning new samples to these defined molecular subtypes. The NCI's Tumor Molecular Pathology group was formed to create machine learning-based models that could be applied to non-TCGA samples to determine their TCGA-mapped subtypes. Five modeling systems (JADBio; SKGrid, by Oregon Health and Science University; CloudForest, by the Institute for Systems Biology; AKLIMATE, by the University of California Santa Cruz; and subSCOPE, by BC Cancer's Genome Sciences Centre) were trained to recognize TCGA subtypes using multi-omic measurements from gene expression, DNA methylation, miRNA expression, copy number, and somatic mutation calls. Although the TCGA samples were profiled using multi-omic technologies, single-platform and/or compact feature set models were also assessed for their ability to assign these classifications. Each machine learning system created predictive models for 106 subtypes from 26 cancer types using as few features as possible, with a maximum of 100 features allowed for scored models. In total, 411,706 models were developed, spanning each of the learning methods across the various omic platforms. Top models, both multi-omic and single-platform, were selected for each cancer type. On average, models achieved an overall weighted F1 score of 0.895 with 42 features.
The top models for each cancer type had a mean overall weighted F1 score of 0.936 with a mean of 29 features, and in 20 of the 26 cancer types, models using only gene expression provided the best performance. Analysis of the features selected by the models showed that some known onco-drivers were selected by many models, but different models often used features from different genes while achieving similar levels of performance. Network-level analysis revealed that many of the genes behind these selected features operate within the same pathways. Transferability of these models to external datasets was tested by applying models trained on TCGA breast cancer data to the AURORA and METABRIC datasets. Despite the data platform difference between TCGA (RNAseq) and METABRIC (microarray), F1 values degraded only minimally in transfer. This set of models and the training dataset will provide new opportunities for researchers and translational scientists to connect new tumors to the subtypes seen in the TCGA cohorts.

Citation Format: Kyle Ellrott, Chris K. Wong, Christina Yau, Mauro A. Castro, Jordan Lee, Brian Karlberg, Jasleen K. Grewal, Vincenzo Lagani, Bahar Tercan, Verena Friedl, Toshinori Hinoue, Vladislav Uzunangelov, Lindsay Westlake, Xavier Loinaz, Ina Felau, Peggy Wang, Anab Kemal, Samantha J. Caesar-Johnson, Ilya Shmulevich, Alexander J. Lazar, Ioannis Tsamardinos, Katherine A. Hoadley, The Cancer Genome Atlas Analysis Network, Gordon A. Robertson, Theo A. Knijnenburg, Christopher C. Benz, Joshua M. Stuart, Jean C. Zenklusen, Andrew D. Cherniack, Peter W. Laird. Leveraging compact feature sets for TCGA-based molecular subtype classification on new samples [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 6548.
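
For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of a support-weighted F1 score and of choosing a top model under a feature cap. The abstract does not state how the scores were computed or how top models were chosen, so everything here is an assumption for illustration: the function names `weighted_f1` and `top_model`, the candidate dictionary keys, the subtype labels, and the tie-break in favor of fewer features are all hypothetical. Only the 100-feature maximum comes from the abstract.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by each class's true sample count."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    support = Counter(y_true)  # number of true samples per subtype
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def top_model(candidates, max_features=100):
    """Pick the best-scoring candidate within the feature cap.

    Assumption: ties on F1 are broken in favor of fewer features.
    """
    eligible = [c for c in candidates if c["n_features"] <= max_features]
    if not eligible:
        raise ValueError("no candidate satisfies the feature cap")
    return max(eligible, key=lambda c: (c["f1"], -c["n_features"]))


# Hypothetical subtype labels for six samples.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
score = weighted_f1(y_true, y_pred)

# Hypothetical candidate models for one cancer type.
candidates = [
    {"name": "m1", "f1": 0.93, "n_features": 120},  # exceeds the cap
    {"name": "m2", "f1": 0.91, "n_features": 29},
    {"name": "m3", "f1": 0.91, "n_features": 42},
]
best = top_model(candidates)
```

Weighting by class support keeps a rare subtype from dominating the score, which matters when subtype sizes within a cancer type are uneven. An unweighted (macro) average would be an equally plausible reading if the authors had not specified "weighted".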

An algorithm for classifying tumors based on genomic aberrations and selecting representative tumor models

A CNV Computational Model for Clonal Origin Analysis of Synchronous Multifocal Hepatobiliary and Pancreatic Tumors.

Cancer classification and pathway discovery using non-negative matrix factorization

Classification of cancers based on copy number variation landscapes

Network Based Stratification of Major Cancers by Integrating Somatic Mutation and Gene Expression Data.

Combining chromosomal arm status and significantly aberrant genomic locations reveals new cancer subtypes

Identification of Breast Cancer Subtypes by Integrating Genomic Analysis with the Immune Microenvironment

Identification of Genomic Aberrations in Cancer Subclones from Heterogeneous Tumor Samples

Generation of an Algorithm Based on Minimal Gene Sets to Clinically Subtype Triple Negative Breast Cancer Patients

Abstract 6548: Leveraging compact feature sets for TCGA-based molecular subtype classification on new samples

Phylogeny-based tumor subclone identification using a Bayesian feature allocation model

Multiclass cancer diagnosis using tumor gene expression signatures

Cancer classification based on multiple dimensions: SNV patterns

Spectral Clustering Using Nyström Approximation for the Accurate Identification of Cancer Molecular Subtypes.

Development and validation of a reliable DNA copy-number-based machine learning algorithm (CopyClust) for breast cancer integrative cluster classification

Abstract 2682: Integration of Multiple Prognostic Factors into the TNM Staging System Using a New Algorithm That Censors Patients

Abstract 5045: Genomics and pathology based deep learning to predict cancers of unknown primary

Genome-Wide Identification of Somatic Aberrations from Paired Normal-Tumor Samples

Subtype-GAN: a deep learning approach for integrative cancer subtyping of multi-omics data

A unified computational model for revealing and predicting subtle subtypes of cancers