Abstract: The Cancer Proteome Atlas (TCPA) project collects reverse-phase protein array (RPPA)-based proteome datasets from nearly 8000 samples across 32 cancer types. This study aims to investigate the pan-cancer proteome signature and identify cancer subtypes of glioma, kidney cancer, and lung cancer based on TCPA data. We first visualized the tumor clustering models using t-distributed stochastic neighbour embedding (t-SNE) and a biclustering heatmap. Then, three feature selection methods (pyHSICLasso, XGBoost, and Random Forest) were applied to select protein features for classifying cancer subtypes in the training dataset, and the LibSVM algorithm was employed to test classification accuracy in the validation dataset. Clustering analysis revealed that different kinds of tumors have relatively distinct proteomic profiles based on tissue of origin. We identified 20, 10, and 20 protein features with the highest accuracies in classifying subtypes of glioma, kidney cancer, and lung cancer, respectively. The predictive abilities of the selected proteins were confirmed by receiver operating characteristic (ROC) analysis. Finally, a Bayesian network was used to explore the protein biomarkers that have direct causal relationships with cancer subtypes. Overall, we highlight the theoretical and technical applications of machine learning-based feature selection approaches in the analysis of high-throughput biological data, particularly for cancer biomarker research.

Significance: Functional proteomics is a powerful approach for characterizing cell signaling pathways and understanding their phenotypic effects on cancer development. The TCPA database provides a platform to explore and analyze TCGA pan-cancer RPPA-based protein expression. With the advent of RPPA technology, the availability of high-throughput data on the TCPA platform has made it possible to use machine learning methods to identify protein biomarkers and further differentiate cancer subtypes based on proteomic data. In this study, we highlight the role of feature selection and Bayesian networks in discovering protein biomarkers for classifying cancer subtypes based on functional proteomic data. The application of machine learning methods to the analysis of high-throughput biological data, particularly in cancer biomarker research, has potential clinical value in developing individualized treatment strategies.
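
The ROC analysis mentioned in the abstract can be illustrated for a single candidate protein. The following is a minimal sketch, not the authors' implementation: the function name `roc_auc`, the expression values, and the subtype labels are all hypothetical and invented for illustration. It computes the area under the ROC curve through the rank-sum (Mann-Whitney) identity, that is, the fraction of positive/negative sample pairs in which the positive sample has the higher expression value.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for the positive subtype, 0 for the negative subtype.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical, made-up expression values of one protein in six samples
# (assumption for illustration only; not TCPA data).
expression = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
subtype = [1, 1, 1, 0, 0, 0]
auc = roc_auc(expression, subtype)
print(f"AUC = {auc:.3f}")
```

An AUC near 1 would suggest that the protein alone separates the two subtypes well, while a value near 0.5 would suggest no discriminative power. In practice a library routine would normally be preferred over this quadratic-time loop.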

Combining Gene Essentiality with Feature Selection Method to Explore Multi-Cancer Biomarkers

A Hybrid Feature Selection Algorithm and Its Application in Bioinformatics

Identification of Pan-Cancer Biomarkers Based on the Gene Expression Profiles of Cancer Cell Lines

Novel Model for Comprehensive Assessment of Robust Prognostic Gene Signature in Ovarian Cancer Across Different Independent Datasets

Identification of Pan-Cancer Prognostic Biomarkers Through Integration of Multi-Omics Data

Using feature selection and Bayesian network identify cancer subtypes based on proteomic data

Robust Biomarker Discovery for Hepatocellular Carcinoma from High-Throughput Data by Multiple Feature Selection Methods

Identifying Diagnostic Biomarkers of Breast Cancer Based on Gene Expression Data and Ensemble Feature Selection

FS–GBDT: identification multicancer-risk module via a feature selection algorithm by integrating Fisher score and GBDT

Functional and Embedding Feature Analysis for Pan-Cancer Classification

Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression.

A novel multi-stage feature selection method for microarray expression data analysis.

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data

A Robust Fuzzy Rule Based Integrative Feature Selection Strategy for Gene Expression Data in TCGA

Novel Hybrid Method for Gene Selection and Cancer Prediction

Incorporating gene co-expression network in identification of cancer prognosis markers

Investigating Multi-Cancer Biomarkers And Their Cross-Predictability In The Expression Profiles Of Multiple Cancer Types

An Integrated Feature Selection Algorithm for Cancer Classification using Gene Expression Data.

Extracting Multi-Function Features for Cancer Genes

A feature extraction framework for discovering pan‐cancer driver genes based on multi‐omics data