Abstract: The Cancer Proteome Atlas (TCPA) project collects reverse-phase protein array (RPPA)-based proteome datasets from nearly 8000 samples across 32 cancer types. This study aims to investigate the pan-cancer proteome signature and identify cancer subtypes of glioma, kidney cancer, and lung cancer based on TCPA data. We first visualized the tumor clustering models using t-distributed stochastic neighbour embedding (t-SNE) and biclustering heatmaps. Then, three feature selection methods (pyHSICLasso, XGBoost, and Random Forest) were applied to select protein features for classifying cancer subtypes in the training dataset, and the LibSVM algorithm was employed to test classification accuracy in the validation dataset. Clustering analysis revealed that different kinds of tumors have relatively distinct proteomic profiles based on tissue of origin. We identified 20, 10, and 20 protein features with the highest accuracies in classifying subtypes of glioma, kidney cancer, and lung cancer, respectively. The predictive abilities of the selected proteins were confirmed by receiver operating characteristic (ROC) analysis. Finally, a Bayesian network was used to explore the protein biomarkers that have direct causal relationships with cancer subtypes. Overall, we highlight the theoretical and technical applications of machine learning-based feature selection approaches in the analysis of high-throughput biological data, particularly for cancer biomarker research.

Significance: Functional proteomics is a powerful approach for characterizing cell signaling pathways and understanding their phenotypic effects on cancer development. The TCPA database provides a platform to explore and analyze TCGA pan-cancer RPPA-based protein expression. With the advent of RPPA technology, the availability of high-throughput data on the TCPA platform has made it possible to use machine learning methods to identify protein biomarkers and further differentiate cancer subtypes based on proteomic data. In this study, we highlight the role of feature selection and Bayesian networks in discovering protein biomarkers for classifying cancer subtypes based on functional proteomic data. The application of machine learning methods to the analysis of high-throughput biological data, particularly in cancer biomarker research, has potential clinical value for developing individualized treatment strategies.
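The workflow summarized above (select features on the training dataset only, then measure classification accuracy on a held-out validation dataset) can be illustrated with a minimal, standard-library-only sketch. This is not the study's implementation: the synthetic data stand in for an RPPA matrix, the variance-ratio ranking is a simple substitute for pyHSICLasso, XGBoost, or Random Forest, and the nearest-centroid classifier is a substitute for LibSVM. All function names, the value `k=5`, and the 70/30 split are assumptions made for illustration.

```python
import random
from statistics import mean, pvariance


def make_toy_data(n_per_class=60, n_features=30, n_informative=5, seed=0):
    """Synthetic stand-in for an RPPA matrix: rows are samples, columns are proteins."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1, 2):  # three hypothetical subtypes
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            # Only the first n_informative columns carry a subtype signal.
            for j in range(n_informative):
                row[j] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y


def split_indices(n, train_fraction=0.7, seed=1):
    """Shuffle sample indices and cut them into training and validation parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * n)
    return idx[:cut], idx[cut:]


def group_by_label(values, labels):
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    return groups


def separation_score(column, labels):
    """Variance of the class means divided by the mean within-class variance."""
    groups = group_by_label(column, labels)
    between = pvariance([mean(g) for g in groups.values()])
    within = mean([pvariance(g) for g in groups.values()])
    return between / (within + 1e-12)


def select_top_k(X, y, k):
    """Rank features by separation score and return the indices of the top k."""
    scores = [separation_score([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def fit_centroids(X, y, features):
    """Per-class mean vector over the selected features."""
    groups = group_by_label([[row[j] for j in features] for row in X], y)
    return {lab: [mean(col) for col in zip(*rows)] for lab, rows in groups.items()}


def predict(centroids, row, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    x = [row[j] for j in features]
    return min(centroids, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


X, y = make_toy_data()
train_idx, val_idx = split_indices(len(X))
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_val, y_val = [X[i] for i in val_idx], [y[i] for i in val_idx]

# Feature selection sees the training data only; validation data is used once, for scoring.
selected = select_top_k(X_train, y_train, k=5)
centroids = fit_centroids(X_train, y_train, selected)
predictions = [predict(centroids, row, selected) for row in X_val]
val_accuracy = mean([int(p == t) for p, t in zip(predictions, y_val)])

print("selected features:", sorted(selected))
print("validation accuracy:", round(val_accuracy, 3))
```

Keeping the selection step inside the training split matters because ranking features on the full dataset would leak validation information into the chosen feature set and inflate the reported accuracy. Repeating the run over several values of `k` is one plausible way to arrive at feature-set sizes such as the 10 and 20 proteins reported above, although the abstract does not describe the exact procedure.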

Selection of Feature Genes in Cancer Classification

Cancer Subtype Recognition and Feature Selection with Gene Expression Profiles

Multiclass Cancer Classification by Using Fuzzy Support Vector Machine and Binary Decision Tree with Gene Selection

Parameters Selection in Gene Selection Using Gaussian Kernel Support Vector Machines by Genetic Algorithm

A Feature Selection Method for Colon Tumor Based on Gene Expression Profiles

Feature (gene) Selection in Gene Expression-Based Tumor Classification

On Gene Selection and Classification for Cancer Microarray Data Using Multi-Step Clustering and Sparse Representation

Gene Selection for Cancer Clustering Analysis Based on Expression Data

Using feature selection and Bayesian network identify cancer subtypes based on proteomic data

A Hybrid Gene Selection Method for Cancer Classification Based on Clustering Algorithm and Euclidean Distance

An Ensemble Correlation-Based Gene Selection Algorithm for Cancer Classification with Gene Expression Data

The Classification of Tumor Using Gene Expression Profile Based on Support Vector Machines and Factor Analysis.

Gene Selection Using Genetic Algorithm and Support Vectors Machines

Gene Selection for Leukemia Subtype Classification from Gene Expression Profile

Feature selection for cancer classification based on support vector machine

A Novel Hybrid Method of Gene Selection and Its Application on Tumor Classification

Gene Selection and Cancer Classification Using A Fuzzy Neural Network

Model-free Gene Selection Using Genetic Algorithms

Gene Selection for Cancer Classification using Support Vector Machines

FEATURE SELECTION FOR CLUSTERING DISEASE SAMPLES BASED ON GENE ONTOLOGY

Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles