ESTIMATING THE NUMBER OF CANCER SUBTYPES FROM WHOLE-GENOME EXPRESSION DATA VIA A PENALIZED PROBABILISTIC PRINCIPAL COMPONENT ANALYSIS

Wei Q. Deng, Radu V. Craiu
2019-01-01
Abstract: Large expression datasets from whole-genome microarray experiments have revolutionized the categorization of cancer tumours into molecular subtypes, which have been shown to link better with disease outcomes and treatment responses. However, the number of molecular subtypes in a given sample of patients remains an unresolved problem. Here we estimate the unknown number of subtypes through the effective dimension of data from cancer patients with a large number of gene expression features. The sample covariance is decomposed to establish a low-rank approximation that helps uncover the hidden structure. We develop a penalized profile likelihood criterion embedded in a probabilistic principal component analysis to estimate the rank of the decomposition. The choice of the penalty parameter is guided by a data-driven procedure that is justified via analytical derivations and extensive finite-sample simulations. The proposed penalized approach is illustrated with three gene expression datasets, for breast, colorectal, and ovarian cancer. In these analyses, the numbers of molecular subtypes were estimated from whole-genome expression measurements without gene feature selection. The results point towards hidden structures, e.g. additional subgroups, that could be of scientific interest in advancing towards personalized medicine.
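The rank-selection idea in the abstract can be illustrated with a minimal sketch. It assumes the standard maximized log-likelihood of probabilistic PCA (noise variance estimated as the mean of the discarded eigenvalues) and uses a BIC-style penalty purely as a stand-in. The abstract does not state the authors' penalty or their data-driven choice of the penalty parameter, so this is not their criterion. The function names `ppca_profile_loglik` and `estimate_rank` are hypothetical.

```python
import numpy as np


def ppca_profile_loglik(eigvals, n, k):
    """Maximized PPCA log-likelihood for rank k.

    eigvals: eigenvalues of the sample covariance, sorted in descending order.
    n: number of samples.
    """
    d = eigvals.size
    sigma2 = eigvals[k:].mean()  # MLE of the isotropic noise variance
    return -0.5 * n * (
        d * np.log(2 * np.pi)
        + np.log(eigvals[:k]).sum()
        + (d - k) * np.log(sigma2)
        + d
    )


def estimate_rank(X, k_max):
    """Pick the rank in 0..k_max that maximizes a penalized profile likelihood.

    ASSUMPTION: the penalty is BIC-style (0.5 * n_params * log n); it is a
    placeholder, not the penalty proposed in the paper.
    """
    n, d = X.shape
    if k_max >= min(n - 1, d):
        raise ValueError("k_max must be smaller than min(n - 1, d)")
    Xc = X - X.mean(axis=0)
    # Eigenvalues of the sample covariance from the singular values of Xc.
    s = np.linalg.svd(Xc, compute_uv=False)
    eigvals = np.zeros(d)
    eigvals[: s.size] = s**2 / n
    scores = []
    for k in range(k_max + 1):
        # Free parameters of a rank-k PPCA covariance model.
        n_params = d * k - k * (k - 1) / 2 + 1
        penalty = 0.5 * n_params * np.log(n)
        scores.append(ppca_profile_loglik(eigvals, n, k) - penalty)
    return int(np.argmax(scores))
```

Without a penalty the profile likelihood never decreases as the rank grows, which is why some penalty term is needed to obtain a finite estimate. In a whole-genome setting, where the number of features far exceeds the number of samples, a BIC-style penalty may behave poorly, so the penalty here should be read only as a placeholder.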