A Self-Training Subspace Clustering Algorithm under Low-Rank Representation for Cancer Classification on Gene Expression Data

Chun-Qiu Xia, Ke Han, Yong Qi, Yang Zhang, Dong-Jun Yu
DOI: https://doi.org/10.1109/TCBB.2017.2712607
2018-07-01
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract: Accurate identification of cancer types is essential to cancer diagnosis and treatment. Because cancer tissue and normal tissue differ in gene expression, gene expression data can serve as an effective feature source for cancer classification. However, accurate cancer classification directly from the original gene expression profiles remains challenging because of the intrinsically high dimensionality of the features and the small number of samples. We propose a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher, respectively, than those of the best comparison method. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrates a new, sensitive approach to classifying cancers from large-scale gene expression data.
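The LRR step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' SSC-LRR implementation: it covers only the noise-free LRR problem (minimize the nuclear norm of Z subject to X = XZ), whose minimizer has the known closed form Z = V V^T, where V holds the right singular vectors of X. The synthetic data, the function names `lrr_noiseless` and `make_toy_data`, and all sizes are assumptions made for illustration; the self-training clustering stage is not shown.

```python
import numpy as np


def lrr_noiseless(X, tol=1e-8):
    """Closed-form solution of min ||Z||_* s.t. X = X Z (noise-free LRR).

    X has shape (n_features, n_samples); each column is one sample.
    Returns the (n_samples, n_samples) representation matrix Z = V V^T.
    """
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))  # numerical rank of X
    V = Vt[:rank].T                     # right singular vectors, one column each
    return V @ V.T


def make_toy_data(seed=0, n_genes=200, dim=3, per_class=15):
    """Hypothetical stand-in for expression data: two classes, each drawn
    from its own random low-dimensional subspace of gene space."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(2):
        basis = rng.standard_normal((n_genes, dim))
        coeffs = rng.standard_normal((dim, per_class))
        blocks.append(basis @ coeffs)
    X = np.hstack(blocks)               # samples of class 0, then class 1
    labels = np.repeat([0, 1], per_class)
    return X, labels


X, labels = make_toy_data()
Z = lrr_noiseless(X)

# Symmetric affinity that a subspace clustering step could consume.
affinity = np.abs(Z) + np.abs(Z.T)

# Largest affinity between samples of different classes.
cross_class = affinity[labels == 0][:, labels == 1].max()
```

In this idealized setting the two classes lie in independent subspaces, so `cross_class` is zero up to floating-point error and the affinity matrix is block diagonal. Real expression profiles are noisy, which is why practical LRR formulations add an error term and are solved iteratively rather than in closed form.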