Highly accurate two-gene signature for gastric cancer

Zhi Yan,Weitian Xu,Yimin Xiong,Yi Cheng,Hualin Xu,Zhigang Wang,Guorong Zheng
DOI: https://doi.org/10.1007/s12032-013-0584-x
2013-01-01
Medical Oncology
Abstract:Large amount of expression data were generated by high-throughput experimental techniques such as microarray. Single algorithm cannot be widely accepted as suitable method for mining of gene expression data. Therefore, integration of different algorithms and extraction of more useful information from the expression data are the key problems for identification of biomarkers. Here, we used three machine learning algorithms to select feature genes based on gene profiling data of gastric cancer (GC). Then, a common divisor was extracted as candidate feature genes aggregation for Tree Building and Tree Pruning analysis by Decision Tree (DT) algorithm. Real-time quantitative PCR and immunohistochemistry (IHC) staining were used to validate the relative expression levels of the candidate feature genes. Receiver operating characteristic curves were used to analyse the classification sensitivity and specificity of the feature genes. A total of 174, 202, 149 feature genes were selected by Class Information Index, Information Gain Index and Relief algorithms, with a common divisor consisting of 32 genes. Using a DT algorithm to contribute to the classification rule sets, we identified COL2A1 and ATP4B as candidate biomarkers of GC. The expression levels of these two genes were validated by real-time PCR and IHC with high sensitivity (>90 %) and specificity (>90 %) in both training and test samples. We first introduced an integral and systematic data-mining model for identification of biomarkers based on gene expression data. The two-gene signature obtained by our predictive model could be used for recognizing the biological characteristic of GC.
What problem does this paper attempt to address?