Pattern Recognition in Mining High-Throughput Genomics/Proteomics Data: The New Challenges in Old Questions

Xuegong Zhang
DOI: https://doi.org/10.1109/ICCTA.2007.103
2007-01-01
Abstract: Summary form only given. Current molecular biology and systems biology are characterized by the rapid accumulation of high-throughput genomics and proteomics data, such as microarray and mass spectrometry (MS) data. Through our study of microarray and MS data, we have observed that the cancer classification and gene/biomarker selection task has many unique characteristics that distinguish it from other standard pattern recognition tasks. Because the sample size is extremely small, reliable assessment of classification accuracy becomes a major question. For gene/biomarker selection, a key question is the significance of the selected genes/markers. We studied these questions with both simulated and real microarray and MS data. We developed a perturbation-based method for estimating the distribution of error rates of a support vector machine classifier. To evaluate the statistical significance of gene lists selected by sophisticated machine learning methods, we defined the problem of rank significance of genes and developed a heuristic strategy for estimating this significance. These questions highlight two important aspects of pattern recognition problems in high-throughput computational molecular biology. Awareness of such questions is key to properly applying computational methods to practical data and to developing new methods that truly target the scientific questions.
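To make the small-sample concern concrete, here is a minimal sketch of how an error rate can be treated as a distribution rather than a single number. It is not the paper's method: the abstract does not describe the perturbation scheme, so this sketch assumes repeated stratified random train/test splits as the perturbation, swaps in a nearest-centroid classifier for the support vector machine to stay dependency-free, and uses made-up Gaussian data. All function names and parameters (`make_toy_data`, `error_rate_distribution`, `n_repeats`, `test_fraction`) are hypothetical.

```python
import random
import statistics


def make_toy_data(n_per_class=10, n_features=50, n_informative=5, shift=1.0, seed=0):
    """Simulate a tiny two-class data set (hypothetical stand-in for expression data)."""
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                # Only the first n_informative features differ between classes.
                for j in range(n_informative):
                    x[j] += shift
            data.append((x, label))
    return data


def nearest_centroid_error(train, test):
    """Fit per-class centroids on train and return the error rate on test."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]

    def predict(x):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    wrong = sum(predict(x) != y for x, y in test)
    return wrong / len(test)


def error_rate_distribution(data, n_repeats=200, test_fraction=0.3, seed=0):
    """Collect test error rates over repeated stratified random splits."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))

    errors = []
    for _ in range(n_repeats):
        train, test = [], []
        for items in by_class.values():
            shuffled = items[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        errors.append(nearest_centroid_error(train, test))
    return errors


data = make_toy_data()
errors = error_rate_distribution(data, n_repeats=200)
print("mean error:", statistics.fmean(errors))
print("std of error:", statistics.pstdev(errors))
```

With only a few test samples per split, each individual error rate can take just a handful of values, so the spread across splits is often wide. This illustrates why a single accuracy figure can be unreliable at this sample size.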