Computational Analysis of Leukemia Microarray Expression Data Using the GA/KNN Method

Leping Li, Lee G. Pedersen, Thomas A. Darden, Clarice R. Weinberg
DOI: https://doi.org/10.1007/978-1-4615-0873-1_7
2002-01-01
Abstract: We recently developed a multivariate method that selects a subset of discriminative genes for sample classification based on gene expression data. The method combines a search tool, a genetic algorithm (GA), with a non-parametric pattern recognition method, k-nearest neighbors (KNN). We begin by using a training set to select many subsets of genes that can discriminate among classes of samples. The genes are then ranked by how frequently they are selected, and the top-ranked genes (e.g., 50) are used to classify test set samples. For a widely available set of leukemia data, the top 50 genes identified by the GA/KNN method not only correctly classified 33 of the 34 test set samples, but also discovered the two distinct clinical subtypes within ALL without applying prior knowledge. The method has been successfully applied to several expression data sets. It may be used to identify a subset of informative genes (biomarkers) for sample classification in a variety of profiling studies, including studies of tumors.
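The select, rank, and classify steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the data are synthetic (two classes labelled "A" and "B", 30 genes, with genes 0 and 1 made informative by construction), the GA uses mutation and truncation selection only, fitness is taken to be leave-one-out KNN accuracy with a majority vote, and every function name and parameter value (`ga_run`, `rank_genes`, `subset_size=3`, `runs=20`, and so on) is an assumption chosen for the example.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, genes, k=3):
    """Majority vote among the k nearest training samples, using only `genes`."""
    dists = sorted(
        (sum((row[g] - x[g]) ** 2 for g in genes), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, genes, k=3):
    """Leave-one-out KNN accuracy on the training set (assumed GA fitness)."""
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], genes, k) == y[i]
    return hits / len(X)


def ga_run(X, y, n_genes, subset_size, rng, pop_size=20, generations=15, target=1.0):
    """One GA run: evolve gene subsets until one discriminates the classes."""
    def fitness(subset):
        return loo_accuracy(X, y, subset)

    pop = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) >= target:
            break  # a sufficiently discriminative subset was found
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = list(parent)
            # Mutation: replace one gene with a gene not already in the subset.
            pos = rng.randrange(subset_size)
            child[pos] = rng.choice([g for g in range(n_genes) if g not in child])
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)


def rank_genes(X, y, n_genes, rng, runs=20, subset_size=3, top_n=5):
    """Collect one subset per GA run and rank genes by selection frequency."""
    counts = Counter()
    for _ in range(runs):
        counts.update(ga_run(X, y, n_genes, subset_size, rng))
    return [gene for gene, _ in counts.most_common(top_n)]


def make_data(rng, n_per_class, n_genes=30):
    """Synthetic expression matrix: only genes 0 and 1 differ between classes."""
    X, y = [], []
    for label, shift in (("A", 0.0), ("B", 4.0)):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[0] = rng.gauss(shift, 0.5)
            row[1] = rng.gauss(shift, 0.5)
            X.append(row)
            y.append(label)
    return X, y


rng = random.Random(0)
train_X, train_y = make_data(rng, 10)
test_X, test_y = make_data(rng, 5)

# Rank genes on the training set, then classify the test set with the top genes.
top = rank_genes(train_X, train_y, n_genes=30, rng=rng)
preds = [knn_predict(train_X, train_y, x, top) for x in test_X]
accuracy = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
print("top-ranked genes:", top)
print("test accuracy:", accuracy)
```

Ranking by frequency across many independent runs, rather than keeping the single best subset, is what makes the gene list stable: an informative gene tends to reappear in run after run, while an uninformative gene that rides along in one subset rarely recurs.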