Abstract: Cancer classification plays an important role in cancer treatment, yet no general approach to the problem exists. The task has two aspects: identifying new cancer classes and assigning tumors to known classes, which Golub et al. [1] call class discovery and class prediction. From a mathematical point of view, class discovery is a cluster analysis problem, while class prediction is usually called a classification problem (we use the latter name for consistency with the pattern recognition literature). Until now, cancer classification has been based primarily on the morphological appearance of the tumor [1], which has serious limitations because of ambiguity. Golub et al. [1] presented a new approach to cancer classification based on gene expression monitoring by DNA microarrays. They chose acute leukemia as a test case, with the goal of distinguishing ALL (acute lymphoblastic leukemia) from AML (acute myeloid leukemia), a typical cancer classification problem that has not been well solved despite many years of effort. This paper reports our work on the classification (prediction) part of this problem, following their original work.
Golub et al. applied a feature selection (gene selection) procedure before classification. A metric was defined to evaluate the correlation of each gene with the class distinction. After some "good" genes were selected from the full set of 6817 genes, classification was done by a weighted voting scheme. The classifier was trained on a 38-sample training set, and a separate 34-sample set was used for testing. In leave-one-out cross-validation on the training set with 50 selected genes, 36 of the 38 samples were correctly classified and 2 were rejected (no-call). On the test set, 29 of the 34 samples were correctly classified and the other 5 were rejected. If the classifier were compelled to give a prediction for these 5 no-calls, the predictions would be wrong.
Since the feature selection procedure is of the single-selection type, and the classification method is an intuitive one, we believe that there is still much room for improving performance. In our approach, we used all the genes for classification (the selection problem will be discussed in another paper) and applied the support vector machine (SVM) method and one of its improved versions, CSVM, as the classifier. Thanks to the better generalization ability of SVM and CSVM, much better performance was obtained.
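
The baseline procedure summarized above (score each gene, keep the best ones, let them vote, reject weak calls, evaluate by leave-one-out cross-validation) can be sketched roughly as follows. This is only an illustrative sketch, not the implementation used in [1] or in this paper. The abstract does not give the formulas, so the signal-to-noise score, the midpoint decision boundary and the 0.3 rejection threshold are taken from the description in Golub et al. [1] as we understand it and should be read as assumptions. Selecting genes by absolute score is a simplification, all function names are hypothetical, and the six-sample data set is synthetic.

```python
from statistics import mean, stdev

def train_weighted_voting(samples, labels, n_genes):
    """Score every gene and keep the n_genes with the largest |score|.

    samples: list of expression vectors, labels: 0 or 1 per sample.
    Returns a list of (gene_index, score, boundary) tuples.
    """
    class0 = [s for s, y in zip(samples, labels) if y == 0]
    class1 = [s for s, y in zip(samples, labels) if y == 1]
    scored = []
    for g in range(len(samples[0])):
        x0 = [s[g] for s in class0]
        x1 = [s[g] for s in class1]
        mu0, mu1 = mean(x0), mean(x1)
        spread = stdev(x0) + stdev(x1)
        if spread == 0:
            continue  # a constant gene carries no information
        # Assumed signal-to-noise score: positive means "higher in class 0".
        score = (mu0 - mu1) / spread
        scored.append((abs(score), g, score, (mu0 + mu1) / 2))
    scored.sort(reverse=True)
    return [(g, score, boundary) for _, g, score, boundary in scored[:n_genes]]

def predict(model, x, threshold=0.3):
    """Return 0 or 1, or None (no-call) when the vote margin is too small."""
    votes0 = votes1 = 0.0
    for g, score, boundary in model:
        vote = score * (x[g] - boundary)
        if vote > 0:
            votes0 += vote
        else:
            votes1 -= vote
    total = votes0 + votes1
    if total == 0:
        return None
    strength = abs(votes0 - votes1) / total  # prediction strength in [0, 1]
    if strength < threshold:
        return None
    return 0 if votes0 > votes1 else 1

def leave_one_out(samples, labels, n_genes):
    """Hold out each sample in turn; gene selection is redone in every fold."""
    correct = rejected = 0
    for i in range(len(samples)):
        model = train_weighted_voting(
            samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:], n_genes
        )
        pred = predict(model, samples[i])
        if pred is None:
            rejected += 1
        elif pred == labels[i]:
            correct += 1
    return correct, rejected

# Synthetic toy data (not from the leukemia study): gene 0 is high in class 0,
# gene 1 is high in class 1, gene 2 is uninformative.
TOY_SAMPLES = [
    [5.0, 1.0, 3.0], [6.0, 1.2, 2.9], [5.5, 0.8, 3.1],
    [1.0, 4.0, 3.0], [1.5, 4.5, 3.2], [0.8, 3.8, 2.8],
]
TOY_LABELS = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print(leave_one_out(TOY_SAMPLES, TOY_LABELS, n_genes=2))
```

Redoing the gene selection inside every cross-validation fold keeps the held-out sample from influencing which genes are chosen. A sample whose selected genes disagree produces a low prediction strength and is returned as a no-call rather than forced into a class.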

Prokaryote gene data classifier design based on SVM

Gene Recognition Based on Kernel Least Squares SVM.

Classification Method Based on SVM for Human Gene Sequences

SVM Classification of Human Intergenic and Gene Sequences.

Parameters Selection in Gene Selection Using Gaussian Kernel Support Vector Machines by Genetic Algorithm

Prediction of nucleic acid-binding proteins using support vector machines

Gene Expression Data Classification Using SVM-KNN Classifier

Support Vector Machine For Prediction Of Horizontal Gene Transfers In Bacteria Genomes

Identifying Translation Initiation Sites in Prokaryotes Using Support Vector Machine

Multiclass Cancer Classification by Using Fuzzy Support Vector Machine and Binary Decision Tree with Gene Selection

Prediction of protein structure class by coupling improved genetic algorithm and support vector machine

Classifier assessment and feature selection for recognizing short coding sequences of human genes.

Conserved Codon Composition of Ribosomal Protein Coding Genes in Escherichia Coli, Mycobacterium Tuberculosis and Saccharomyces Cerevisiae: Lessons from Supervised Machine Learning in Functional Genomics

Support Vector Machine Classifications For Microarray Expression Data Set

Gene Selection Using Genetic Algorithm and Support Vectors Machines

Research on bioinformatics data classification method based on support vector machine

Gene Selection and Sample Classification Using a Genetic Algorithm and k-Nearest Neighbor Method

ALL/AML Cancer Classification by Gene Expression Data Using SVM and CSVM Approach

Gene Selection for Cancer Classification using Support Vector Machines

SVM-Based Approach for Predicting DNA-Binding Residues in Proteins from Amino Acid Sequences