Abstract:

PURPOSE: The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer-aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to one trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performance of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and training sample size. Understanding these relationships will facilitate the development of effective CAD systems under the constraint of limited available samples.

METHODS: Three feature selection techniques, stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve, Az. The mean Az values obtained by the resubstitution and hold-out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was chosen to be 50, 100, and 200.

RESULTS: The relative performance of the different combinations of classifier and feature selection method was found to depend on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and the SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold-out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for the SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.

CONCLUSIONS: None of the investigated feature selection-classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method, while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.
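
The resubstitution versus hold-out comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' simulation: it assumes two Gaussian classes with identity covariance whose means differ in one feature only, uses 5 features rather than 50 to 200, skips feature selection entirely, and adds a tiny ridge term purely as a numerical safeguard. The function names and the parameters `delta`, `ridge`, `n_test`, and `n_trials` are all hypothetical choices made for illustration.

```python
import numpy as np


def draw_classes(rng, n_per_class, dim, delta):
    """Draw two Gaussian classes with identity covariance (assumption).

    The class means differ by `delta` in the first feature only.
    """
    shift = np.zeros(dim)
    shift[0] = delta
    x0 = rng.standard_normal((n_per_class, dim))
    x1 = rng.standard_normal((n_per_class, dim)) + shift
    return x0, x1


def fit_lda(x0, x1, ridge=1e-6):
    """Fisher's LDA direction w = S^-1 (m1 - m0), S = pooled covariance."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    scatter = (x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)
    pooled = scatter / (len(x0) + len(x1) - 2)
    # Small ridge term: a numerical safeguard, not part of the source study.
    pooled = pooled + ridge * np.eye(pooled.shape[0])
    return np.linalg.solve(pooled, m1 - m0)


def auc(scores0, scores1):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    diff = scores1[:, None] - scores0[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def mean_az(n_train, dim=5, delta=1.5, n_test=500, n_trials=50, seed=0):
    """Mean resubstitution and hold-out Az over repeated training draws."""
    rng = np.random.default_rng(seed)
    resub, hold = [], []
    for _ in range(n_trials):
        x0, x1 = draw_classes(rng, n_train, dim, delta)
        w = fit_lda(x0, x1)
        # Resubstitution: score the same samples used for training.
        resub.append(auc(x0 @ w, x1 @ w))
        # Hold-out: score an independent draw from the same distributions.
        t0, t1 = draw_classes(rng, n_test, dim, delta)
        hold.append(auc(t0 @ w, t1 @ w))
    return float(np.mean(resub)), float(np.mean(hold))


if __name__ == "__main__":
    for n in (15, 50, 100):
        r, h = mean_az(n)
        print(f"n_train={n:3d}  resubstitution Az={r:.3f}  hold-out Az={h:.3f}")
```

Under these assumptions, the mean resubstitution Az should sit above the mean hold-out Az, with the gap narrowing as the training sample size per class grows, which is the kind of finite-sample bias the study sets out to quantify.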

Experimental study of recognition rate in statistical pattern classification based on finite size of design sample set

On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithm in Pattern Recognition

Small sample size effects in statistical pattern recognition: recommendations for practitioners

A Discriminant Model for the Pattern Recognition of Linearly Independent Samples

Achievable Rates for Pattern Recognition

Sequential recognition using a nonparametric ranking procedure

Supervised Pattern Recognition Involving Skewed Feature Densities

Constrained Linear Discrimination Analysis and Face Recognition

A Density-ratio Framework for Statistical Data Processing.

Minimax Deviation Strategies for Machine Learning and Recognition with Short Learning Samples

Effect of finite sample size on feature selection and classification: A simulation study

Probabilistic Safety Regions Via Finite Families of Scalable Classifiers

Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms

Linear representation of intra‐class discriminant features for small‐sample face recognition

Optimal Dimensionality of Metric Space for Classification

Evaluation of a decided sample size in machine learning applications

Dynamical pattern recognition for sampling sequences based on deterministic learning and structural stability

Enhancing the pattern recognition capacity of machine learning techniques: The importance of feature positioning

Repeated Observations for Classification

Discriminative Density-ratio Estimation

Asymptotic Generalization Bound of Fisher’s Linear Discriminant Analysis