A Comparison of Classification Accuracy Achieved with Wrappers, Filters and PCA

Andreas G. K. Janecek, Wilfried N. Gansterer
2008-01-01
Abstract: Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction produces linear combinations of the original attribute set. In this paper we investigate the relationship between attribute reduction techniques and the resulting classification accuracy for two very different application areas: on the one hand, we consider e-mail filtering, where various properties of e-mail messages are extracted, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. In the present work, subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of principal component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the classification accuracy. First results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.
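The last point of the abstract, that the variance captured by a principal component need not indicate how useful it is for classification, can be illustrated with a minimal NumPy sketch. The synthetic two-attribute data, the nearest-centroid classifier, and all names below are assumptions made purely for illustration; they are not the data sets, PCA variants, or learning algorithms used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # samples per class (hypothetical toy setting)

# Attribute 0: high-variance noise that is unrelated to the class.
# Attribute 1: low-variance attribute that carries the class signal.
labels = np.repeat([0, 1], n)
noise = rng.normal(0.0, 10.0, size=2 * n)
signal = rng.normal(0.0, 0.5, size=2 * n) + np.where(labels == 1, 2.0, -2.0)
X = np.column_stack([noise, signal])

# PCA via eigendecomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()        # fraction of variance per component
scores = Xc @ eigvecs                      # data projected onto the components


def centroid_accuracy(z, y):
    """Training accuracy of a nearest-centroid classifier on one 1-D feature."""
    c0, c1 = z[y == 0].mean(), z[y == 1].mean()
    pred = (np.abs(z - c1) < np.abs(z - c0)).astype(int)
    return float((pred == y).mean())


acc_pc1 = centroid_accuracy(scores[:, 0], labels)
acc_pc2 = centroid_accuracy(scores[:, 1], labels)

print("explained variance ratio:", explained)
print("accuracy using only PC1:", acc_pc1)
print("accuracy using only PC2:", acc_pc2)
```

In this constructed case the first component captures most of the variance yet classifies at close to chance level, while the second, low-variance component separates the classes almost perfectly. Real data will rarely be this extreme; the sketch only shows that such a mismatch is possible.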