Using Dimension Reduction to Improve the Classification of High-dimensional Data

Andreas Grünauer, Markus Vincze
DOI: https://doi.org/10.48550/arXiv.1505.06907
2015-05-26
Abstract: In this work we show that the classification performance of high-dimensional structural MRI data with only a small set of training examples is improved by the usage of dimension reduction methods. We assessed two different dimension reduction variants: feature selection by ANOVA F-test and feature transformation by PCA. On the reduced datasets, we applied common learning algorithms using 5-fold cross-validation. Training, tuning of the hyperparameters, as well as the performance evaluation of the classifiers was conducted using two different performance measures: Accuracy, and Receiver Operating Characteristic curve (AUC). Our hypothesis is supported by experimental results.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is how to improve the classification performance of high-dimensional structural MRI data with dimension reduction methods when only a small number of training samples is available. Specifically, the authors focus on the fact that high-dimensional data (such as structural MRI data) is prone to overfitting because of its large number of features. To address this, they study the impact of two different dimension reduction methods on classification performance:

1. **Feature Selection**: Features are selected with the ANOVA F-test.
2. **Feature Transformation**: Features are transformed with principal component analysis (PCA).

### Detailed Explanation

#### 1. Research Background

When machine learning algorithms process high-dimensional data, the risk of overfitting increases with the number of features. Dimension reduction methods can not only reduce this risk but also make training on high-dimensional data more efficient. This paper explores the impact of dimension reduction techniques on the performance of different classifiers.

#### 2. Method Overview

- **Feature Selection**: Use the ANOVA F-test to select the features that best separate the classes. The F-value of the ANOVA F-test is defined as:
  \[ F = \frac{MSB}{MSW} \]
  where
  - \(MSB\) is the between-group variance (mean square between groups):
    \[ MSB = \frac{\sum_{i} n_{i}(\bar{x}_{i} - \bar{x})^{2}}{m - 1} \]
    where \(n_{i}\) is the number of observations in the \(i\)-th group, \(\bar{x}_{i}\) is the sample mean of the \(i\)-th group, \(\bar{x}\) is the overall mean of all data, and \(m\) is the number of groups.
  - \(MSW\) is the within-group variance (mean square within groups):
    \[ MSW = \frac{\sum_{i,j}(x_{ij} - \bar{x}_{i})^{2}}{n - m} \]
    where \(x_{ij}\) is the \(j\)-th observation in the \(i\)-th group and \(n\) is the total number of observations.
- **Feature Transformation**: Use PCA to project the original high-dimensional data onto a lower-dimensional space. PCA reduces the data dimension by finding the first \(s\) orthogonal linear combinations of the features with the largest variance.

#### 3. Experimental Setup

- **Dataset**: The binary classification dataset provided by the MICCAI 2014 Machine Learning Challenge, which contains 250 T1-weighted structural brain MRI scans, each described by 184 morphological features.
- **Evaluation Metrics**: Accuracy and the area under the receiver operating characteristic curve (AUC).
- **Cross-Validation**: 5-fold cross-validation is used for model training and testing.

#### 4. Results and Discussion

- **Feature Selection**: With only 12 selected features, the classifiers already match or exceed their performance on the original 184 features. Adding further features does not significantly improve performance and may instead lead to overfitting.
- **Feature Transformation**: On the PCA-reduced data, the performance of RBF-SVM and KNN is independent of the number of principal components used. The other classifiers perform better than on the original features when using the first 3 principal components, drop in performance with 12 principal components, and perform best with 24 principal components.

#### 5. Conclusion

This study shows that dimension reduction methods (especially ANOVA F-test feature selection) can effectively improve the classification performance of high-dimensional structural MRI data, particularly when few training samples are available. In addition, on the reduced data, simple classifiers (such as GNB and Ridge) can achieve results comparable to or even better than those of more complex classifiers (such as RBF-SVM).
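To make the ANOVA F-test selection step concrete, the following is a minimal sketch in plain Python of the F-value formula given in the Method Overview, followed by a ranking of features by that value. It is not the authors' implementation: the function names `anova_f` and `select_top_k`, the toy data, and the handling of a zero within-group variance are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def anova_f(values: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F = MSB / MSW for a single feature (hypothetical helper)."""
    groups: Dict[int, List[float]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)

    n = len(values)              # total number of observations
    m = len(groups)              # number of groups (classes)
    grand_mean = sum(values) / n

    ssb = 0.0                    # between-group sum of squares
    ssw = 0.0                    # within-group sum of squares
    for g in groups.values():
        mean_g = sum(g) / len(g)
        ssb += len(g) * (mean_g - grand_mean) ** 2
        ssw += sum((x - mean_g) ** 2 for x in g)

    msb = ssb / (m - 1)
    msw = ssw / (n - m)
    if msw == 0.0:
        # Assumption: treat perfectly separated groups as infinitely informative
        # and a constant feature as uninformative.
        return float("inf") if msb > 0.0 else 0.0
    return msb / msw


def select_top_k(X: Sequence[Sequence[float]], y: Sequence[int], k: int) -> List[int]:
    """Return the column indices of the k features with the largest F-value."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (made up): feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [7.0, 4.0], [8.0, 5.0], [9.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]

print(anova_f([row[0] for row in X], y))  # 54.0
print(anova_f([row[1] for row in X], y))  # 0.0
print(select_top_k(X, y, k=1))            # [0]
```

In the setting described above, `k` would play the role of the number of retained features (for example 12 out of 184). To avoid an optimistic bias, the ranking would normally be computed on the training portion of each cross-validation fold only.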
Through these experiments, the authors demonstrate the importance and effectiveness of dimension reduction methods for the classification of high-dimensional data.
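The evaluation protocol summarized above (5-fold cross-validation scored by accuracy and AUC) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the helper names are hypothetical, the folds are contiguous and unshuffled (the summary does not say whether the folds were shuffled or stratified), and AUC is computed through its pairwise-ranking interpretation.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k contiguous folds; return (train, test) pairs."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# With 250 scans and 5 folds, each fold holds out 50 scans and trains on 200.
folds = k_fold_indices(250, k=5)
print([(len(train), len(test)) for train, test in folds])

# Made-up labels and classifier scores for illustration.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))            # 0.75
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))    # 0.75
```

Accuracy depends on a decision threshold, whereas AUC is computed from the ranking of the scores alone, which is one reason to report both measures.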