Assessing the Applicability of PCA in Clustering Analysis of Gene Expression Data

Zhengguo Zhang
2009-01-01
Abstract: Objective To investigate whether principal component analysis (PCA) helps improve clustering accuracy in microarray analysis. Methods Three microarray datasets with external criteria (Budding yeast, Saccharomyces cerevisiae, and Central nervous system) were examined. Using variation of information as the measure, the clustering accuracy obtained with principal components (PCs) was compared with that obtained with the original values. A greedy approach was used to find the best PC combinations, and two distance metrics (Euclidean distance and correlation) and two clustering algorithms (hierarchical clustering and K-medoids clustering) were compared. Results In all three datasets, hierarchical clustering achieved slightly better results than K-medoids clustering, but in either case PCA did not improve the clustering accuracy, and may even have reduced it. Only in the Saccharomyces cerevisiae dataset, when the number of PCs was large enough to cover 90%-95% of the variance, did certain combinations of PCs lead to a better clustering result. However, there was no regular pattern to follow. Conclusion In most microarray experiments without a well-known background model, PCs extracted from the datasets should not be used blindly as inputs to clustering algorithms.
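The evaluation described in the abstract can be illustrated with a minimal sketch. It computes the variation of information between a clustering and an external reference partition (lower means closer agreement, 0 means identical partitions), and uses it to drive a greedy forward selection over PC indices. The abstract does not specify the details of the greedy procedure, so the forward-selection scheme, the function names, and the `cluster_fn` callback (a hypothetical function that clusters the data projected onto the given PCs and returns one label per sample) are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two partitions.

    VI = H(A) + H(B) - 2 * I(A; B); it is 0 when the partitions are
    identical up to a relabeling of the clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    if n == 0:
        raise ValueError("labelings must not be empty")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


def greedy_pc_selection(n_components, reference, cluster_fn):
    """Greedy forward selection of PC indices (an assumed scheme).

    cluster_fn(pc_indices) is a hypothetical callback that returns one
    cluster label per sample for the data restricted to those PCs.
    Selection stops as soon as adding a PC no longer lowers the VI.
    """
    selected = []
    best_vi = float("inf")
    remaining = set(range(n_components))

    while remaining:
        # Score every candidate PC when added to the current selection.
        scored = [
            (variation_of_information(cluster_fn(selected + [pc]), reference), pc)
            for pc in sorted(remaining)
        ]
        vi, pc = min(scored)
        if vi >= best_vi:
            break  # no candidate improves agreement with the reference
        selected.append(pc)
        remaining.remove(pc)
        best_vi = vi

    return selected, best_vi
```

In practice `cluster_fn` would wrap whichever combination of distance metric and clustering algorithm is under study, for example hierarchical clustering with a correlation-based distance. Running the same selection on the original expression values gives the baseline against which the PC-based result is compared.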