File Type Detection Algorithm Based on Principal Component Analysis and K Nearest Neighbors

YAN Mengdi, QIN Linlin, WU Gang
DOI: https://doi.org/10.11772/j.issn.1001-9081.2016.11.3161
2016-01-01
Journal of Computer Applications
Abstract: Identifying file types by file suffix or by file-specific features can yield low recognition accuracy. To address this problem, a new content-based file-type detection algorithm was proposed, based on Principal Component Analysis (PCA) and K Nearest Neighbors (KNN). First, the PCA algorithm was used to reduce the dimension of the sample space. Then, by clustering the training samples, each file type was represented by its cluster centroids. To reduce the error caused by unbalanced training samples, a KNN algorithm based on distance weighting was proposed. The experimental results show that, with a large number of training samples, the improved algorithm reduces computational complexity while maintaining a high recognition accuracy rate. Because the algorithm does not depend on the features of any particular file type, it can be applied more widely.
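The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the content features, the clustering method, or the weighting function, so the sketch assumes byte-frequency histograms as features, a plain k-means for the clustering step, and inverse-distance vote weights. All function names (`byte_histogram`, `fit`, `predict`, and so on) and all parameter defaults are hypothetical.

```python
import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    """Normalized 256-bin byte-frequency vector (assumed feature choice)."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / max(len(data), 1)

def fit_pca(X, n_components):
    """Return the sample mean and the top principal directions via SVD."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Map samples into the reduced PCA space."""
    return (X - mean) @ components.T

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means (assumed clustering method); returns k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for j in range(k):
            members = X[nearest == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def fit(samples_by_type, n_components=2, k_clusters=2):
    """Fit PCA on all samples, then keep only per-type cluster centroids."""
    feats = {t: np.vstack([byte_histogram(s) for s in ss])
             for t, ss in samples_by_type.items()}
    mean, comps = fit_pca(np.vstack(list(feats.values())), n_components)
    centroids, labels = [], []
    for ftype, X in feats.items():
        Z = project(X, mean, comps)
        c = kmeans(Z, min(k_clusters, len(Z)))
        centroids.append(c)
        labels += [ftype] * len(c)
    return mean, comps, np.vstack(centroids), labels

def predict(data, model, k=3, eps=1e-9):
    """Distance-weighted KNN over centroids (assumed weight: 1 / distance)."""
    mean, comps, centroids, labels = model
    z = project(byte_histogram(data)[None, :], mean, comps)[0]
    d = np.linalg.norm(centroids - z, axis=1)
    votes = {}
    for i in np.argsort(d)[:k]:
        votes[labels[i]] = votes.get(labels[i], 0.0) + 1.0 / (d[i] + eps)
    return max(votes, key=votes.get)
```

Because classification compares a query only against the stored centroids rather than against every training sample, the cost of `predict` grows with the number of centroids, which is one way to read the abstract's claim of reduced computational complexity for large training sets. Weighting votes by inverse distance keeps a type with many centroids from outvoting a closer centroid of a rarer type.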