Text Categorization Via Attribute Distance Weighted K-Nearest Neighbor Classification.

Herman Masindano Wandabwa, Defu Zhang, Korir Sammy
DOI: https://doi.org/10.1109/icit.2016.053
2016-01-01
Abstract: Text categorization entails deciding whether a document belongs to one of a set of pre-specified classes of documents. This can be done in a supervised way, as in classification tasks, or in an unsupervised way, as in clustering-related tasks. Categorization can be challenging, especially when the number of discriminating words is large. K-Nearest Neighbor (KNN) is an instance-based learning algorithm that has proven effective in such classification tasks, including document classification. The key element of this algorithm is its similarity measurement principle, which can identify the neighbors of a particular document with high accuracy. The main drawback of this approach is that all features are weighted to determine the distance between the documents in question. This is not only time-consuming but also overuses computing resources without adding anything substantial to the overall results. In our approach (Attribute Distance Weighted KNN), we do not use all features in the corpus but first extract the most relevant ones by weighting them in relation to the corpus. We then calculate the distance between the highly ranked features alone, as a representative of the entire document set. To our knowledge, no existing literature has taken this approach, so our comparison is made against the classical KNN measure. Our approach showed a marginal difference in distance-measure performance compared with classical KNN.
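The two-stage idea in the abstract (rank features against the corpus, then measure distance only over the top-ranked ones) can be illustrated with a minimal Python sketch. The abstract does not state the weighting scheme, the distance function, or the voting rule, so everything concrete here is an assumption: features are ranked by summed TF-IDF, distance is Euclidean over raw term counts, and neighbors vote with inverse-distance weights. The function names (`rank_features`, `vectorize`, `knn_predict`) and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also strip punctuation.
    return text.lower().split()


def rank_features(docs, top_k):
    """Return the top_k terms ranked by summed TF-IDF over the corpus.

    Assumption: TF-IDF stands in for the paper's unspecified feature weighting.
    """
    n_docs = len(docs)
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = defaultdict(float)
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            scores[term] += (count / len(tokens)) * idf
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def vectorize(text, features):
    # Represent a document by its term counts over the selected features only.
    counts = Counter(tokenize(text))
    return [float(counts[f]) for f in features]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, train_docs, train_labels, features, k=3):
    """Classify query by an inverse-distance-weighted vote of its k nearest neighbors.

    Assumption: inverse-distance voting is one common reading of "distance weighted".
    """
    q = vectorize(query, features)
    neighbors = sorted(
        (euclidean(q, vectorize(doc, features)), label)
        for doc, label in zip(train_docs, train_labels)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + 1e-9)  # small epsilon avoids division by zero
    return max(votes, key=votes.get)


# Hypothetical toy corpus, for illustration only.
train_docs = [
    "the match ended with a late goal",
    "the striker scored a goal in the match",
    "the bank raised interest rates",
    "interest rates and bank loans rose",
]
train_labels = ["sport", "sport", "finance", "finance"]

features = rank_features(train_docs, top_k=8)
print(features)
print(knn_predict("a goal in the final match", train_docs, train_labels, features))
```

Restricting `vectorize` to the selected features is what shrinks the distance computation: each comparison costs time proportional to `top_k` rather than to the full vocabulary. Note that a naive TF-IDF ranking can still keep frequent function words such as "the", so a stop-word filter would likely be needed in practice.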