Abstract: Traditional text classification based on machine learning and data mining techniques has made considerable progress. However, drawing an exact decision boundary between relevant and irrelevant objects in binary classification remains difficult because of the uncertainty that traditional algorithms produce. The proposed model, CTTC (Centroid Training for Text Classification), aims to build an uncertainty boundary that absorbs as many indeterminate objects as possible, thereby raising the certainty of the relevant and irrelevant groups through a centroid clustering and training process. The clustering starts from the two training subsets, labelled relevant and irrelevant respectively, and creates two principal centroid vectors, by which all training samples are further separated into three groups: POS, NEG and BND, with all indeterminate objects absorbed into the uncertain decision boundary BND. Two pairs of centroid vectors are then trained and optimized through a subsequent iterative multi-learning process, and all of them collaboratively help predict the polarities of incoming objects. To assess the proposed model, F-1 and Accuracy were chosen as the key evaluation measures. We stress the F-1 measure because it displays the overall performance improvement of the final classifier better than Accuracy. A large number of experiments were completed using the proposed model on the Reuters Corpus Volume 1 (RCV1), an important standard dataset in the field. The experimental results show that the proposed model significantly improves binary text classification performance in both F-1 and Accuracy compared with three other influential baseline models.
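
The three-way split described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact rule that assigns a sample to POS, NEG or BND, so the cosine-similarity margin used here is an assumption made purely for illustration, as are the names `three_way_split` and `margin` and the toy data. The iterative training of the two pairs of centroid vectors is not shown.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def three_way_split(samples, labels, margin=0.2):
    """Partition training samples into POS, NEG and BND index lists.

    ASSUMPTION: the margin rule below is illustrative and is not taken
    from the paper. A sample goes to BND when its similarities to the
    two principal centroids differ by no more than `margin`.
    """
    # Principal centroids built from the labelled training subsets.
    pos_c = centroid([x for x, y in zip(samples, labels) if y == 1])
    neg_c = centroid([x for x, y in zip(samples, labels) if y == 0])
    groups = {"POS": [], "NEG": [], "BND": []}
    for i, x in enumerate(samples):
        diff = cosine(x, pos_c) - cosine(x, neg_c)
        if diff > margin:
            groups["POS"].append(i)
        elif diff < -margin:
            groups["NEG"].append(i)
        else:
            groups["BND"].append(i)  # indeterminate object
    return groups

# Toy term-weight vectors (hypothetical data); label 1 = relevant, 0 = irrelevant.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
labels = [1, 1, 0, 0, 1]
groups = three_way_split(samples, labels)
print(groups)
```

In this toy run the last sample is labelled relevant but lies almost equally close to both centroids, so it falls into BND rather than POS, which is the kind of indeterminate object the boundary is meant to absorb.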

The Weighted KNN Text Categorization Algorithm Based on Training Set Cutting

An Improved K-Nearest Neighbor Algorithm for Text Categorization

Efficient KNN Text Categorization Based on Multiedit and Condensing Techniques

An adaptive k-nearest neighbor text categorization strategy

An Improved KNN Text Classification Algorithm Based on Clustering

Improved KNN Algorithm based on Probability and Adaptive K Value.

An Effective Feature Selection Method For Text Categorization

A text classification method based on improved KNN algorithm

Improved KNN using clustering algorithm

Improved KNN Text Categorization

Research on Feature Selection and Knn Classification Method in Chinese Text Classification

Improving Performance of the k-Nearest Neighbor Classifier by Combining Feature Selection with Feature Weighting.

Use relative weight to improve the kNN for unbalanced text category

Text Categorization Via Attribute Distance Weighted K-Nearest Neighbor Classification.

Combining Feature Selection with Feature Weighting for k-NN Classifier

Accurate Knn Chinese Text Classification Via Multiple Strategies

Accelerated K-Nearest Neighbors Algorithm Based on Principal Component Analysis for Text Categorization

A kNN Text Categorization Algorithm Base on χ² Statistic

Centroid Training to Achieve Effective Text Classification

Improvement of Density-Based Method for Reducing Training Data in KNN Text Classification

Study on Categorization Techniques of Chinese Web Text Using KNN