Abstract: k is the most important parameter in a text categorization system based on the k-Nearest Neighbor algorithm (kNN). In the classification process, the k training documents nearest to the test document are determined first. A prediction is then made according to the category distribution among these k nearest neighbors. In general, the class distribution in the training set is uneven: some classes may have more samples than others. System performance is therefore very sensitive to the choice of the parameter k, and a fixed k value is very likely to bias predictions toward large categories. To deal with these problems, we propose an improved kNN algorithm that uses different numbers of nearest neighbors for different categories, rather than a fixed number across all categories. More samples (nearest neighbors) are used to decide whether a test document should be assigned to a category that has more samples in the training set. Preliminary experiments on Chinese text categorization show that our method is less sensitive to the parameter k than the traditional one, and that it can properly classify documents belonging to smaller classes even with a large k. The method is promising for cases where estimating the parameter k via cross-validation is not allowed.
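The idea of a per-category neighbor count can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration: the scaling rule in `per_class_k` (k scaled by class size relative to the largest class, with a floor of 1), the cosine similarity over sparse term-weight dictionaries, and the similarity-weighted score in `classify`. The function names are hypothetical and this is not the authors' implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def per_class_k(labels, k):
    """Assumed rule: scale k by each class's size relative to the largest class."""
    sizes = Counter(labels)
    largest = max(sizes.values())
    return {c: max(1, math.ceil(k * n / largest)) for c, n in sizes.items()}


def classify(train_docs, train_labels, test_doc, k):
    """Score each class using only its own number of nearest neighbors."""
    # Rank all training documents by similarity to the test document.
    ranked = sorted(
        ((cosine(test_doc, d), y) for d, y in zip(train_docs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = {}
    for c, kc in per_class_k(train_labels, k).items():
        top = ranked[:kc]  # larger classes look at more neighbors
        total = sum(s for s, _ in top)
        in_class = sum(s for s, y in top if y == c)
        # Assumed score: share of similarity mass in the top-kc that belongs to c.
        scores[c] = in_class / total if total else 0.0
    return max(scores, key=scores.get), scores
```

With a training set of six documents in one class and two in another and k = 6, this rule gives the large class six neighbors and the small class two, so the small class is judged only on its closest neighbors and is not outvoted merely because k exceeds its size.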

An Improved K-Nearest Neighbor Algorithm for Text Categorization