Abstract: In text categorization, the number of documents often differs across categories, i.e., the class distribution is imbalanced. We propose a unique approach to improve text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples for rare classes (categories with relatively small amounts of training data) using the global semantic information of each class, as represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and the performance of text categorization can be improved using this transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents across classes by re-sampling the documents in rare classes and can therefore cause overfitting. Another benefit of our approach is its effective handling of noisy samples: since all the new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed method achieves better performance under class imbalance and is more tolerant of noisy samples.
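
The generation step described in the abstract can be illustrated with a minimal sketch. It assumes that a topic model has already been fitted for each class and is available as a topic mixture (theta) plus per-topic word distributions (phi); the function names, the toy parameters, and the choice to top up each rare class to the size of the largest class are all hypothetical and are not taken from the paper.

```python
import random


def sample_document(theta, phi, length, rng):
    """Draw one synthetic document from a class-level topic model.

    theta: list of topic probabilities for the class (assumed to sum to 1).
    phi:   list of dicts, phi[k][word] = P(word | topic k).
    """
    topic_ids = list(range(len(theta)))
    words = []
    for _ in range(length):
        # Pick a topic for this token, then a word from that topic.
        k = rng.choices(topic_ids, weights=theta)[0]
        vocab = list(phi[k])
        word = rng.choices(vocab, weights=[phi[k][v] for v in vocab])[0]
        words.append(word)
    return words


def balance_with_topic_models(docs_by_class, models, doc_length=8, seed=0):
    """Top up every class to the size of the largest class (an assumption)."""
    rng = random.Random(seed)
    target = max(len(docs) for docs in docs_by_class.values())
    balanced = {}
    for label, docs in docs_by_class.items():
        theta, phi = models[label]
        synthetic = [
            sample_document(theta, phi, doc_length, rng)
            for _ in range(target - len(docs))
        ]
        balanced[label] = list(docs) + synthetic
    return balanced


# Toy data: "finance" is the rare class. All values are made up.
docs_by_class = {
    "sports": [["match", "goal"], ["goal", "team"], ["team", "match"], ["goal"]],
    "finance": [["bank", "rate"]],
}
models = {
    "sports": ([1.0], [{"match": 0.4, "goal": 0.4, "team": 0.2}]),
    "finance": (
        [0.7, 0.3],
        [{"bank": 0.6, "rate": 0.4}, {"stock": 0.5, "market": 0.5}],
    ),
}
balanced = balance_with_topic_models(docs_by_class, models)
```

Because the synthetic documents are drawn from class-level word distributions rather than copied from individual training documents, they do not duplicate any single (possibly noisy) sample, which is the intuition behind the overfitting and noise-tolerance claims above. In practice the per-class parameters would be estimated with a topic model such as LDA instead of being written by hand.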

Improving Domain Dictionary-based Text Categorization Using Self-partition Model

Text Categorization Based on Domain Ontology

Improving Short Text Classification Through Better Feature Space Selection

Improving Text Categorization Using the Importance of Words in Different Categories

A Kind of Self-Constructed Category Dictionary in Chinese Text Classification

Aggressive Dimensionality Reduction With Reinforcement Local Feature Selection For Text Categorization

Improved VSM Based on Chinese Text Categorization

Chinese Text Categorization Without Word Segmentation Using String Kernel

Experimental Study On Representing Units In Chinese Text Categorization

Text Categorization with Lee Model

Chinese Short-Text Categorization Based on the Key Classification Dictionary Words

Text Representations for Text Categorization: A Case Study in Biomedical Domain

Learning Effective Features for Chinese Text Categorization

Research and Implementation of Related Algorithm of Chinese Text Categorization

An Improved Text Categorization Algorithm Based on VSM

An Improved Random Forest Classifier for Text Categorization

Exploiting Probabilistic Topic Models to Improve Text Categorization under Class Imbalance

Hierarchical Categorization Methods of Chinese Text Based on Vector Space Model

Text Categorization Method Based on Improved Mutual Information and Characteristic Weights Evaluation Algorithms

A Text Categorization Method Based on SVM and Improved K-Means

Context Based Feature Description Model in Chinese Text Categorization