Abstract: Text classification is a key process for organizing online texts and supporting communication in the Digital Media Age. Because text classification builds its classification rules on text features, accurate feature selection is its foundation. As the volume of Chinese electronic documents grows rapidly, scholars have in recent years developed a number of feature selection algorithms for the automatic classification of Chinese texts. However, how to adapt existing feature selection algorithms to different types of Chinese texts has not been adequately discussed. To address this gap, this study proposes three improved feature selection algorithms and tests their performance on different types of Chinese texts: an enhanced chi-square (CHI) algorithm combined with mutual information (MI), which introduces both word frequency and term adjustment (CHMI); a term frequency–CHI square (TF–CHI) algorithm, which enhances weight calculation; and a term frequency–inverse document frequency (TF–IDF) algorithm enhanced with the extreme gradient boosting (XGBoost) algorithm, which improves word filtering (TF–XGBoost). The study randomly selects 3000 texts from six categories of the Sogou news corpus, obtains the confusion matrix, and evaluates the new algorithms using precision and the F1-score. Experimental comparisons are conducted on support vector machine (SVM) and naive Bayes (NB) classifiers. The results demonstrate that the proposed feature selection algorithms improve performance across various news corpora, although the best feature selection scheme differs for each type of corpus. Further work is suggested on applying the improved feature selection methods to other languages and on improving the classifiers.
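As a point of reference for the terms used in the abstract, the sketch below shows the classic chi-square (CHI) term score and how precision and the F1-score are read off a confusion matrix. It is only a baseline illustration, not the CHMI, TF–CHI, or TF–XGBoost variants proposed in the paper, whose formulas are not given here. The function names, the toy documents, and the assumption that texts are already segmented into tokens (ASCII stand-ins for Chinese words) are all hypothetical.

```python
from collections import Counter


def chi_square_scores(docs, labels, category):
    """Classic chi-square score of every term against one category.

    docs: list of token lists (assumed already word-segmented).
    labels: list of category names, one per document.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == category)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()  # document frequencies
    for tokens, y in zip(docs, labels):
        target = df_pos if y == category else df_neg
        target.update(set(tokens))  # count each term once per document
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # in category, contains term
        b = df_neg[term]   # other categories, contains term
        c = n_pos - a      # in category, lacks term
        d = n_neg - b      # other categories, lacks term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_terms(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def precision_f1(confusion, cls):
    """Precision and F1 for class index cls.

    confusion[i][j] = number of documents of true class i predicted as j.
    """
    tp = confusion[cls][cls]
    predicted = sum(row[cls] for row in confusion)
    actual = sum(confusion[cls])
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, f1


# Toy, made-up data: ASCII tokens stand in for segmented Chinese words.
DOCS = [["stock", "rise"], ["stock", "market"],
        ["team", "match"], ["team", "rise"]]
LABELS = ["finance", "finance", "sports", "sports"]

if __name__ == "__main__":
    scores = chi_square_scores(DOCS, LABELS, "finance")
    print(top_k_terms(scores, 2))
    print(precision_f1([[8, 2], [1, 9]], 0))
```

Note that the classic statistic is symmetric: a term that appears only outside the category ("team" in the toy data) scores as high as one that appears only inside it, which is one reason adjusted variants of CHI exist.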

Adapting Feature Selection Algorithms for the Classification of Chinese Texts