Abstract: Text classification is a key process for organizing online texts and improving communication in the Digital Media Age. Because it builds classification rules from text features, accurate feature selection is the foundation of text classification. In response to the rapid growth of Chinese electronic documents, scholars have in recent years developed many feature selection algorithms for the automatic classification of Chinese texts. However, there has been little discussion of how to adapt existing feature selection algorithms to different types of Chinese texts. To address this gap, this study proposes three improved feature selection algorithms and tests their performance on different types of Chinese texts: an enhanced chi-square (CHI) with mutual information (MI) algorithm (CHMI), which introduces word frequency and term adjustment simultaneously; a term frequency–chi-square (TF–CHI) algorithm, which enhances weight calculation; and a term frequency–inverse document frequency (TF–IDF) algorithm enhanced with the extreme gradient boosting (XGBoost) algorithm (TF–XGBoost), which improves the algorithm's ability to filter words. The study randomly selects 3000 texts from six categories of the Sogou news corpus, computes the confusion matrix, and evaluates the new algorithms by precision and F1-score. Experimental comparisons are conducted with support vector machine (SVM) and naive Bayes (NB) classifiers. The results show that the proposed feature selection algorithms improve performance across the news corpora, although the best feature selection scheme differs for each type of corpus. Further studies are suggested on applying the improved feature selection methods to other languages and on improving the classifiers.
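
As a rough illustration of two ingredients named in the abstract, the sketch below computes the standard (unmodified) chi-square score of a term for a class from document counts, and the precision and F1-score of a class from a confusion matrix. It is not the CHMI, TF–CHI, or TF–XGBoost method proposed in the paper. The function names, the argument names, and the convention that confusion-matrix rows are true classes and columns are predicted classes are all assumptions made for this example.

```python
def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    (Hypothetical argument names, chosen for this sketch.)
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table: the term or the class never varies.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def precision_and_f1(confusion, k):
    """Precision and F1-score of class k.

    Assumes confusion[i][j] counts documents of true class i
    that were predicted as class j.
    """
    tp = confusion[k][k]
    predicted_k = sum(row[k] for row in confusion)  # column sum
    actual_k = sum(confusion[k])                    # row sum
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    if precision + recall == 0:
        return precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1
```

A term that appears in every document of a class and in no other document receives the highest score for that table size, while a term spread evenly across classes scores zero, which is why chi-square is commonly used to rank candidate features.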


Adapting Feature Selection Algorithms for the Classification of Chinese Texts