Abstract: Text classification is a key process for organizing online texts for better communication in the digital media age. Because classification rules are built from text features, accurate feature selection is the basis of text classification. Faced with a fast-growing volume of Chinese electronic documents, scholars have in recent years developed many feature selection algorithms for the automatic classification of Chinese texts. However, how to adapt existing feature selection algorithms to different types of Chinese texts has not been adequately discussed. To address this, this study proposes three improved feature selection algorithms and tests their performance on different types of Chinese texts: an enhanced CHI square with mutual information (MI) algorithm that simultaneously introduces word frequency and term adjustment (CHMI); a term frequency–CHI square (TF–CHI) algorithm that enhances weight calculation; and a term frequency–inverse document frequency (TF–IDF) algorithm enhanced with the extreme gradient boosting (XGBoost) algorithm, which improves word filtering (TF–XGBoost). The study randomly selects 3000 texts from six categories of the Sogou news corpus, obtains the confusion matrix, and evaluates the new algorithms by precision and F1-score. Experimental comparisons are conducted with support vector machine (SVM) and naive Bayes (NB) classifiers. The results show that the proposed feature selection algorithms improve performance across the news corpora, although the best feature selection scheme differs for each type of corpus. Further studies are suggested on applying the improved feature selection methods to other languages and on improving the classifiers.
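
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the classical CHI square term score and of per-class precision and F1 computed from a confusion matrix. It is not the paper's CHMI, TF–CHI, or TF–XGBoost method; the function names, the toy tokens, and the category labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Classical chi-square score for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def score_terms(docs, labels, category):
    """Return {term: chi-square score} for one category over tokenized docs."""
    in_cat = sum(1 for y in labels if y == category)
    out_cat = len(labels) - in_cat
    df_in, df_out = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, so each term counts once per document.
        (df_in if y == category else df_out).update(set(tokens))
    scores = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        scores[term] = chi_square(a, b, in_cat - a, out_cat - b)
    return scores


def precision_recall_f1(confusion, k):
    """Per-class metrics from a confusion matrix (rows: true, columns: predicted)."""
    tp = confusion[k][k]
    predicted = sum(row[k] for row in confusion)
    actual = sum(confusion[k])
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical, already-segmented toy documents (not from the Sogou corpus).
docs = [["股票", "上涨"], ["股票", "基金"], ["比赛", "进球"], ["比赛", "上涨"]]
labels = ["finance", "finance", "sports", "sports"]
scores = score_terms(docs, labels, "finance")
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup a term that appears only in one category receives the highest score, while a term spread evenly across categories scores zero, which is the behavior a feature selector relies on when keeping the top-ranked terms.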

Improving short text classification using public search engines

Improving Short Text Classification Through Better Feature Space Selection

Extracting Novel Features for E-Commerce Page Quality Classification

Research on Deep Web Classification Based on Domain Feature Text

Short Text Classification Based on Strong Feature Thesaurus

Combining Lexical and Semantic Features for Short Text Classification

Adapting Feature Selection Algorithms for the Classification of Chinese Texts

Extremely Short Chinese Text Classification Method Based on Bidirectional Semantic Extension

Key Information Retrieval to Classify the Unstructured Data Content of Preferential Trade Agreements

Improving Medical Short Text Classification with Semantic Expansion Using Word-Cluster Embedding

Internet Information Search Based Approach to Enriching Textual Descriptions for Public Web Services

An Improved Measuring Similarity For Short Text Snippets And Its Application In Clustering Search Engine

A News Headlines Classification Method Based on the Fusion of Related Words

Optimizing News Text Classification with Bi-LSTM and Attention Mechanism for Efficient Data Processing

Research on Chinese Text Classification Based on WAE and SVM

Short Text Classification Improved by Feature Space Extension

Short-Text Classification Detector: A Bert-Based Mental Approach

Exploiting Text Content In Image Search By Semi-Supervised Learning Techniques

Short text classification based on bidirectional TCN and attention mechanism