Abstract:

Purpose: In the era of Big Data, networked digital resources are growing rapidly, and short-text resources such as tweets, comments and messages are growing especially vigorously. This study aims to compare the category discriminative capacity (CDC) of Chinese language fragments of different granularities and to explore and verify the feasibility, rationality and effectiveness of low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).

Design/methodology/approach: This study takes the discipline classification of journal articles from CSSCI as a simulation environment. After sorting out the distribution rules of classification features of various granularities, including keywords, terms and characters, the classification results obtained with the SVM algorithm are comprehensively compared and evaluated from three angles: using the same experimental samples, testing before and after feature optimization, and introducing external data.

Findings: The granularity of a classification feature has an important impact on CSTC. In general, the larger the granularity, the better the classification result, and vice versa. However, a low-granularity feature is also feasible, and its CDC can be improved by reasonable weight setting, even exceeding that of a high-granularity feature when classification precision, computational complexity and text coverage are considered together.

Originality/value: This is the first study to propose that Chinese characters are more suitable as descriptive features in CSTC than terms and keywords, and to demonstrate that the CDC of Chinese character features can be strengthened by combining frequency and position as the weight.
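As a rough illustration of what weighting character features by frequency and position could look like, the following Python sketch builds a weighted character vector for one short document. The abstract does not state the weighting formula, so everything here is an assumption made for illustration: the field names `title` and `body`, the values in `POSITION_WEIGHTS`, and the helper `char_features` are hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical position weights (an assumption, not from the study):
# characters in a title count more than characters in the body.
POSITION_WEIGHTS = {"title": 2.0, "body": 1.0}


def char_features(fields):
    """Map a document, given as {field_name: text}, to weighted character features."""
    weights = Counter()
    for field, text in fields.items():
        position_weight = POSITION_WEIGHTS.get(field, 1.0)
        for ch in text:
            if ch.isspace():
                continue
            # Frequency accumulates through repeated additions;
            # the position weight scales each occurrence.
            weights[ch] += position_weight
    return dict(weights)


doc = {"title": "文本分类", "body": "短文本分类研究"}
features = char_features(doc)
```

A dictionary like `features` could then be turned into a vector over a fixed character vocabulary and passed to an SVM implementation of the reader's choice. Because the set of distinct Chinese characters is far smaller than the set of terms or keywords, such a vector stays short, which fits the abstract's point about computational complexity and text coverage.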

Discrimination of Chinese quantitative style features based on text clustering

A Study on Chinese Quantitative Stylistic Features and Relation among Different Styles Based on Text Clustering.

Analysis On Chinese Quantitative Stylistic Features Based On Text Mining

Application of Quantitative Characteristics of Chinese Genres in Text Clustering

Quantitative Stylistic Analysis of Middle Chinese Texts Based on the Dissimilarity of Evolutive Core Word Usage

A Quantitative Approach to the Stylistic Assessment of the Middle Chinese Texts

Finding Common Features in Multilingual Fake News: a Quantitative Clustering Approach

Mining Stylistic Features of Rhythm and Tempo Based on Text Clustering

Seeing Various Adventures Through a Mirror: Detecting Translator's Stylistic Visibility in Chinese Translations of Alice's Adventure in Wonderland

Typological Features of Zhuang from the Perspective of Word Frequency Distribution.

Word Class, Syntactic Function and Style: A Comparative Study Based on Annotated Corpora

Thematic Concentration As a Discriminating Feature of Text Types

A Comparative Study on Representing Units in Chinese Text Clustering

Analyzing documents with Quantum Clustering: A novel pattern recognition algorithm based on quantum mechanics.

Text clustering on authorship attribution based on the features of punctuations usage

N-grams based feature selection and text representation for Chinese Text Classification

Distributional Character Clustering For Chinese Text Categorization

Corpus-based Quantitative Analysis on Stylistic Difference of Chinese Synonyms

on Chinese Orientation Analysis

A Paper-Text Perspective

Text Stream Clustering Algorithm Based on Adaptive Feature Selection.