Analysis On Chinese Quantitative Stylistic Features Based On Text Mining

Renkui Hou, Minghu Jiang
DOI: https://doi.org/10.1093/llc/fqu067
IF: 1.299
2016-01-01
Digital Scholarship in the Humanities
Abstract: In this article, data mining is used to examine whether certain linguistic features, taking parts of speech (POS) as an example, can serve as quantitative stylistic features of Chinese. Put another way, the article explores a method for determining Chinese quantitative stylistic features. Texts in six styles (news, science, official, art, TV conversation, and daily conversation) were selected to build the corpus for the study. Text vectors characterized by POS were analyzed by principal component analysis and clustered by agglomerative hierarchical clustering. The results of both analyses indicate that POS can be used as a distinctive feature of texts. A support vector machine was then trained on the training data to build a classification model, and precision and recall were used to validate the text classification results. Random forest was used to compute the importance of each POS, i.e. its contribution to classification, and text vectors characterized by the important POS were subsequently clustered and classified. The experimental results show that POS can be taken as a Chinese quantitative stylistic feature, and that clustering and classification results are better when texts are characterized by the 60 most important POS.
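As a rough illustration of two steps the abstract mentions, the sketch below builds a POS-based text vector and computes precision and recall for one style label. It assumes each text is already segmented and POS-tagged as (word, tag) pairs. The function names `pos_vector` and `precision_recall`, the four-tag tagset, and the toy data are hypothetical; the abstract does not specify the tagger, the tagset, or how frequencies were normalized, so relative frequency is used here only as one plausible choice.

```python
from collections import Counter


def pos_vector(tagged_tokens, tagset):
    """Return the relative frequency of each tag in `tagset` for one text.

    `tagged_tokens` is a list of (word, tag) pairs. Tags outside `tagset`
    are ignored. Relative frequency is an assumption, not the paper's
    confirmed normalization.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts[tag] for tag in tagset)
    if total == 0:
        return [0.0] * len(tagset)
    return [counts[tag] / total for tag in tagset]


def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for a single style label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy input: a four-tag tagset and one short tagged text.
TAGSET = ["n", "v", "a", "d"]
sample_text = [("记者", "n"), ("报道", "v"), ("新", "a"), ("政策", "n")]
sample_vector = pos_vector(sample_text, TAGSET)
```

Vectors of this kind, one per text, are the sort of input that principal component analysis, hierarchical clustering, or a support vector machine would then take. A per-label `precision_recall` call on held-out predictions is one way to check how well each style is recognized.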