Application of Quantitative Characteristics of Chinese Genres in Text Clustering

Huang Wei (黄伟), Liu Haitao (刘海涛)
DOI: https://doi.org/10.3778/j.issn.1002-8331.2009.29.007
2009-01-01
Abstract: This paper presents a method for applying findings from quantitative linguistics to text clustering. Sixteen linguistic structures whose distributions differ markedly between spoken and written Chinese are investigated using two sample corpora of half a million words each. In a text clustering experiment, test texts represented by 7 of these linguistic structures are correctly clustered into spoken (similarity = 89.84%) and written (similarity = 86.93%) classes. Representing texts by the quantitative characteristics of linguistic structures enhances the interpretability of the results, and the approach is feasible and of both theoretical and practical significance for text clustering and text classification. Corpora and statistics are methodologically significant for the descriptive study of Chinese genres, and the theoretical foundations of that study are also discussed.
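The general idea of representing each text by the frequencies of a few linguistic markers and then clustering the resulting vectors can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' procedure: the four marker tokens, the toy sentences, the per-1,000-token normalization, the cosine similarity measure, and the average-linkage clustering are all hypothetical choices, since the abstract does not name the 7 structures or the clustering algorithm actually used.

```python
import math

# Hypothetical marker tokens (NOT the structures used in the paper):
# two particles assumed typical of speech, two words assumed typical of writing.
MARKERS = ["啊", "吧", "之", "其"]

def feature_vector(tokens, markers=MARKERS):
    """Rate of each marker per 1,000 tokens of the text."""
    n = len(tokens)
    return [1000.0 * sum(1 for t in tokens if t == m) / n for m in markers]

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(vectors, k=2):
    """Average-linkage agglomerative clustering; returns k lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest mean similarity.
        _, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Toy, pre-segmented texts invented for illustration (indices 0-1 spoken-like, 2-3 written-like).
texts = [
    "你 去 吧 , 好 啊".split(),
    "是 啊 , 走 吧 , 快 走 吧".split(),
    "其 结果 之 意义 在于 此".split(),
    "研究 之 目的 及 其 方法 之 说明".split(),
]

vectors = [feature_vector(t) for t in texts]
groups = cluster(vectors, k=2)
print(groups)
```

On this toy data the two spoken-like samples fall into one group and the two written-like samples into the other. Because each dimension of the vector corresponds to a named linguistic marker, one can inspect which markers drive a grouping, which is the kind of interpretability the abstract attributes to this representation.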