Text clustering on authorship attribution based on the features of punctuations usage

Jin Mingzhe, Minghu Jiang
DOI: https://doi.org/10.1109/ICoSP.2012.6492012
2012-01-01
Abstract: This paper proposes a method for extracting the writing characteristics of individual authors from their use of punctuation marks. Using 200 articles by five well-known modern writers, it compares the text clustering performance of the proposed method against a character bigram method. The analysis also compares three distance measures used in the clustering: Euclidean distance, cosine distance, and Kullback-Leibler divergence (KLD). The results show that (1) the proposed method not only has low dimensionality but also outperforms the bigram method, and (2) KLD has a clear advantage over Euclidean and cosine distance, with Ward hierarchical clustering on KLD distances reaching F1 values of 96% to 99%.
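As a rough illustration of the kind of comparison the abstract describes, the sketch below builds a low-dimensional punctuation profile for a text and computes the three distances between two profiles. It is not the authors' implementation: the mark inventory `PUNCTUATION`, the smoothing constant, the function names, and the symmetrised form of the Kullback-Leibler divergence are all assumptions made for this example, since the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical mark inventory; the abstract does not list the marks actually used.
PUNCTUATION = [",", ".", ";", ":", "?", "!"]


def punctuation_profile(text, marks=PUNCTUATION, smoothing=1e-6):
    """Relative frequency of each mark, smoothed so that no entry is zero.

    The smoothing is an assumption; it keeps the KL divergence finite
    when a text never uses one of the marks.
    """
    counts = Counter(ch for ch in text if ch in marks)
    total = sum(counts.values())
    raw = [(counts[m] / total if total else 0.0) + smoothing for m in marks]
    norm = sum(raw)
    return [v / norm for v in raw]


def euclidean(p, q):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    """One minus the cosine similarity of two profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def symmetric_kld(p, q):
    """Symmetrised KL divergence: 0.5 * (KL(p||q) + KL(q||p)).

    Plain KL is asymmetric; symmetrising it is one common way to obtain
    a distance-like quantity, assumed here for illustration.
    """
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


if __name__ == "__main__":
    a = punctuation_profile("Well, yes. Perhaps; but then again, no.")
    b = punctuation_profile("What? Why? How! Really?")
    print(euclidean(a, b), cosine_distance(a, b), symmetric_kld(a, b))
```

Pairwise distances computed this way over a whole corpus form a distance matrix that could then be passed to a hierarchical clustering routine. Ward linkage is formally defined for Euclidean distances, so applying it to KLD values, as the abstract reports, is best read as a heuristic use of the linkage rule.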