Chinese Text Similarity Method Research by Combining Semantic Analysis with Statistics

HUA Xiu-li, ZHU Qiao-ming, LI Pei-feng
DOI: https://doi.org/10.3969/j.issn.1001-3695.2012.03.008
2012-01-01
Abstract: Statistical text similarity measures use TF-IDF to model documents as term-frequency vectors and compute the similarity between documents as the cosine similarity of those vectors. Because this approach ignores the semantic information in the documents, the resulting similarity values can be inaccurate. Semantics-based methods make up for this drawback, but they require knowledge to construct the relationships between words. After studying the advantages and disadvantages of both kinds of methods, this paper presents a novel text similarity method that first preprocesses the text, then selects the terms with higher TF-IDF values as feature items, and next uses a semantic dictionary together with TF-IDF to compute text similarity. Finally, several K-means clustering methods are used to evaluate the performance of the new similarity measure. Experimental results show that the method's F-measure is superior to that of the other methods, which indicates that the proposed method is effective.
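The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formula for combining the semantic dictionary with TF-IDF, so the sketch assumes a soft-cosine style combination. The names `tfidf_vectors`, `top_features`, `semantic_similarity`, `word_sim`, and the toy `SYNONYMS` table are all hypothetical, and the input is assumed to be already segmented into tokens (for Chinese text, a word-segmentation step would come first).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def top_features(vec, k):
    """Keep the k terms with the highest TF-IDF weight as feature items."""
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def cosine(a, b):
    """Plain cosine similarity: only identical terms contribute."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_similarity(a, b, word_sim):
    """Assumed combination: soft cosine, where each pair of terms
    contributes in proportion to word_sim(t1, t2) in [0, 1]."""
    def soft_dot(x, y):
        return sum(wx * wy * word_sim(tx, ty)
                   for tx, wx in x.items() for ty, wy in y.items())
    denom = math.sqrt(soft_dot(a, a) * soft_dot(b, b))
    return soft_dot(a, b) / denom if denom else 0.0

# Hypothetical stand-in for a semantic dictionary lookup.
SYNONYMS = {frozenset(("car", "automobile")): 0.9}

def word_sim(t1, t2):
    if t1 == t2:
        return 1.0
    return SYNONYMS.get(frozenset((t1, t2)), 0.0)

docs = [
    ["car", "fast", "road"],
    ["automobile", "fast", "highway"],
    ["apple", "fruit", "sweet"],
]
v0, v1, v2 = tfidf_vectors(docs)
plain = cosine(v0, v1)
semantic = semantic_similarity(v0, v1, word_sim)
```

In this toy corpus the first two documents share only the term "fast", so the plain cosine score is low, while the dictionary-aware score is higher because "car" and "automobile" are treated as near-synonyms. The resulting pairwise similarities could then be fed to a clustering step such as K-means for evaluation, as the abstract describes.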