Research on document similarity computing based on multi-grams of context

Feng YU, De-quan ZHENG, Tie-jun ZHAO, Sheng LI
DOI: https://doi.org/10.3969/j.issn.1006-7043.2006.z1.084
2006-01-01
Abstract: This paper presents a novel method for computing document similarity based on multi-grams of context. First, the feature information shared by a document pair is extracted. Next, the usage of the co-occurring features is characterized in terms of speech, semantic, and location context and weighted average co-occurrence probability, and is expressed as a similarity function. Finally, a similarity evaluation value is calculated for each document. This value plays an important role in judging the degree of document similarity. The method is evaluated on the Chinese document set from the NTCIR-3 workshop collection, where it achieves average precision increases of 15.45%-18.49% at the top 10 ranked documents and 11.96%-15.35% at the top 100 ranked documents. In a second group of experiments on the same Web information, the average F1-measure of textual similarity is above 95%.
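As a rough illustration of the first and last steps described in the abstract (collecting features shared by a document pair and turning them into a single similarity value), the following is a minimal sketch and not the authors' method. It assumes character n-grams as the features, a Dice overlap per n-gram size, and weights that grow with n-gram length; all of these choices, and the names `char_ngrams` and `multigram_similarity`, are hypothetical. The speech, semantic, and location context used in the paper is not modeled here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def multigram_similarity(doc_a, doc_b, sizes=(1, 2, 3), weights=None):
    """Weighted average of per-size Dice overlaps between two documents.

    Assumption: longer n-grams get a larger weight (weight = n) unless
    explicit weights are supplied. This is illustrative only.
    """
    if weights is None:
        weights = {n: float(n) for n in sizes}
    total_weight = sum(weights[n] for n in sizes)
    if total_weight == 0:
        return 0.0

    score = 0.0
    for n in sizes:
        grams_a = char_ngrams(doc_a, n)
        grams_b = char_ngrams(doc_b, n)
        # Counter intersection keeps the minimum count of each shared n-gram.
        shared = sum((grams_a & grams_b).values())
        total = sum(grams_a.values()) + sum(grams_b.values())
        dice = 2 * shared / total if total else 0.0
        score += weights[n] * dice
    return score / total_weight
```

Under these assumptions, identical documents score 1.0, documents with no shared characters score 0.0, and partially overlapping documents fall in between. Character n-grams are used here only because they need no word segmentation, which is convenient for Chinese text.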