An Efficient Structure Similarity Measure Method For Xml Documents Based On Vector Space Model

Hongcan Yan, Minqiang Li, Dianchuan Jin, Dazhuo Zhou, Shaohong Yan
2008-01-01
Abstract: A novel similarity measure for the structure of XML documents, based on a frequency structured vector model, is proposed to address the defects of existing methods. In this model, all frequent subtrees of the documents are treated as the structural feature space; expressions for the document structure vector and its weight function are derived, and the cosine of the angle between two vectors is used to measure the similarity of the two documents. In addition, the TreeMiner algorithm is modified, in both its data structure and its mining process, to improve the efficiency of mining frequent subtrees in a forest; the modified algorithm is called TreeMiner+. The experimental results show that this method achieves very high precision and accuracy, and that the time cost of TreeMiner+ is reduced by a factor of three when the minimum support is 70% or higher.
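The cosine measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes each document has already been turned into a vector with one weight per frequent subtree. The abstract does not give the paper's weight function, so the weights below, the function name cosine_similarity, and the three sample documents are hypothetical and serve only to show the computation.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch (not taken from the paper):
        # a document containing no frequent subtree is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature space of three frequent subtrees; each entry is an
# illustrative weight, not a value produced by the paper's weight function.
doc_a = [2.0, 1.0, 0.0]
doc_b = [1.0, 1.0, 0.0]
doc_c = [0.0, 0.0, 3.0]

print(cosine_similarity(doc_a, doc_b))  # shares two subtrees with doc_a
print(cosine_similarity(doc_a, doc_c))  # shares no subtree with doc_a
```

Under these assumptions, documents that share frequent subtrees in similar proportions score close to 1, and documents with no frequent subtree in common score 0.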