Approximate Joins for XML at Label Level

Fei Li, Hongzhi Wang, Liang Hao, Jianzhong Li, Hong Gao
DOI: https://doi.org/10.1016/j.ins.2014.06.007
IF: 8.1
2014-01-01
Information Sciences
Abstract: In heterogeneous XML data sources, the same real-world object may not be represented identically. Approximate join techniques are therefore often applied, in which XML documents are joined based on similarity. Previous XML join methods treat each XML label as an atomic unit and entirely disregard the similarity between different labels. However, real-world data sets are often 'dirty', so labels should also be matched approximately in the join. To improve join quality, our approach considers both XML structure and node label similarity by applying two tailored similarity measures. Min-hash, a probabilistic hashing technique, is employed to achieve scalability. Extensive experiments confirm that join quality is fundamentally improved when label similarity is considered, and that our join efficiency is even higher than that of some of the most efficient existing methods.
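The abstract names min-hash but does not describe how it is applied, so the following is only a minimal, generic sketch of the technique, not the authors' method. It assumes that labels are compared as sets of character q-grams and that similarity is measured by Jaccard similarity; the function names (`qgrams`, `minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the choice of q = 2, and the use of seeded BLAKE2b hashes are all illustrative assumptions.

```python
import hashlib


def qgrams(label, q=2):
    """Split a label into a set of character q-grams (assumed tokenization)."""
    text = label.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def _seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token; each seed acts as one hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """For each hash function, keep the minimum hash value over the token set."""
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    a, b = qgrams("author"), qgrams("authors")
    print("exact:", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The point of the signature is that two fixed-length lists can be compared in place of two arbitrarily large sets, which is the usual reason min-hash helps with scalability. More hash functions reduce the variance of the estimate at the cost of longer signatures.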