Continuous similarity join on data streams

Jia Cui, Weiping Wang, Dan Meng, Zhenyan Liu
DOI: https://doi.org/10.1109/PADSW.2014.7097853
2014-01-01
Abstract: Similarity join plays an important role in many applications, such as data cleaning and integration, where it addresses the problem of poor data quality. Most existing studies focus on performing similarity join on static datasets; few address dynamic data streams. With the development of network technology, the data access paradigm has shifted from disk-oriented storage to online data streams, which makes similarity join as a continuous query over data streams a novel query processing paradigm. Unlike a static dataset, a data stream is unbounded, continuous, and unpredictable. These differences pose serious challenges, such as real-time query performance. To this end, we study the problem of continuous similarity join on data streams, based on the edit distance metric and a filter-and-verify framework with sliding-window semantics. We study two subcases of this problem: self similarity join on a single data stream, and similarity join on two streams. We introduce a sliding window model built on basic windows to facilitate updating the sliding window and its index. We also discuss the details of our method, including signature extraction schemes, filtering and verification algorithms, and re-evaluation strategies. Finally, extensive experimental results show that our method works efficiently on real data streams.
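To make the filter-and-verify idea concrete, the following is a minimal sketch of a self similarity join over a single stream with a sliding window. It is not the paper's algorithm: the abstract does not specify the signature scheme, the index, or the basic-window update, so this sketch assumes a count-based window, substitutes a simple length filter for signature-based filtering, and scans the window linearly. The names `edit_distance`, `SlidingWindowSelfJoin`, `window_size`, and `tau` are hypothetical.

```python
from collections import deque


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


class SlidingWindowSelfJoin:
    """Illustrative self join on one stream (assumed count-based window)."""

    def __init__(self, window_size, tau):
        self.window_size = window_size  # max strings kept, including the newest
        self.tau = tau                  # edit distance threshold
        self.window = deque()

    def insert(self, s):
        """Add s to the window and return the pairs (older, s) within tau."""
        # Expire the oldest string once the window is full.
        if len(self.window) == self.window_size:
            self.window.popleft()
        matches = []
        for t in self.window:
            # Filter step (stand-in for signature filtering): the edit
            # distance is never smaller than the length difference.
            if abs(len(s) - len(t)) > self.tau:
                continue
            # Verify step: compute the exact edit distance.
            if edit_distance(s, t) <= self.tau:
                matches.append((t, s))
        self.window.append(s)
        return matches
```

Calling `insert` once per arriving string reports each new result pair as it appears, which is the continuous-query behavior the abstract describes. A join on two streams could follow the same pattern, with one window per stream and each arrival probed against the other stream's window.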