Abstract: With the prevalence of Internet access and user-generated content, documents and records such as news articles and web pages are being generated continuously and at an unprecedented rate. In this paper, we study the problem of efficient streaming set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks such as online near-duplicate detection. In contrast to the prefix-based distribution strategy that is widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework that dispatches incoming records by their length. We develop a load-aware length partition method that estimates the local join cost of each partition in order to find a partition with good load balance. Our length-based scheme outperforms its competitors by a surprising margin because it requires no replication, incurs small communication cost, and achieves high throughput. We further observe that the join results of the current incoming record can guide index construction, which in turn speeds up the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm that groups similar records on the fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique that verifies a batch of records together, using their token differences to share verification cost rather than verifying each record individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, suggest that our methods can achieve up to an order of magnitude higher throughput than the baselines.
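The length-based dispatch idea can be illustrated with a minimal single-process sketch. It is not the paper's implementation: it assumes Jaccard similarity (the abstract does not name the similarity function), uses fixed length boundaries instead of the load-aware partition, and simulates workers as in-memory lists. All function and parameter names (`length_bounds`, `worker_for`, `join_stream`, `boundaries`) are hypothetical. The sketch relies on the standard length filter: two sets can reach Jaccard threshold t only if the shorter one has at least t times the length of the longer one.

```python
import math
from bisect import bisect_left


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def length_bounds(length, threshold):
    """Range of set sizes that can still reach `threshold` with a set of `length` tokens."""
    eps = 1e-9  # guards against floating-point noise around integer values
    low = math.ceil(threshold * length - eps)
    high = math.floor(length / threshold + eps)
    return low, high


def worker_for(length, boundaries):
    """Worker owning `length`; boundaries[i] is the largest length owned by worker i."""
    return bisect_left(boundaries, length)


def join_stream(records, threshold, boundaries):
    """Join each arriving record against earlier ones; return (earlier_id, current_id) pairs."""
    # One in-memory "index" per simulated worker (assumption: no real distribution here).
    workers = [[] for _ in range(len(boundaries) + 1)]
    results = []
    for rid, tokens in enumerate(records):
        size = len(set(tokens))
        low, high = length_bounds(size, threshold)
        # Probe only the workers whose length ranges overlap [low, high].
        first, last = worker_for(low, boundaries), worker_for(high, boundaries)
        for w in range(first, last + 1):
            for other_id, other_tokens in workers[w]:
                other_size = len(set(other_tokens))
                if low <= other_size <= high and jaccard(tokens, other_tokens) >= threshold:
                    results.append((other_id, rid))
        # Store the record exactly once, at the worker owning its length (no replication).
        workers[worker_for(size, boundaries)].append((rid, tokens))
    return results


if __name__ == "__main__":
    stream = [
        ["a", "b", "c", "d"],
        ["a", "b", "c", "d", "e"],
        ["x", "y"],
    ]
    print(join_stream(stream, threshold=0.8, boundaries=[4, 8]))
```

In this sketch each record is stored at exactly one worker, which mirrors the "no replication" property claimed above; the cost is that a probe may have to visit several workers when the admissible length range spans a boundary.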

Continuous similarity join on data streams

Improved LSH-driven String Similarity Join Filtering-Verification Framework

Mining Fuzzy Association Rules in Data Streams

Algorithm Based on Sliding Window for Similarity Queries over Data Stream

Distributed Streaming Set Similarity Join

Continuous similarity join over geo-textual data streams

Fast Similarity Matching on Data Stream with Noise

Estimating Similarity over Data Streams Based on Dynamic Time Warping

Continuous similarity search for evolving queries

EMD-DSJoin: Efficient Similarity Join Over Probabilistic Data Streams Based on Earth Mover’s Distance

Dynamic Set Similarity Join: an Update Log Based Approach

Technology of Continuous Query Optimization over Data Streams

Efficient and Scalable Processing of String Similarity Join

Verification method for string similarity joins based on bi-directional filtering

Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

Large-Scale Similarity Join With Edit-Distance Constraints

A Partition-Based Bi-directional Filtering Method for String Similarity JOINs

Similarity Match over High Speed Time-Series Streams

PASS-JOIN: A Partition-based Method for Similarity Joins

On Generalized Bisimilarity Join

Intelligent Similarity Joins for Big Data Integration