Abstract: This paper studies set similarity join on massive probabilistic data using MapReduce, a problem for which no efficient approach currently exists. MapReduce is a popular paradigm for processing large volumes of data, and we propose two MapReduce-based approaches to this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probabilities to filter out, directly at the Map task side, probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning applies a probability-sum-based pruning principle and a probability-upper-bound-based pruning principle to reduce the number of candidate pairs at the Reduce task side, which saves comparison cost. Building on these approaches, we propose a hybrid solution that employs both Map-side and Reduce-side pruning. Finally, we implemented the approaches on Hadoop-0.20.2 and performed comprehensive experiments to evaluate their performance; we also measured the speedup ratio relative to the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and scale well. To the best of our knowledge, this is the first work to address set similarity join on massive probabilistic data using the MapReduce paradigm, and the proposed approaches provide a new way to process massive probabilistic data.
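The idea of pruning by probability sum can be illustrated with a small single-machine sketch. It is not the paper's algorithm: the abstract does not define the similarity measure, so the sketch assumes a hypothetical one, the expected overlap of two probabilistic sets with independent elements, and a hypothetical join condition of expected overlap >= threshold. Under that assumption, the expected overlap of a set with any partner can never exceed the sum of its own existence probabilities, so a set whose sum falls below the threshold can be dropped before any pair is formed. All function names are illustrative, and no Hadoop API is used.

```python
from itertools import combinations

# A probabilistic set is modelled as {element: existence probability}.
# ASSUMPTION: elements exist independently and similarity is the expected
# overlap; the paper's actual measure is not stated in the abstract.

def prob_sum(pset):
    """Sum of the existence probabilities of one probabilistic set."""
    return sum(pset.values())

def expected_overlap(a, b):
    """Expected size of the intersection of two probabilistic sets."""
    return sum(p * b[e] for e, p in a.items() if e in b)

def naive_join(psets, threshold):
    """Baseline: compare every pair, in the spirit of a nested loop join."""
    return [(ka, kb)
            for ka, kb in combinations(sorted(psets), 2)
            if expected_overlap(psets[ka], psets[kb]) >= threshold]

def map_side_prune(psets, threshold):
    """Drop sets that cannot reach the threshold with any partner.

    Each term p_a(e) * p_b(e) is at most p_a(e), so the expected overlap
    is bounded above by prob_sum(a).
    """
    return {k: s for k, s in psets.items() if prob_sum(s) >= threshold}

def pruned_join(psets, threshold):
    """Join only the sets that survive the probability-sum filter."""
    return naive_join(map_side_prune(psets, threshold), threshold)

psets = {
    "s1": {"a": 0.9, "b": 0.8, "c": 0.7},
    "s2": {"a": 0.8, "b": 0.9, "d": 0.5},
    "s3": {"a": 0.2, "e": 0.1},  # probability sum 0.3, pruned at threshold 1.0
}
result = pruned_join(psets, 1.0)
```

In this toy input, `s3` is discarded before any comparison, and the pruned join returns the same pairs as the naive join because the filter only removes sets that could not have matched.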

Large-Scale Similarity Join With Edit-Distance Constraints

Improved LSH-driven String Similarity Join Filtering-Verification Framework

Join Query Optimization Based on MapReduce under Skewed Data

Efficient Parallel Partition-Based Algorithms for Similarity Search and Join with Edit Distance Constraints

Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints

An Efficient MapReduce Algorithm for Similarity Join in Metric Spaces

Efficient and Scalable Graph Similarity Joins in MapReduce

Practising Scalable Graph Similarity Joins in MapReduce

Intelligent Similarity Joins for Big Data Integration

Efficient Graph Similarity Joins with Edit Distance Constraints

Pass-Join-K: Similarity Join Method Based on Multi-Match Partition

FrepJoin: an Efficient Partition-Based Algorithm for Edit Similarity Join

VChunkJoin: an Efficient Algorithm for Edit Similarity Joins

Efficient Similarity Join Based on Earth Mover’s Distance Using MapReduce

Similarity join on XML based on k-generation set distance

Efficient graph similarity join for information integration on graphs

PASS-JOIN: A Partition-based Method for Similarity Joins

BMGSJoin: A MapReduce Based Graph Similarity Join Algorithm

Set similarity join on massive probabilistic data using MapReduce

Efficient and Scalable Processing of String Similarity Join