BigSet: an Efficient Set Intersection Approach
Shiding Zhang, Jianye Yang, Wenjie Zhang, Shiyu Yang, Ying Zhang, Xuemin Lin
DOI: https://doi.org/10.1109/tkde.2024.3432595
IF: 9.235
2024-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Set intersection is a fundamental operation in many applications, such as common neighbor computation in graph-based algorithms, set similarity computation, and item recommendation. Many set intersection methods have been proposed in the literature. We observe that the state-of-the-art algorithm ${\sf RCode}$ has several limitations: high indexing time complexity, inefficiency on large sets, and poor suitability for generic set intersection. In this paper, we introduce the Bucket Signature for Set (${\sf BigSet}$), an efficient generic set intersection algorithm. ${\sf BigSet}$ consists of two phases: a preprocessing phase and a query phase. In the preprocessing phase, ${\sf BigSet}$ partitions the elements of a record into $O(2^{k})$ buckets and uses a bitmap to indicate the status of the buckets, where $n$ is the record length and $k$ is the number of bits in the signature. In the query phase, ${\sf BigSet}$ computes the result using a candidate generation-and-verification framework. Specifically, the candidate elements are identified as those that fall into the same buckets. Then, for each such bucket, ${\sf BigSet}$ collects the common elements using a merge-based method. To improve performance, we introduce two optimizations: bucket sharing and size-aware signature construction. We conduct experiments on 10 real graph datasets and 5 real generic set datasets to evaluate the performance of our proposals. The experimental results show that ${\sf BigSet}$ is 20× faster than the leading generic set intersection algorithms. It also outperforms ${\sf RCode}$ with a 5× speedup while using up to 8× less memory.
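The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the modulo hash, the use of a Python integer as the bitmap, and the names `build_index`, `merge_intersect`, and `intersect` are all assumptions made for illustration, and neither bucket sharing nor size-aware signature construction is modeled.

```python
from typing import Dict, List, Tuple

Index = Tuple[int, Dict[int, List[int]]]


def build_index(record: List[int], num_buckets: int) -> Index:
    """Preprocessing: hash elements into buckets and record non-empty buckets in a bitmap.

    Assumption: elements are integers and the bucket is chosen by `x % num_buckets`.
    """
    buckets: Dict[int, List[int]] = {}
    for x in sorted(set(record)):  # sorted order keeps every bucket sorted
        buckets.setdefault(x % num_buckets, []).append(x)
    bitmap = 0
    for b in buckets:
        bitmap |= 1 << b  # bit b is set iff bucket b is non-empty
    return bitmap, buckets


def merge_intersect(a: List[int], b: List[int]) -> List[int]:
    """Verification: classic two-pointer merge over two sorted lists."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect(index_a: Index, index_b: Index) -> List[int]:
    """Query: AND the bitmaps to find candidate buckets, then merge bucket by bucket."""
    bitmap_a, buckets_a = index_a
    bitmap_b, buckets_b = index_b
    common = bitmap_a & bitmap_b  # buckets that are non-empty in both records
    result: List[int] = []
    b = 0
    while common:
        if common & 1:
            result.extend(merge_intersect(buckets_a[b], buckets_b[b]))
        common >>= 1
        b += 1
    return sorted(result)


if __name__ == "__main__":
    ia = build_index([1, 5, 9, 12, 20, 33], num_buckets=8)
    ib = build_index([5, 7, 12, 33, 40], num_buckets=8)
    print(intersect(ia, ib))
```

In this sketch, elements whose buckets are empty in the other record are never compared, which is the intuition behind the candidate generation step; the choice of bucket count and hash function here is arbitrary.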