Efficient Snapshot Knn Join Processing for Large Data Using Mapreduce

Yupeng Hu,Chong Yang,Cun Ji,Yang Xu,Xueqing Li
DOI: https://doi.org/10.1109/icpads.2016.0098
2016-01-01
Abstract:The kNN join problem, denoted by R x(KNN) S, is to find the k nearest neighbors from a given dataset S for each point in the query set R. It is an operation required by many big data applications. As large volume of data are continuously generated in more and more real-life cases, we address the problem of monitoring kNN join results on data streams. Specifically, we are concerned with answering kNN join periodically at each snapshot which is called snapshot kNN join. Existing kNN join solutions mainly solve the problem on static datasets, or on a single centralized machine, which are difficult to scale to large data on data streams. In this paper, we propose to incrementally calculate the kNN join results of time t(i) from the results of snapshot t(i-1). Typically, for the data continuously generated on the data stream, we can get S-i = Si-1 + Delta S-i for the valid datasets of adjacent snapshots, where Delta S-i denotes the updated points between time t(i -1) and t(i). Our basic idea is to first find the queries in R whose kNN results can be affected by the updated points in Delta S-i, and then update the kNN results of these small part of queries respectively. In this way, we can avoid calculating the kNN join results on the whole dataset S-i in time t(i). We propose an implementation of searching for affected query points in MapReduce to scale to large volume of data. In brief, the mappers partition the datasets into groups, and the reducers search for affected queries separately on each group of points. Furthermore, we present the enhanced strategies of data partitioning and grouping to reduce the shuffling cost and computational cost. Extensive experiments on real-world datasets demonstrate that our proposed methods are efficient, robust, and scalable.
What problem does this paper attempt to address?