Abstract: With the continuous development of the Internet and information technology, more and more mobile terminals, wearable devices, and similar equipment contribute to a tremendous volume of data. Thanks to distributed computing, we can analyze big data at quite high speed. However, many kinds of big data share an obvious characteristic: the datasets grow incrementally over time, which means that distributed computing should focus on incremental processing. A number of systems for incremental data processing are available, such as Google's Percolator and Yahoo's CBP. However, to use these mature frameworks, one must make troublesome changes to existing programs to meet the requirements of the environment. In this paper, we introduce a MapReduce framework, named Hadlnc, for efficient incremental computations. Hadlnc is designed for offline scenarios, in which real-time processing is unnecessary and in-memory cluster computing is not applicable. Hadlnc takes advantage of finer-grained computing and Content-Defined Chunking (CDC) to ensure that the system can still reuse previously computed results even if the split data has changed substantially. Instead of re-computing the changed data entirely, Hadlnc quickly finds the difference between the new split and the old one, and then merges the delta and the old results into the latest result for the new dataset. The stability of the dataset division is a key factor in reusing results. To guarantee this stability, we propose a series of novel algorithms based on CDC. We implemented Hadlnc by extending the Hadoop framework and evaluated it with many experiments, including three specific cases and a practical case. The comparison results show that the proposed Hadlnc is very efficient. (C) 2017 Elsevier B.V. All rights reserved.
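
To illustrate why content-defined chunking keeps a dataset's division stable under edits, the following is a minimal Python sketch, not the algorithms proposed in the paper. It assumes a simple polynomial rolling hash, and the names `cdc_chunks` and `incremental_byte_counts` as well as the constants `WINDOW`, `MASK`, `BASE`, and `MOD` are hypothetical. A byte histogram stands in for the per-split map work: results are cached per chunk digest, so only chunks whose content changed are recomputed, and the new and cached results are merged.

```python
import hashlib
import random
from collections import Counter

# All constants below are illustrative assumptions, not values from the paper.
WINDOW = 16           # number of trailing bytes covered by the rolling hash
MASK = (1 << 6) - 1   # cut when the low 6 bits are zero (about 64-byte chunks)
BASE = 257
MOD = (1 << 61) - 1


def cdc_chunks(data: bytes) -> list:
    """Split data at boundaries chosen by a rolling hash of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a hash boundary
    return chunks


def incremental_byte_counts(data: bytes, cache: dict):
    """Return (byte histogram, number of chunks served from the cache)."""
    total, reused = Counter(), 0
    for chunk in cdc_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key in cache:
            reused += 1                  # old result reused as-is
        else:
            cache[key] = Counter(chunk)  # "map" work only for new content
        total += cache[key]              # merge partial results
    return total, reused


rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(20000))
v2 = v1[:5000] + b"a few inserted bytes" + v1[5000:]  # edit in the middle

cache = {}
total1, reused1 = incremental_byte_counts(v1, cache)
total2, reused2 = incremental_byte_counts(v2, cache)
print(len(cdc_chunks(v2)), "chunks in v2,", reused2, "reused from the v1 run")
```

Because each boundary depends only on the bytes inside the hash window, an insertion shifts offsets but disturbs only the chunks near the edit; with fixed-size splits, every split after the insertion point would change and nothing downstream could be reused.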

Efficient Snapshot Knn Join Processing for Large Data Using MapReduce

Join Query Optimization Based on MapReduce under Skewed Data

Efficient Processing of k Nearest Neighbor Joins using MapReduce

Solutions for Processing K Nearest Neighbor Joins for Massive Data on MapReduce

Efficient Multi-dimensional Spatial RkNN Query Processing with MapReduce

Inverted Voronoi-Based Knn Query Processing With MapReduce

Distributed Spatio-Temporal K Nearest Neighbors Join

A Novel Knn Join Algorithms Based on Hilbert R-Tree in MapReduce

Efficient Finer-Grained Incremental Processing with MapReduce for Big Data

Efficient index-based KNN join processing for high-dimensional data

Irregular Partitioning Method Based K-Nearest Neighbor Query Algorithm Using MapReduce

Scalable Distributed Knn Processing on Clustered Data Streams

MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

An Efficient Theta-Join Query Processing in Distributed Environment

High-dimensional kNN joins with incremental updates

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Efficiently Processing Snapshot and Continuous Reverse K Nearest Neighbors Queries

Efficiently Processing Continuous K-Nn Queries On Data Streams

An Efficient K-Means Clustering Algorithm On MapReduce

Cloud-assisted spatio-textual k nearest neighbor joins in sensor networks

An Efficient Top-k Spatial Join Query Processing Algorithm on Big Spatial Data