Abstract: The distributed weighted random sampling algorithm for distributed stream systems is novel. Multi-level query processing is presented to handle complex tasks and queries in a distributed system. A synthetic processing topology strategy is provided to merge streams whose computation over overlapping data is partially repeated. Early results with statistical estimates are continuously sent to users.

Interactive query processing aims to generate approximate results with minimal response time. However, it is difficult for a batch-oriented processing system to progressively deliver increasingly accurate results in a distributed environment. MapReduce Online extends the MapReduce framework to support online aggregation, but its processing speed prevents it from keeping up with ongoing real-time data events. We deploy the online aggregation algorithm over S4, a scalable stream processing system inspired by the combined functionality of MapReduce and the Actor model. Our system applies the asynchronous message communication mechanism of the Actor model to support online aggregation, and it can process large-scale data streams with high concurrency and a short response time. We adopt a distributed weighted random sampling algorithm to address biased distributions across different streams. Furthermore, a multi-level query processing topology is developed to reduce overlapping processing across multiple queries. Our system provides continuous window aggregation with a confidence interval and an error bound. We have implemented the system and conducted extensive experiments over the TPC-H benchmark. The experiments demonstrate that the system generates high-quality query results within a short response time and that the approach outperforms MapReduce Online on data streams.
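To make the idea of progressively refined results concrete, the following is a minimal single-process sketch of online aggregation for an AVG query. It is not the authors' distributed algorithm and does not model S4, weighted sampling, or windows. It assumes the stream delivers values in random order, so that a normal-approximation confidence interval is meaningful. The names `RunningEstimate`, `progressive_avg`, and `report_every` are hypothetical and chosen only for illustration.

```python
import math


class RunningEstimate:
    """Tracks the mean and variance of the values seen so far (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def interval(self, z=1.96):
        """Return (mean, half_width) of an approximate confidence interval.

        Assumption: values arrive in random order, so the central limit
        theorem applies; z=1.96 corresponds to roughly 95% confidence.
        """
        if self.n < 2:
            return self.mean, float("inf")
        variance = self.m2 / (self.n - 1)  # unbiased sample variance
        return self.mean, z * math.sqrt(variance / self.n)


def progressive_avg(stream, report_every=1000):
    """Yield (rows_seen, estimate, half_width) every `report_every` rows."""
    est = RunningEstimate()
    for i, x in enumerate(stream, start=1):
        est.update(x)
        if i % report_every == 0:
            mean, half_width = est.interval()
            yield i, mean, half_width
```

Each yielded tuple is an early answer that a client could display immediately. The half-width shrinks roughly in proportion to the inverse square root of the number of rows seen, which is what lets a user stop the query once the error bound is tight enough.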

Unsupervised Blocking and Probabilistic Parallelisation for Record Matching of Distributed Big Data

Probabilistic Parallelisation of Blocking Non-Matched Records for Big Data

Distributed High-Dimension Matrix Operation Optimization on Spark

Leveraging unlabeled data to scale blocking for record linkage

A Distributed and Scalable Machine Learning Approach for Big Data

Design of PPU framework for processing ordered data blocks in the cluster environment

Progressive online aggregation in a distributed stream system

A practical approach for scalable record linkage on hadoop

Efficient Data Blocking and Skipping Framework Applying Heuristic Rules

An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

Distributed Mining of Frequent Patterns in Big Data by Hybrid Strategies

Superblock: An Application-Aware Dynamic Partition Strategy for Large-Scale Graph

A Novel Lightweight Middleware for Distributed Massive PMU Data Mining

Experimental Study on Block and Ratio Storage Strategy and Parallel Transmission Algorithms

PROBING PARALLEL TECHNIQUE-BASED STATISTICAL ANALYSIS FOR ENORMOUS DATA

A New Hybrid Approach for Privacy Preserving Distributed Data Mining

A Survey and Experimental Analysis of Distributed Subgraph Matching

Parallel Algorithms for Flexible Pattern Matching on Big Graphs

Distributed Subgraph Matching on Timely Dataflow

Distributed Subgraph Matching on Timely Dataflow [Experiments and Analyses]

Block Storage Optimization and Parallel Data Processing and Analysis of Product Big Data Based on the Hadoop Platform