Abstract: A novel distributed weighted random sampling algorithm for distributed stream systems is proposed. Multi-level query processing is presented to handle complex tasks and queries in a distributed system. A synthetic processing topology strategy merges streams to avoid partially repeated computation over overlapping data. Early results with statistical estimates are continuously sent to users.

Interactive query processing aims to generate approximate results with minimal response time. However, it is difficult for a batch-oriented processing system to progressively provide increasingly accurate results in a distributed environment. MapReduce Online extends the MapReduce framework to support online aggregation, but its processing speed limits its ability to keep up with ongoing real-time data events. We deploy an online aggregation algorithm over S4, a scalable stream processing system inspired by the combined functionality of MapReduce and the Actor model. Our system uses the asynchronous message-passing mechanism of the Actor model to support online aggregation, so it can process large-scale data streams with high concurrency and short response times. We adopt a distributed weighted random sampling algorithm to correct for biased distribution between different streams. Furthermore, we develop a multi-level query processing topology to reduce overlapping processing across multiple queries. Our system provides continuous window aggregation with a confidence interval and error bound. We have implemented the system and evaluated it extensively on the TPC-H benchmark. The experiments demonstrate that the system generates high-quality query results within a short response time and outperforms MapReduce Online on data streams.
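To make the idea of progressively refined results with a confidence interval concrete, here is a minimal Python sketch. It is not the authors' estimator: it assumes the classical approach of a running mean with a normal-approximation (CLT) interval, maintained with Welford's update. It also assumes tuples arrive as an unweighted random sample. The class name `RunningEstimate` and its parameter `z` are hypothetical, introduced only for illustration.

```python
import math


class RunningEstimate:
    """Running mean with a CLT-based confidence interval (illustrative only)."""

    def __init__(self, z: float = 1.96):
        self.z = z        # normal quantile; 1.96 is roughly 95% confidence
        self.n = 0        # number of sampled tuples seen so far
        self.mean = 0.0   # current estimate of the average
        self._m2 = 0.0    # sum of squared deviations from the current mean

    def add(self, x: float) -> None:
        """Fold one sampled value into the estimate (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def half_width(self) -> float:
        """Half-width of the interval; infinite until two values are seen."""
        if self.n < 2:
            return math.inf
        variance = self._m2 / (self.n - 1)   # unbiased sample variance
        return self.z * math.sqrt(variance / self.n)

    def interval(self) -> tuple:
        """Current (low, high) bounds around the running mean."""
        h = self.half_width()
        return (self.mean - h, self.mean + h)


if __name__ == "__main__":
    est = RunningEstimate()
    for value in [12.0, 15.5, 9.0, 14.2, 11.8, 13.1]:
        est.add(value)
        # An early result could be emitted to the user after every update.
        print(est.n, round(est.mean, 3), est.interval())
```

Under these assumptions the interval tends to narrow as more tuples arrive, which is what lets a system report an early answer and keep tightening it. A weighted sampling scheme, as the abstract describes, would need a correspondingly weighted estimator.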

Fast Greedy Algorithms in MapReduce and Streaming

GreedyML: A Parallel Algorithm for Maximizing Submodular Functions

Greedy Column Subset Selection: New Bounds and Distributed Algorithms

Progressive online aggregation in a distributed stream system

Greed is Good: Near-Optimal Submodular Maximization via Greedy Optimization

A Greedy Algorithm for Optimally Pipelining a Reduction

Parallelizing greedy for submodular set function maximization in matroids and beyond

A New Framework for Distributed Submodular Maximization

Improved Deterministic Streaming Algorithms for Non-monotone Submodular Maximization

Distributed Algorithms for Composite Optimization: Unified Framework and Convergence Analysis

GreediRIS: Scalable Influence Maximization using Distributed Streaming Maximum Cover

Fast Clustering using MapReduce

Streaming Algorithms for Maximizing k-Submodular Functions with the Multi-knapsack Constraint

Efficient Principal Subspace Projection of Streaming Data Through Fast Similarity Matching

Submodular Optimization in the MapReduce Model

Online Learning via Offline Greedy Algorithms: Applications in Market Design and Optimization

A General Framework for Privacy-Preserving Distributed Greedy Algorithm

Lazier Than Lazy Greedy

Greed Works -- Online Algorithms For Unrelated Machine Stochastic Scheduling

When greedy gives optimal: A unified approach

Semi-streaming algorithms for submodular matroid intersection