Abstract: The distributed weighted random sampling algorithm for distributed stream systems is novel. Multi-level query processing is presented to handle complex tasks and queries in a distributed system. A synthetic processing topology strategy is provided to merge streams and avoid partially repeated computation on overlapping data. Early results with statistical estimates are continuously sent to users.

Interactive query processing aims to generate approximate results with minimal response time. However, it is difficult for a batch-oriented processing system to progressively deliver increasingly accurate results in a distributed environment. MapReduce Online extends the MapReduce framework to support online aggregation, but its processing speed limits its ability to keep up with ongoing real-time data events. We deploy the online aggregation algorithm on S4, a scalable stream processing system inspired by the combined functionality of MapReduce and the Actor model. Our system uses the asynchronous message communication mechanism of the Actor model to support online aggregation, and it can process large-scale data streams with high concurrency and a short response time. We adopt a distributed weighted random sampling algorithm to correct for biased distributions across different streams. Furthermore, we develop a multi-level query processing topology to reduce overlapping processing across multiple queries. Our system provides continuous window aggregation with a confidence interval and an error bound. We have implemented the system and evaluated it extensively on the TPC-H benchmark. The experiments demonstrate that our system generates high-quality query results within a short response time and that the approach outperforms MapReduce Online on data streams.
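As a rough illustration of what "early results with a confidence interval" can look like, the following minimal sketch maintains a running average over a stream and reports a normal-approximation interval that narrows as more tuples arrive. It is not the estimator used in the paper and does not involve S4: the class name `RunningAverage`, the z-value of 1.96 (roughly 95% confidence), and the synthetic Gaussian stream in `demo` are all assumptions made for this example.

```python
import math
import random


class RunningAverage:
    """Tracks a running mean and a normal-approximation confidence interval.

    Hypothetical helper for illustration only; not taken from the paper.
    """

    def __init__(self, z: float = 1.96):
        self.z = z        # assumed z-value, roughly 95% confidence
        self.n = 0        # number of tuples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations (Welford's method)

    def update(self, x: float) -> None:
        """Fold one new value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self) -> float:
        """Half-width of the interval: z * sqrt(sample_variance / n)."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return self.z * math.sqrt(variance / self.n)

    def estimate(self):
        """Return (mean, lower bound, upper bound) for the current prefix."""
        h = self.half_width()
        return self.mean, self.mean - h, self.mean + h


def demo(seed: int = 0, total: int = 10000, report_every: int = 2000):
    """Feed a synthetic stream and record the interval half-width periodically."""
    rng = random.Random(seed)
    agg = RunningAverage()
    widths = []
    for i in range(1, total + 1):
        agg.update(rng.gauss(100.0, 15.0))  # synthetic values, assumed distribution
        if i % report_every == 0:
            widths.append(agg.half_width())  # an "early result" a user could see
    return agg, widths
```

Calling `demo()` returns the final aggregator and the list of half-widths recorded at each reporting point. The half-width shrinks roughly with the square root of the number of tuples processed, which is the sense in which results become progressively more accurate.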

DS²: Handling Data Skew Using Data Stealings over High-Speed Networks

A Request Skew Aware Heterogeneous Distributed Storage System Based on Cassandra

Distributed High-Dimension Matrix Operation Optimization on Spark

Join Query Optimization Based on MapReduce under Skewed Data

A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment

Optimization for Spark Mission Performance Based on Data Characteristics

Skewed Data Distribution for Active Storage Systems on Hybrid Servers

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

SP-Partitioner: A novel partition method to handle intermediate data skew in spark streaming

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

SparkDQ: Efficient Generic Big Data Quality Management on Distributed Data-Parallel Computation

DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

TS-Hadoop: Handling Access Skew in MapReduce by Using Tiered Storage Infrastructure

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

Effective Data Distribution And Reallocation Strategies For Fast Query Response In Distributed Query-Intensive Data Environments

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

Asynchronous Complex Analytics in a Distributed Dataflow Architecture

Progressive online aggregation in a distributed stream system