Highlights: A novel distributed weighted random sampling algorithm is proposed for distributed stream systems. Multi-level query processing is presented to handle complex tasks and queries in a distributed system. A synthetic processing topology strategy merges streams whose computation partially repeats over overlapping data. Early results with statistical estimates are continuously sent to users.

Abstract: Interactive query processing aims to generate approximate results with minimal response time. However, it is difficult for a batch-oriented processing system to provide progressively more accurate results in a distributed environment. MapReduce Online extends the MapReduce framework to support online aggregation, but its processing speed limits its ability to keep up with ongoing real-time data events. We deploy an online aggregation algorithm over S4, a scalable stream processing system inspired by the combined functionality of MapReduce and the Actor model. Our system uses the asynchronous message communication mechanism of the Actor model to support online aggregation, and it can process large-scale data streams with high concurrency and a short response time. We adopt a distributed weighted random sampling algorithm to handle biased distributions across different streams. Furthermore, we develop a multi-level query processing topology to reduce overlapping processing across multiple queries. The system provides continuous window aggregation with a confidence interval and an error bound. We have implemented the system and evaluated it extensively on the TPC-H benchmark. The experiments show that it generates high-quality query results within a short response time and that it outperforms MapReduce Online on data streams.
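To make the idea of early results with a confidence interval concrete, the following is a minimal sketch of a progressive AVG estimator. It is not the system described in the abstract: it runs on a single process, assumes that values arrive in random order so that a normal approximation to the sampling error is reasonable, and uses hypothetical names (`RunningAverage`, `progressive_avg`, `report_every`) chosen only for illustration.

```python
import math
from statistics import NormalDist


class RunningAverage:
    """Running mean with a normal-approximation confidence interval.

    Illustrative only: assumes values arrive in random order.
    """

    def __init__(self, confidence=0.95):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        # Two-sided normal quantile, roughly 1.96 for 95% confidence.
        self.z = NormalDist().inv_cdf(0.5 + confidence / 2.0)

    def add(self, x):
        # Welford's update: numerically stable single-pass mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def estimate(self):
        """Return (mean, half_width); half_width is inf until two values are seen."""
        if self.n < 2:
            return self.mean, math.inf
        variance = self.m2 / (self.n - 1)  # unbiased sample variance
        return self.mean, self.z * math.sqrt(variance / self.n)


def progressive_avg(stream, report_every=1000):
    """Yield (count, mean, half_width) after every `report_every` values."""
    agg = RunningAverage()
    for i, x in enumerate(stream, start=1):
        agg.add(x)
        if i % report_every == 0:
            mean, half_width = agg.estimate()
            yield i, mean, half_width
```

Each tuple yielded by `progressive_avg` is an early answer of the form mean ± half_width. The interval narrows as more of the stream is consumed, which is the behaviour a user of an online aggregation system would observe.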

Research on Parallel Duplicated Webpage Deletion Based on MapReduce Model

Evaluating Large Graph Processing in MapReduce Based on Message Passing

The Performance of MapReduce: An In-Depth Study

The Performance of MapReduce

Parallel Approach and Platform for Large-Scale WEB Data Extraction

Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages

Performance Evaluation of the MapReduce-based Parallel Data Preprocessing Algorithm in Web Usage Mining with Robot Detection Approaches

Progressive Image Retrieval with Quality Guarantee under MapReduce Framework

Query Optimization for Massively Parallel Data Processing

Combination of In-Memory Graph Computation with MapReduce: A Subgraph-Centric Method of PageRank

Uncoupled MapReduce: A Balanced and Efficient Data Transfer Model

Progressive Online Aggregation in a Distributed Stream System

Duplicate Web Page Elimination Based on Bloom Filter

A Comparative Study on Parallel LDA Algorithms in MapReduce Framework

A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures

Research and Evaluation of Near-replicas of Web Pages Detection Algorithms

MapDupReducer: Detecting Near Duplicates over Massive Datasets

Optimizing Internal Overlaps by Self-Adjusting Resource Allocation in Multi-Stage Computing Systems

A Semantic++ MapReduce Parallel Programming Model