Abstract: A novel distributed weighted random sampling algorithm is proposed for distributed stream systems. Multi-level query processing is presented to handle complex tasks and queries in a distributed system. A synthetic processing topology strategy is provided to merge streams so that partially repeated computation over overlapping data is reduced. Early results with statistical estimates are continuously sent to users.

Interactive query processing aims to generate approximate results with minimal response time. However, it is difficult for a batch-oriented processing system to provide progressively more accurate results in a distributed environment. MapReduce Online extends the MapReduce framework to support online aggregation, but its processing speed cannot keep up with ongoing real-time data events. We deploy an online aggregation algorithm over S4, a scalable stream processing system inspired by the combined functionality of MapReduce and the Actor model. Our system uses the asynchronous message communication mechanism of the Actor model to support online aggregation, and it can process large-scale data streams with high concurrency and a short response time. We adopt a distributed weighted random sampling algorithm to correct for biased distribution between different streams. Furthermore, we develop a multi-level query processing topology to reduce overlapping processing across multiple queries. The system provides continuous window aggregation with a confidence interval and an error bound. We have implemented the system and evaluated it extensively on the TPC-H benchmark. The experiments demonstrate that it generates high-quality query results within a short response time and that it outperforms MapReduce Online on data streams.
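The idea of sending early results with a confidence interval can be illustrated with a minimal sketch. It is not the system's implementation: it assumes a single stream of independent samples, uses Welford's running-variance update, and derives the interval from the central limit theorem. The class name `RunningAverage` and its methods are hypothetical, introduced only for illustration.

```python
import math
from statistics import NormalDist


class RunningAverage:
    """Progressive AVG estimate with a CLT-based confidence interval.

    Illustrative only: assumes samples arrive in random order from one stream.
    """

    def __init__(self, confidence=0.95):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations (Welford)
        # Two-sided standard normal quantile for the requested confidence.
        self.z = NormalDist().inv_cdf(0.5 + confidence / 2.0)

    def update(self, x):
        """Fold one new sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def estimate(self):
        """Return (mean, half_width); half_width is inf until two samples exist."""
        if self.n < 2:
            return self.mean, math.inf
        variance = self.m2 / (self.n - 1)  # unbiased sample variance
        return self.mean, self.z * math.sqrt(variance / self.n)


if __name__ == "__main__":
    agg = RunningAverage()
    for i, value in enumerate([12.0, 15.5, 9.8, 14.1, 13.3, 11.7], start=1):
        agg.update(value)
        mean, half_width = agg.estimate()
        # An early result could be emitted to the user after every sample.
        print(f"after {i} samples: {mean:.2f} +/- {half_width:.2f}")
```

As more samples arrive, the half-width shrinks roughly in proportion to one over the square root of the sample count, which is what makes the result progressively more accurate.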

Distributed Backup Data Deduplication System Based on Data Routing

Research on Data Routing Strategy of Deduplication in Cloud Environment

Ss-Dedup: A High Throughput Stateful Data Routing Algorithm For Cluster Deduplication System

A Delayed Container Organization Approach to Improve Restore Speed for Deduplication Systems

QDFS: A Quality-Aware Distributed File Storage Service Based on HDFS

A Remote Data Backup System with Deduplication

Droplet: A Distributed Solution of Data Deduplication

Boafft: Distributed Deduplication for Big Data Storage in the Cloud

MassStore: A Low Bandwidth, High De-duplication Efficiency Network Backup System

Decentralized and Privacy Sensitive Data De-Duplication Framework for Convenient Big Data Management in Cloud Backup Systems

A Novel Optimization Method to Improve De-duplication Storage System Performance

Design and Performance Evaluation of Backup Routing Mechanism Based on DSR Routing Protocol

Heterogeneous Data Backup Against Early Warning Disasters in Geo-Distributed Data Center Networks

Progressive online aggregation in a distributed stream system

Router supported data regeneration protocols in distributed storage systems

A Data-Aware Remote Procedure Call Method for Big Data Systems

Router-supported data regeneration in distributed storage systems

Data Deduplication Techniques for Big Data Storage Systems

Adaptive Pipeline for Deduplication

Efficient Hybrid Inline and Out-of-line Deduplication for Backup Storage

PeerDedupe: Insights into the Peer-Assisted Sampling Deduplication