Abstract: A novel distributed weighted random sampling algorithm for distributed stream systems is proposed. Multi-level query processing is presented to handle complex tasks and queries in a distributed system. A synthetic processing topology strategy is provided to merge streams and avoid partially repeated computation on overlapping data. Early results with statistical estimates are continuously sent to users.

Interactive query processing aims at generating approximate results with minimum response time. However, it is difficult for a batch-oriented processing system to progressively provide increasingly accurate results in a distributed environment. MapReduce Online extends the MapReduce framework to support online aggregation, but its processing speed prevents it from keeping up with ongoing real-time data events. We deploy the online aggregation algorithm over S4, a scalable stream processing system inspired by the combined functionality of MapReduce and the Actor model. Our system applies the asynchronous message communication mechanism of the Actor model to support online aggregation, and it can process large-scale data streams with high concurrency and short response times. We adopt a distributed weighted random sampling algorithm to correct for biased distribution between different streams. Furthermore, a multi-level query processing topology is developed to reduce overlapping processing across multiple queries. Our system provides continuous window aggregation with a confidence interval and error bound. We have implemented our system and conducted extensive experiments on the TPC-H benchmark. The results demonstrate that our system generates high-quality query results within a short response time and that the approach outperforms MapReduce Online on data streams.
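
The idea of reporting an early aggregate together with a confidence interval, while compensating for uneven data shares across streams, can be illustrated with a minimal sketch. This is not the paper's algorithm: it swaps in a textbook stratified estimator on top of Welford's running mean and variance, and the names `RunningStats` and `weighted_estimate`, the per-stream `weights`, and the normal-approximation constant `z=1.96` (about 95% confidence) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance for the samples seen on one stream."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; 0.0 until two values have been seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def weighted_estimate(streams, weights, z=1.96):
    """Combine per-stream statistics into (estimate, half_width).

    weights[i] is the assumed share of the full data carried by stream i
    (the weights should sum to 1). The interval is estimate +/- half_width.
    """
    pairs = [(s, w) for s, w in zip(streams, weights) if s.n > 0]
    estimate = sum(w * s.mean for s, w in pairs)
    variance = sum(w * w * s.variance / s.n for s, w in pairs)
    return estimate, z * math.sqrt(variance)


# Two streams with different value ranges and different data shares.
a, b = RunningStats(), RunningStats()
for x in [10.0, 12.0, 14.0]:
    a.update(x)
for x in [100.0, 104.0]:
    b.update(x)

estimate, half_width = weighted_estimate([a, b], [0.75, 0.25])
print(f"AVG ~ {estimate:.2f} +/- {half_width:.2f}")
```

Calling `update` as tuples arrive and re-running `weighted_estimate` yields a progressively tighter interval, which is the behaviour online aggregation aims for. A windowed variant would keep one `RunningStats` per window and stream.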

Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark

Sparksw: Scalable Distributed Computing System For Large-Scale Biological Sequence Alignment

SCAN: A Smart Application Platform for Empowering Parallelizations of Big Genomic Data Analysis in Clouds

DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions

Distributed High-Dimension Matrix Operation Optimization on Spark

GPU Accelerated Biological Sequence Alignment

Accelerating Biological Sequence Alignment Algorithm on GPU with CUDA

Gene Sequence Alignment on a Public Computing Platform

Smith-Waterman Algorithm Based on SSE2

Distributed Gene Clinical Decision Support System Based on Cloud Computing

Progressive online aggregation in a distributed stream system

Bwasw-Cloud: Efficient sequence alignment algorithm for two big data with MapReduce

Characterization of Smith-Waterman Sequence Database Search in X10

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Implementation of the Smith-Waterman Algorithm on a Reconfigurable Supercomputing Platform

Hardware Acceleration for the Banded Smith-Waterman Algorithm with the Cycled Systolic Array

CUDASW++4.0: ultra-fast GPU-based Smith–Waterman protein sequence database search

Fast and Exact Sequence Alignment with the Smith-Waterman Algorithm: The SwissAlign Webserver

Efficient String Similarity Join in Multi-Core and Distributed Systems

Distributed Sequence Alignment Applications for the Public Computing Architecture

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark