Abstract: Many applications of complex event processing (CEP) in the Cloud can tolerate some analytical error, which gives us an opportunity to optimize real-time analytics using approximate query processing over big data streams. In this article, we present a novel rules-based sampling technique that constructs a sketch over one-pass, high-speed asynchronous data streams and provides accurate answers for different types of analytical queries. Moreover, we propose two distributed sketching implementations, D-AQP$_b$ and D-AQP$_i$, which make our approach compatible with batch processing and interactive processing architectures, respectively, and suitable for stream processing systems in the Cloud. Experimental results with real-world and synthetic datasets indicate that our approach obtains more accurate estimates and improves system throughput by a factor of two compared with BlinkDB, a state-of-the-art Hadoop-based approximate query engine.
Compared with the batch processing system Spark and the stream processing system Spark-Streaming, D-AQP$_b$ and D-AQP$_i$ achieve improvements of 2 and 4 orders of magnitude in query response time, respectively.
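
To make the idea of answering analytical queries from a sample built in one pass more concrete, here is a minimal sketch in Python. It uses plain uniform reservoir sampling (Algorithm R), which is a standard technique and not the rules-based sampling method proposed in the article. The class name `ReservoirSketch` and its methods are hypothetical and chosen only for illustration.

```python
import random


class ReservoirSketch:
    """Fixed-size uniform sample over a one-pass stream (Algorithm R).

    Illustrative only: this is generic reservoir sampling, not the
    rules-based sampling technique described in the abstract.
    """

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.sample = []   # retained items, at most `capacity` of them
        self.seen = 0      # number of stream items observed so far
        self._rng = random.Random(seed)

    def offer(self, value):
        """Process one stream item; each item is examined exactly once."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
        else:
            # Keep the new item with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = value

    def estimate_mean(self):
        """Approximate AVG over the whole stream from the sample."""
        if not self.sample:
            raise ValueError("no items observed yet")
        return sum(self.sample) / len(self.sample)

    def estimate_sum(self):
        """Approximate SUM by scaling the sample mean to the stream size."""
        return self.estimate_mean() * self.seen

    def estimate_count(self, predicate):
        """Approximate COUNT of items satisfying `predicate`."""
        if not self.sample:
            return 0.0
        matching = sum(1 for v in self.sample if predicate(v))
        return matching / len(self.sample) * self.seen


if __name__ == "__main__":
    sketch = ReservoirSketch(capacity=1000, seed=42)
    for x in range(100_000):
        sketch.offer(x)
    print("approximate mean:", sketch.estimate_mean())
    print("approximate count of even values:",
          sketch.estimate_count(lambda v: v % 2 == 0))
```

The sample uses constant memory regardless of stream length, and the estimates become exact whenever the stream is shorter than the reservoir capacity. A distributed variant would need a way to merge per-node samples, which this sketch does not attempt.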

Efficient Data Blocking and Skipping Framework Applying Heuristic Rules

A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment

LBFM: Multi-Dimensional Membership Index for Block-Level Data Skipping

Probabilistic Parallelisation of Blocking Non-Matched Records for Big Data

A Stack-Centric Processing Model for Iterative Processing

An Adaptive Data Partitioning Scheme For Accelerating Exploratory Spark Sql Queries

Comparative Analysis of Energy-Efficient Scheduling Algorithms for Big Data Applications.

Handling Data Skew at Reduce Stage in Spark by ReducePartition

Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

Unsupervised Blocking and Probabilistic Parallelisation for Record Matching of Distributed Big Data

Efficient Shuffle Management for DAG Computing Frameworks Based on the FRQ Model

OPTIMIZATION FOR SPARK MISSION PERFORMANCE BASED ON DATA CHARACTERISTICS

A Sketching Approach for Obtaining Real-time Statistics over Data Streams in Cloud

Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

Efficient shuffle management with SCache for DAG computing frameworks.

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

A Comparative Study of Data Skew in Hadoop

An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

Optimization of Spark Storage Solutions

Reducing Head-of-Line Blocking on Network in Hadoop Clusters

LIBRA: Lightweight Data Skew Mitigation in MapReduce