Abstract: SpeedStream is a general-purpose distributed platform for processing massive data flows, featuring low coupling, high availability, low latency, and high scalability. Focusing on the core technologies of real-time stream computing platforms in cloud environments, this paper presents the research behind the system and its implementation. First, to address the availability of the real-time stream computing platform, we design a high-availability framework based on Zookeeper. It provides timely fault detection and recovery at both the process level and the node level by monitoring the heartbeat of each module and applying a fault migration strategy. Second, to broaden the range of applications the platform supports, we use directed cycle detection and iteration protection to design a real-time stream computing model based on a directed graph with sources and sinks. The model satisfies the needs of common DAG computing services and also supports iterative computing services whose topologies include directed cycles, bidirectional arcs, and annular arcs. In addition, the platform supports user-specific task scheduling strategies by establishing a task allocation matrix and optimizing the task allocation model. Finally, to solve many-to-many dynamic load balancing between tasks, we apply a stateful scheduler and a distributed session table, which overcomes the difficulty of maintaining session consistency without a global session table. We also verify the convergence of this method. Experiments indicate that SpeedStream's throughput and data processing delay are superior to those of the alternatives compared for iterative applications, applications with high traffic fluctuation, and applications with demanding load-balancing requirements. The platform provides a reliable, general-purpose, real-time solution for processing massive data flows, such as processing real-time trading data in e-commerce, analyzing sensor data streams in the Internet of Things, and monitoring Internet traffic.
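The directed cycle detection mentioned in the abstract can be illustrated with a minimal sketch. This is not SpeedStream's actual algorithm, which the abstract does not specify; it is a standard depth-first search that reports the edges closing a directed cycle, under the assumption that a topology is given as an adjacency list. The function name `find_back_edges` and the node names are hypothetical. Iteration protection is not covered here.

```python
from typing import Dict, List, Tuple


def find_back_edges(graph: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return the edges that close a directed cycle, found by depth-first search.

    An empty result means the topology is a DAG. Self-loops and
    bidirectional arcs are reported like any other cycle-closing edge.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished

    # Collect every node, including sinks that only appear as targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    color = {node: WHITE for node in nodes}
    back_edges: List[Tuple[str, str]] = []

    def visit(node: str) -> None:
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # nxt is an ancestor on the current path, so this edge closes a cycle.
                back_edges.append((node, nxt))
            elif color[nxt] == WHITE:
                visit(nxt)
        color[node] = BLACK

    # Sorted order keeps the traversal deterministic.
    for node in sorted(nodes):
        if color[node] == WHITE:
            visit(node)
    return back_edges


# Hypothetical topologies: a plain pipeline, and one with a
# bidirectional arc (a <-> b) plus a self-loop on a.
dag_topology = {"source": ["a"], "a": ["b"], "b": ["sink"]}
iterative_topology = {"source": ["a"], "a": ["b", "a"], "b": ["a", "sink"]}

print(find_back_edges(dag_topology))
print(find_back_edges(iterative_topology))
```

A platform could use such a check to decide whether a submitted topology can be scheduled as an ordinary DAG or needs the handling reserved for iterative topologies. The recursive traversal is adequate for small topologies; a very deep graph would call for an explicit stack.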
