Abstract: SpeedStream is a general-purpose distributed platform for processing massive data flows, offering low coupling, high availability, low latency, and high scalability. Focusing on the core technologies of a real-time stream computing platform in a cloud environment, this paper presents the research behind the system and its implementation. First, to address the availability of the platform, we design a high-availability framework based on Zookeeper. It provides timely fault detection and recovery at both the process level and the node level by monitoring the heartbeat of each module and applying a fault migration strategy. Second, to broaden the range of applications the platform supports, we use directed cycle detection and iteration protection to design a real-time stream computing model based on a directed graph with sources and sinks. The model satisfies the needs of common DAG computing services and also supports iterative computing services whose topologies include directed cycles, bidirectional arcs, and annular arcs. In addition, the platform lets users define personalized task scheduling strategies by establishing a task allocation matrix and optimizing the task allocation model. Finally, to solve many-to-many dynamic load balancing between tasks, we apply a stateful scheduler and a distributed session table, which overcomes the difficulty of maintaining session consistency without a global session table. We also verify the convergence of this method. Experiments indicate that SpeedStream achieves better throughput and data processing delay than the alternatives for iterative applications, applications with high traffic fluctuation, and applications with demanding load-balancing requirements.
The platform provides a reliable, general-purpose, real-time solution for processing massive data flows, such as processing real-time trading data in e-commerce, analyzing sensor data flows in the Internet of Things, and monitoring Internet traffic.
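
The directed cycle detection mentioned in the abstract can be illustrated with a minimal sketch. This is not SpeedStream's implementation, which the abstract does not describe; it is a standard depth-first search over a topology given as a list of (upstream, downstream) task pairs. The function names are hypothetical, and reading "annular arcs" as self-loops (a task that feeds its own input) is an assumption.

```python
from collections import defaultdict


def find_self_loops(edges):
    """Return tasks that send output to themselves (assumed meaning of 'annular arcs')."""
    return sorted({u for u, v in edges if u == v})


def find_bidirectional_pairs(edges):
    """Return unordered task pairs connected by arcs in both directions."""
    edge_set = set(edges)
    pairs = {
        tuple(sorted((u, v)))
        for u, v in edge_set
        if u != v and (v, u) in edge_set
    }
    return sorted(pairs)


def has_directed_cycle(edges):
    """Detect any directed cycle with an iterative three-color depth-first search."""
    graph = defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        nodes.update((u, v))

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {n: WHITE for n in nodes}

    for start in nodes:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GRAY:
                    # An arc back to a task on the current path closes a cycle.
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All successors explored: this task is finished.
                color[node] = BLACK
                stack.pop()
    return False


# Hypothetical topologies: a plain DAG and an iterative variant.
dag = [("source", "map"), ("map", "sink")]
iterative = dag + [("map", "map")]
```

Under these assumptions, a platform could run such a check when a topology is submitted and enable iteration protection only for topologies where `has_directed_cycle` returns true, leaving ordinary DAG jobs unaffected.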

SpeedStream: A real-time stream data processing platform in the cloud.