Abstract: Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, instead aim to achieve the long-held goal of moving compute to data. This kind of data-aware scheduling, typified by Hadoop MapReduce, has been successfully implemented in the Map phase, where each Map task is sent to the compute node that holds the corresponding input data chunk. However, Hadoop MapReduce limits itself to a one-map-to-one-reduce framework, which makes it difficult to handle complex logic such as pipelines or workflows. It also lacks built-in support and optimization for input datasets that are shared among multiple applications and/or jobs. Performance can improve significantly when knowledge of the shared and frequently accessed data is taken into account in scheduling decisions.

To enhance workflow management in modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is designed to be not only data-aware but also shared-data-aware. It identifies the most frequently shared data at both the task level and the job level, and replicates them to each compute node to improve data locality. It also supports multiple user-defined Map and Reduce functions, allowing users to orchestrate the required data-flow logic. We prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows that the execution runtime speedup exceeds 4X compared to a traditional MapReduce implementation, with a manageable time overhead. (C) 2014 Elsevier B.V. All rights reserved.
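The shared-data-aware idea in the abstract (find the data read most often across tasks and jobs, then replicate it to every compute node) can be illustrated with a minimal sketch. This is not CloudFlow's actual algorithm or API, which the abstract does not specify. The function names `most_shared_chunks` and `replication_plan`, the job/task/chunk layout, and the `top_k` cutoff are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def most_shared_chunks(jobs, top_k=2):
    """Return the ids of the chunks read by the most distinct tasks.

    `jobs` maps a job name to a dict of task name -> iterable of chunk ids.
    This layout is an assumption made for the sketch, not CloudFlow's format.
    """
    counts = Counter()
    for tasks in jobs.values():
        for chunks in tasks.values():
            # Count each chunk at most once per task.
            counts.update(set(chunks))
    # A chunk is "shared" only if more than one task reads it.
    shared = [(chunk, n) for chunk, n in counts.items() if n > 1]
    # Most-read first; ties broken by chunk id for a deterministic order.
    shared.sort(key=lambda item: (-item[1], item[0]))
    return [chunk for chunk, _ in shared[:top_k]]


def replication_plan(chunks, nodes):
    """Assign every selected chunk to every node (full replication)."""
    return {node: list(chunks) for node in nodes}


# Hypothetical workload: two jobs whose Map tasks overlap on chunks c1 and c2.
jobs = {
    "job_a": {"map_0": ["c1", "c2"], "map_1": ["c1", "c3"]},
    "job_b": {"map_0": ["c1", "c2"], "map_1": ["c4"]},
}
hot = most_shared_chunks(jobs, top_k=2)
plan = replication_plan(hot, ["node1", "node2"])
print(hot)
print(plan)
```

Counting accesses per task across all jobs covers both levels the abstract mentions: a chunk read by several tasks of one job and a chunk read by tasks of different jobs both raise the same counter. A real scheduler would also have to weigh replication cost against the locality gain, which this sketch ignores.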

Ease the Process of Machine Learning with Dataflow

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

Bigflow: A General Optimization Layer for Distributed Computing Frameworks

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

A Distributed and Scalable Machine Learning Approach for Big Data

Kubeflow-based Automatic Data Processing Service for Data Center of State Grid Scenario

PipeFlow Engine: Pipeline Scheduling with Distributed Workflow Made Simple

Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows

Dflow, a Python framework for constructing cloud-native AI-for-Science workflows

Progressive online aggregation in a distributed stream system

FLOWPROPHET: Generic and Accurate Traffic Prediction for Data-Parallel Cluster Computing

CloudFlow: A Data-Aware Programming Model for Cloud Workflow Applications on Modern HPC Systems

BigDataflow: A Distributed Interprocedural Dataflow Analysis Framework

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Streaming Task Graph Scheduling for Dataflow Architectures

Towards better data discovery and collection with flow-based programming

STEP: A Distributed Multi-threading Framework Towards Efficient Data Analytics

swFLOW: A large-scale distributed framework for deep learning on Sunway TaihuLight supercomputer

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

SQLFlow: A Bridge between SQL and Machine Learning

Asynchronous Complex Analytics in a Distributed Dataflow Architecture