PipeFlow Engine: Pipeline Scheduling with Distributed Workflow Made Simple

Yin Li, Chuang Lin
DOI: https://doi.org/10.1109/icpads.2013.31
2013-01-01
Abstract: Distributed computing systems are considered a fundamental architecture for extending resources such as computation speed, storage capacity, and network bandwidth, which are limited on a single processor. Emerging big-data processing techniques such as Hadoop take advantage of distributed servers to accomplish scalable parallel computation. Large-scale processing jobs can run interdependently on different servers, or even different clusters, and be combined into a workflow to produce meaningful outputs. In this paper, we analyze the common demands of big-data processing and distributed big-data workflow processing. Based on this analysis, we design the Pipe Flow Engine, whose features are matched to each of these demands. It orchestrates all involved jobs and schedules them in a batched pipeline mode. We also present two online ranking algorithms that make use of Pipe Flow, sharing our experience and best practices in using it.