Towards Data Intensive Many-Task Computing

Ioan Raicu, Ian Foster, Yong Zhao, Alex Szalay, Philip Little, Christopher M. Moretti, Amitabh Chaudhary, Douglas Thain
DOI: https://doi.org/10.4018/978-1-61520-971-2.ch002
2009-01-01
Abstract: Many-task computing aims to bridge the gap between two computing paradigms, high-throughput computing and high-performance computing. Traditional techniques for supporting many-task computing in scientific computing (i.e., reliance on parallel file systems with static configurations) do not scale to today’s largest systems for data-intensive applications, because the number of processors per system is growing faster than the performance of parallel file systems. In this chapter, the authors argue that in such circumstances data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a “data diffusion” approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online cache eviction policy. They also report empirical experiments that explore the benefits of data diffusion under both static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.