WATSON: A Workflow-based Data Storage Optimizer for Analytics

Jia Zou, Ming Zhao
2020-01-01
Abstract: This paper studies the automatic optimization of data placement parameters for the inter-job write once read many (WORM) scenario, where data is first materialized to storage by a producer job and then accessed many times by one or more consumer jobs. Such a scenario is ubiquitous in Big Data analytics applications, but existing Big Data auto-tuning techniques often focus on single-job performance. To address the shortcomings of existing work, this paper investigates data placement parameters regarding blocking, partitioning, and replication, and models the trade-offs caused by different configurations of these parameters through a producer-consumer model. We then present a novel cross-layer solution, WATSON, which can automatically predict future workloads’ data access patterns and tune data placement parameters accordingly to optimize performance for an inter-job WORM scenario. WATSON can achieve a performance speedup of up to eight times on various analytics workloads.