Architecture of a distributed storage that combines file system, memory and computation in a single layer

Jia Zou, Arun Iyengar, Chris Jermaine
DOI: https://doi.org/10.1007/s00778-020-00605-w
2020-02-26
The VLDB Journal
Abstract: Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and non-shared execution data in separate systems such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that can manage all data, both intermediate and long-lived, together with its buffering/caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.