Enabling Efficient Erasure Coding in Disaggregated Memory Systems

Qiliang Li, Liangliang Xu, Yongkun Li, Min Lyu, Wei Wang, Pengfei Zuo, Yinlong Xu
DOI: https://doi.org/10.1109/tpds.2023.3332782
IF: 5.3
2024-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Disaggregated memory (DM) separates compute and memory resources to build a huge memory pool. Erasure coding (EC) is expected to provide fault tolerance in DM at low memory cost. In DM with EC, objects are first coded in compute servers, then written directly to memory servers over high-speed networks, e.g., with one-sided RDMA. However, as one-sided RDMA latency drops to the microsecond level, coding overhead becomes a significant part of the total latency and degrades the performance of DM with EC. To enable efficient EC in DM, we thoroughly analyze the coding stack from the perspective of cache efficiency and RDMA transmission. We develop MicroEC, which optimizes the coding workflow by reusing the auxiliary coding data, coordinates coding and RDMA transmission with an exponential pipeline, and carefully adjusts the coding and transmission threads to minimize latency. We implement a prototype supporting common basic operations: write, read, degraded read, and recovery. For objects of at least 1 MB, experiments show that MicroEC reduces write latency by up to 44.35% and 42.14% and achieves up to $1.80\times$ and $1.73\times$ the write throughput of state-of-the-art DM systems with EC and with 3-way replication, respectively. For small objects, MicroEC also markedly reduces latency variation, e.g., it reduces the P99 latency of writing 1 KB objects by 27.81%.
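To make the terms "coding" and "degraded read" concrete, the following is a minimal sketch of erasure coding with a single XOR parity chunk. It is an illustration only and is not MicroEC's coding scheme: the abstract does not name the code used, and the function names `encode` and `degraded_read`, the parameter `k`, and the zero-padding choice are all assumptions made for this example. Network transfer to memory servers is left out entirely.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj: bytes, k: int) -> List[bytes]:
    """Split obj into k data chunks and append one XOR parity chunk.

    Hypothetical helper: a (k+1, k) single-parity code that tolerates
    the loss of any one chunk. The object is zero-padded so that all
    chunks have the same length.
    """
    chunk_len = -(-len(obj) // k)  # ceiling division
    padded = obj.ljust(chunk_len * k, b"\0")
    data = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, data)
    return data + [parity]


def degraded_read(chunks: List[Optional[bytes]], obj_len: int) -> bytes:
    """Rebuild the object when at most one chunk is missing (None)."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost chunk")
    if missing:
        # XOR of all surviving chunks equals the lost chunk.
        survivors = [c for c in chunks if c is not None]
        rebuilt = reduce(xor_bytes, survivors)
        chunks = chunks[:missing[0]] + [rebuilt] + chunks[missing[0] + 1:]
    # Drop the parity chunk and the padding.
    return b"".join(chunks[:-1])[:obj_len]
```

In this sketch, storing the `k + 1` chunks on different memory servers costs `(k + 1) / k` times the object size, compared with 3 times for 3-way replication, which is the memory-cost advantage the abstract refers to. The price is the extra computation in `encode`, the kind of coding overhead the paper aims to reduce.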