Abstract: High-performance flash-based key-value stores in data centers utilize large amounts of DRAM to cache hot data. However, motivated by the high cost and power consumption of DRAM, server designs with a lower DRAM-per-compute ratio are becoming popular. These low-cost servers enable scale-out services by reducing server workload densities. This results in improvements to overall service reliability, leading to a decrease in the total cost of ownership (TCO) for scalable workloads. Nevertheless, for key-value stores with large memory footprints, these reduced-DRAM servers degrade performance due to an increase in both IO utilization and data access latency. In this scenario, a standard practice to improve performance for sharded databases is to reduce the number of shards per machine, which degrades the TCO benefits of reduced-DRAM, low-cost servers. In this work, we explore a practical solution to improve performance and reduce the costs and power consumption of key-value stores running on DRAM-constrained servers by using Storage Class Memories (SCM). SCMs in a DIMM form factor, although slower than DRAM, are sufficiently faster than flash to serve as a large extension to DRAM. With new technologies like Compute Express Link, we can expand the memory capacity of servers with high-bandwidth, low-latency connectivity to SCM. In this article, we use Intel Optane PMem 100 Series SCMs (DCPMM) in AppDirect mode to extend the available memory of our existing single-socket platform deployment of RocksDB (one of the largest key-value stores at Meta). We first designed a hybrid cache in RocksDB to harness both DRAM and SCM hierarchically. We then characterized the performance of the hybrid cache for three of the largest RocksDB use cases at Meta (ChatApp, BLOB Metadata, and Hive Cache). Our results demonstrate that we can achieve up to 80% improvement in throughput and 20% improvement in P95 latency over the existing small-DRAM single-socket platform, while maintaining a 43–48% cost improvement over our large-DRAM dual-socket platform. To the best of our knowledge, this is the first study of the DCPMM platform in a commercial data center.
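The hierarchical DRAM-plus-SCM cache described in the abstract can be illustrated with a minimal two-tier sketch. The abstract does not state the admission or eviction policy, so everything here is an assumption made for illustration: both tiers use LRU, blocks evicted from the DRAM tier are demoted to the SCM tier, and a hit in the SCM tier promotes the block back to DRAM. The names `LRUTier`, `HybridCache`, and `load_from_flash` are hypothetical and are not RocksDB APIs; plain Python dictionaries stand in for the two memory tiers.

```python
from collections import OrderedDict


class LRUTier:
    """A fixed-capacity LRU map standing in for one memory tier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def pop(self, key):
        return self.items.pop(key, None)

    def put(self, key, value):
        """Insert a block; return the evicted (key, value) pair, or None."""
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            return self.items.popitem(last=False)  # least recently used
        return None


class HybridCache:
    """Assumed policy: DRAM in front, SCM behind it, flash as backing store.

    Cached values are assumed to never be None.
    """

    def __init__(self, dram_capacity, scm_capacity, load_from_flash):
        self.dram = LRUTier(dram_capacity)
        self.scm = LRUTier(scm_capacity)
        self.load_from_flash = load_from_flash  # hypothetical slow-path reader
        self.stats = {"dram_hit": 0, "scm_hit": 0, "flash_read": 0}

    def _insert_dram(self, key, value):
        evicted = self.dram.put(key, value)
        if evicted is not None:
            # Demote the DRAM victim to SCM; SCM's own victim is dropped.
            self.scm.put(*evicted)

    def get(self, key):
        value = self.dram.get(key)
        if value is not None:
            self.stats["dram_hit"] += 1
            return value
        value = self.scm.pop(key)
        if value is not None:
            self.stats["scm_hit"] += 1
            self._insert_dram(key, value)  # promote back to DRAM
            return value
        self.stats["flash_read"] += 1
        value = self.load_from_flash(key)
        self._insert_dram(key, value)
        return value
```

In this sketch, constructing `HybridCache(2, 2, reader)` and reading three distinct keys leaves the oldest block in the SCM tier, so a later read of that key is served from SCM instead of going back to flash. A real deployment would also have to account for block sizes, concurrency, and the latency gap between the tiers, none of which is modeled here.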
