Exploring DRAM Cache Prefetching for Pooled Memory

Chandrahas Tirumalasetty, Narasimha Annapreddy
2024-06-21
Abstract: Hardware-based memory pooling, enabled by interconnect standards like CXL, has been gaining popularity among cloud providers and system integrators. While pooling memory resources has cost benefits, it comes at the penalty of increased memory access latency. With yet another addition to the memory hierarchy, local DRAM can potentially be used as a block cache (DRAM cache) for fabric-attached memory (FAM), and data prefetching techniques can be used to hide the FAM access latency. This paper proposes a system for prefetching sub-page blocks from FAM into the DRAM cache to improve data access latency and application performance. We further optimize our DRAM cache prefetch mechanism through enhancements that mitigate the performance degradation due to bandwidth contention at FAM. We consider the potential for providing additional functionality at the CXL memory node through weighted fair queuing of demand and prefetch requests. We compare such a memory-node-level approach to adapting the prefetch rate at the compute node based on observed latencies. We evaluate the proposed system in single-node and multi-node configurations with applications from the SPEC, PARSEC, Splash, and GAP benchmark suites. Our evaluation suggests that DRAM cache prefetching results in a 7% IPC improvement and that both proposed optimizations can further increase IPC by 7-10%.
Hardware Architecture
What problem does this paper attempt to address?
This paper examines how DRAM cache prefetching can reduce data access latency and improve application performance in systems that pool memory over CXL (Compute Express Link). Pooling memory resources lowers cost but increases access latency. To hide that latency, the paper proposes a system that prefetches sub-page blocks from FAM (Fabric Attached Memory) into a DRAM cache. It also adds enhancements that mitigate the performance degradation caused by bandwidth contention at FAM.
The mechanisms proposed in the paper include:
1. DRAM cache prefetching: a portion of local DRAM is used as a hardware-managed cache for FAM, and prefetches are issued when misses occur in the LLC (Last Level Cache).
2. Prefetch adaptivity: the DRAM cache prefetch rate is adjusted at the compute node according to FAM congestion, as inferred from observed latencies, so that FAM bandwidth is managed effectively.
3. Weighted Fair Queuing (WFQ) at the CXL memory node: demand and prefetch requests are scheduled with WFQ at the memory node, and this approach is compared against adjusting the prefetch rate at the compute node.
The evaluation covers single-node and multi-node configurations with applications from the SPEC, PARSEC, Splash, and GAP benchmark suites. It suggests that DRAM cache prefetching improves IPC (Instructions Per Cycle) by 7%, and that the two proposed optimizations can each increase IPC by a further 7-10%.
The research primarily targets the new requirements that modern workloads place on memory systems, especially the demand for large amounts of data by machine learning applications, and the challenges brought by memory disaggregation. Through prefetching and the optimizations above, the paper aims to improve the efficiency and performance of memory systems in data centers.
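The weighted fair queuing of demand and prefetch requests described above can be illustrated with a small sketch. This is not the paper's implementation: the class name `WFQScheduler`, the 3:1 demand-to-prefetch weights, and the unit request size are all assumptions chosen for illustration, and the virtual clock uses the simpler self-clocked approximation (it advances to the finish tag of the request just served) rather than exact WFQ.

```python
import heapq
from itertools import count


class WFQScheduler:
    """Hypothetical self-clocked fair queue for a memory node (illustrative only)."""

    def __init__(self, weights):
        # weights: request class -> relative share, e.g. {"demand": 3, "prefetch": 1}
        self.weights = dict(weights)
        self.last_finish = {cls: 0.0 for cls in self.weights}
        self.virtual_time = 0.0
        self._heap = []
        self._seq = count()  # tie-breaker so equal finish tags stay FIFO

    def enqueue(self, cls, req_id, size=1.0):
        # A request starts after the previous one in its class, or "now".
        start = max(self.virtual_time, self.last_finish[cls])
        # Higher weight means a smaller finish-tag increment, so earlier service.
        finish = start + size / self.weights[cls]
        self.last_finish[cls] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), cls, req_id))

    def dequeue(self):
        # Serve the pending request with the smallest finish tag.
        if not self._heap:
            return None
        finish, _, cls, req_id = heapq.heappop(self._heap)
        self.virtual_time = finish  # self-clocked approximation
        return cls, req_id


if __name__ == "__main__":
    sched = WFQScheduler({"demand": 3, "prefetch": 1})  # assumed weights
    for i in range(6):
        sched.enqueue("demand", i)
        sched.enqueue("prefetch", i)
    while (item := sched.dequeue()) is not None:
        print(item)
```

With both classes backlogged, demand requests receive roughly three service slots for every prefetch slot, yet prefetches are never starved outright. That is the property that makes this kind of queue a plausible alternative to throttling the prefetch rate at the compute node.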