Abstract: Caches play a critical role in narrowing the performance gap between the CPU and main memory. A modern multi-core CPU generally employs a multi-level cache hierarchy in which the most recently and frequently used data are kept in each core's private caches, while all cores share the last-level cache (LLC). For inclusive caches, clean cache lines replaced in higher-level caches need not be copied back to lower levels, since inclusiveness implies that they already exist there. For the exclusive and non-inclusive caches widely used by Intel, AMD, and ARM today, indiscriminately copying back either all or none of the replaced clean cache lines to lower levels violates neither the exclusiveness nor the non-inclusiveness definition. We have conducted a quantitative study and found that copying back all or none of the clean cache lines to the lower-level cache of an exclusive hierarchy yields suboptimal performance. The reason is that only some of the cache lines are reused, while the others turn out to be dead in the long run. This observation motivates us to selectively copy back some clean cache lines to the LLC in an exclusive or non-inclusive cache architecture. We revisit the concept of the reuse distance of cache lines. In a nutshell, a clean cache line with a shorter reuse distance is copied back to the lower-level cache because it is likely to be re-referenced in the near future, while cache lines with much longer reuse distances are discarded or, if they are dirty, written to memory. We have implemented and evaluated our proposal with a non-volatile (STT-MRAM) LLC. Experimental results with gem5 and the SPEC CPU 2017 benchmarks show that our proposal yields up to 12.8% higher IPC (instructions per cycle) on average than the least-recently-used (LRU) replacement policy that copies back all clean cache lines to an STT-MRAM LLC.
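
The selective copy-back decision described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `EvictedLine`, `Action`, and `on_eviction`, the way reuse distance is represented, and the threshold value are all hypothetical, since the abstract gives no concrete mechanism or cutoff. The sketch also assumes that a dirty line with a short reuse distance is copied back to the LLC like a clean one, which the abstract does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    COPY_BACK_TO_LLC = "copy back to LLC"
    WRITE_TO_MEMORY = "write back to memory"
    DISCARD = "discard"


@dataclass
class EvictedLine:
    tag: int
    dirty: bool
    # Hypothetical estimate of how far away the next reuse is,
    # e.g. the number of intervening accesses between two uses.
    reuse_distance: int


# Hypothetical cutoff; the abstract does not specify a value.
REUSE_DISTANCE_THRESHOLD = 64


def on_eviction(line: EvictedLine,
                threshold: int = REUSE_DISTANCE_THRESHOLD) -> Action:
    """Decide what to do with a line evicted from a higher-level cache."""
    if line.reuse_distance <= threshold:
        # Short reuse distance: likely to be re-referenced soon, so keep it
        # on chip. (Assumption: dirty lines take this path as well.)
        return Action.COPY_BACK_TO_LLC
    if line.dirty:
        # Long reuse distance but modified: the data must not be lost.
        return Action.WRITE_TO_MEMORY
    # Long reuse distance and clean: memory already holds a valid copy.
    return Action.DISCARD


if __name__ == "__main__":
    for line in (EvictedLine(0x1A, dirty=False, reuse_distance=8),
                 EvictedLine(0x2B, dirty=False, reuse_distance=5000),
                 EvictedLine(0x3C, dirty=True, reuse_distance=5000)):
        print(hex(line.tag), on_eviction(line).value)
```

Skipping the copy-back for lines that are unlikely to be reused avoids unnecessary writes to the LLC, which is presumably why the authors evaluate the idea on an STT-MRAM LLC.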

LLC Buffer for Arbitrary Data Sharing in Heterogeneous Systems

Buffer on Last Level Cache for CPU and GPGPU Data Sharing

Enable Back Memory and Global Synchronization on LLC Buffer

Buffer Filter: A Last-Level Cache Management Policy for CPU-GPGPU Heterogeneous System

Analyzing Memory Access on CPU-GPGPU Shared LLC Architecture

Last Level Cache Layout Remapping for Heterogeneous Systems

LA-LLC: Inter-Core Locality-Aware Last-Level Cache to Exploit Many-to-Many Traffic in GPGPUs

WAP: the Warp Feature Aware Prefetching Method for LLC on CPU-GPU Heterogeneous Architecture

Exploring Time-Predictable and High-Performance Last-Level Caches for Hard Real-Time Integrated CPU-GPU Processors

Improve LLC Bypassing Performance by Memory Controller Improvements in Heterogeneous Multicore System

5GC$^2$ache: Improving 5G UPF Performance via Cache Optimization

Cache Management with Partitioning-Aware Eviction and Thread-Aware Insertion/Promotion Policy

Predictable Sharing of Last-level Cache Partitions for Multi-core Safety-critical Systems

Improving Cache Partitioning Algorithms For Pseudo-Lru Policies

Global Priority Table for Last-Level Caches

Reuse Distance-based Copy-backs of Clean Cache Lines to Lower-level Caches

CWFP: Novel Collective Writeback and Fill Policy for Last-Level DRAM Cache

Cooperatively Managing Dynamic Writeback and Insertion Policies in a Last-Level DRAM Cache

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

Affinity-aware DMA Buffer Management for Reducing Off-Chip Memory Access

A Method for Hiding the Increased Non-Volatile Cache Read Latency