Abstract: As the number of applications running concurrently on chip multiprocessors (CMPs) increases, efficient management of the shared last-level cache (LLC) is crucial to overall performance. Recent studies have shown that cache partitioning can improve throughput, fairness, and quality of service. Most prior work uses true Least Recently Used (LRU) as the underlying cache replacement policy and relies on its stack property to work properly. However, commodity processors commonly use pseudo-LRU policies, which lack the stack property, instead of LRU because of their simplicity and low storage overhead. This study therefore sets out to understand whether LRU-based cache partitioning techniques can be applied to commodity processors. In this work, we propose a cache partitioning mechanism for two popular pseudo-LRU policies: Not Recently Used (NRU) and Binary Tree (BT). Because true LRU's stack property is not available, we propose a profiling logic that applies curve approximation methods to derive the hit curve (hit counts under varied way allocations) for an application. We then propose a hybrid partitioning mechanism, which mitigates the gap between the predicted hit curve and the actual statistics. Simulation results demonstrate that our proposal improves throughput by 15.3% on average and outperforms the stack-estimate proposal by 12.6% on average. Similar results are achieved in weighted speedup. For the cache configurations under study, the proposal requires a storage overhead of less than 0.5% of the last-level cache. In addition, we show that a profiling mechanism with only one true LRU ATD achieves comparable performance and can further reduce the hardware cost by nearly two thirds compared with the hybrid mechanism.
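To illustrate the stack property that LRU-based partitioning relies on, the following is a minimal sketch of how a hit curve can be read off a single true-LRU set. It is not the profiling logic proposed in this work, which targets NRU and BT where this shortcut is unavailable. The function name lru_hit_curve and the toy access sequence are assumptions made for illustration only.

```python
def lru_hit_curve(accesses, max_ways):
    """Hit curve for one cache set under true LRU (illustrative sketch).

    Returns a list where entry w-1 is the number of hits the access
    sequence would see if the set had w ways, for w = 1..max_ways.
    """
    stack = []                        # recency stack, index 0 is the MRU position
    position_hits = [0] * max_ways    # hits observed at each stack position

    for tag in accesses:
        if tag in stack:
            depth = stack.index(tag)  # 0-based position in the recency stack
            position_hits[depth] += 1
            stack.remove(tag)
        stack.insert(0, tag)          # accessed line becomes MRU
        del stack[max_ways:]          # evict anything beyond max_ways

    # Stack property: a hit at position d is also a hit for every
    # allocation with more than d ways, so the curve is a prefix sum.
    curve, total = [], 0
    for count in position_hits:
        total += count
        curve.append(total)
    return curve


if __name__ == "__main__":
    # Hypothetical tag sequence for a single set.
    print(lru_hit_curve(["A", "B", "A", "C", "B", "A"], max_ways=4))
```

A single pass with max_ways counters yields the hit count for every allocation at once. Pseudo-LRU policies such as NRU and BT do not maintain a full recency order, which is why the hit curve has to be approximated there instead.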

Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management

Time-sharing Parallel Applications Through Performance-Targeted Feedback-Controlled Real-Time Scheduling

Toward Full-Coverage and Low-Overhead Profiling of Network-Stack Latency

Reducing Scalability Collapse Via Requester-Based Locking on Multicore Systems

SafeTail: Efficient Tail Latency Optimization in Edge Service Scheduling via Computational Redundancy Management

TailCutter: Wisely Cutting Tail Latency in Cloud CDNs under Cost Constraints

iBalancer: Load-Aware in-Server Flow Scheduling for Sub-Millisecond Tail Latency

Software-Based Lightweight Multithreading to Overlap Memory-Access Latencies of Commodity Processors

Collaborative Cloud and Edge Computing for Latency Minimization

PCS: Predictive Component-level Scheduling for Reducing Tail Latency in Cloud Online Services

Optimizing Tail Latency by Critical Window-based Dynamic Cache Space Allocation

Memory at Your Service: Fast Memory Allocation for Latency-critical Services

Reducing overheads of dynamic scheduling on heterogeneous chips

Tail-Learning: Adaptive Learning Method for Mitigating Tail Latency in Autonomous Edge Systems

PseudoNUMA for Reducing Memory Interference in Multi-Core Systems

SWP: Microsecond Network SLOs Without Priorities

Short Tail: Taming Tail Latency for Erasure-Code-based In-Memory Systems

TADC: Thread-aware Divide-and-Conquer Policy to Manage Shared Cache

PMR: Priority Memory Reclaim to Improve the Performance of Latency-Critical Services

Retrospecting Available CPU Resources: SMT-Aware Scheduling to Prevent SLA Violations in Data Centers

Improving Cache Partitioning Algorithms for Pseudo-LRU Policies