Abstract: Mainstream chip multiprocessors already include enough cores that straightforward snooping-based cache coherence is becoming less appropriate. Further increases in core count will almost certainly require more sophisticated tracking of data sharing to minimize unnecessary messages and cache snooping. Directory-based coherence has been the standard solution for large-scale shared-memory multiprocessors and is a clear candidate for on-chip coherence maintenance. A vanilla directory design, however, uses storage inefficiently to keep coherence metadata, which results in high storage overhead at larger scales. Reducing this overhead frees resources that can be redeployed for other purposes. In this paper, we exploit familiar characteristics of coherence metadata from novel angles and propose two practical techniques to increase the expressiveness of directory entries, particularly for chip multiprocessors. First, it is well known that the vast majority of cache lines have a small number of sharers. We exploit a related fact with a subtle but important difference: a significant portion of directory entries only need to track one node. We can thus use a hybrid representation of the sharer list in the directory. Second, contiguous memory regions often share the same coherence characteristics and can be tracked by a single entry. We propose an adaptive multi-granular mechanism that does not rely on any profiling, compiler, or operating system support to identify such regions. Moreover, it allows line and region entries to coexist in the same locations, thus making regions more applicable. We show that both techniques improve the expressiveness of directory entries and, when combined, can reduce directory storage by more than an order of magnitude with negligible loss of precision.
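The hybrid sharer-list idea in the abstract can be illustrated with a minimal sketch. This is not the paper's actual design: the class name `DirectoryEntry`, the 64-core count, and the two formats chosen here (a single-node pointer and a full bit vector) are assumptions made for illustration only, and the bit that would select the format is ignored in the storage estimate.

```python
from dataclasses import dataclass
from math import ceil, log2
from typing import List, Optional

NUM_CORES = 64  # assumed core count, not taken from the paper


@dataclass
class DirectoryEntry:
    """Hypothetical hybrid entry: a short pointer while only one node
    holds the line, a full bit vector once a second node shares it."""
    owner: Optional[int] = None  # valid only in single-node form
    sharers: int = 0             # bit vector, valid only in multi-node form

    def add_sharer(self, core: int) -> None:
        if self.sharers:
            # Already in bit-vector form: set the core's bit.
            self.sharers |= 1 << core
        elif self.owner is None:
            # Empty entry: track the first node with a pointer.
            self.owner = core
        elif core != self.owner:
            # Second distinct node: switch to the bit-vector form.
            self.sharers = (1 << self.owner) | (1 << core)
            self.owner = None

    def sharer_list(self) -> List[int]:
        if self.sharers:
            return [c for c in range(NUM_CORES) if (self.sharers >> c) & 1]
        return [] if self.owner is None else [self.owner]

    def bits_used(self) -> int:
        # Payload bits only; the format-select bit is not counted.
        return NUM_CORES if self.sharers else ceil(log2(NUM_CORES))


entry = DirectoryEntry()
entry.add_sharer(5)
print(entry.sharer_list(), entry.bits_used())
entry.add_sharer(9)
print(entry.sharer_list(), entry.bits_used())
```

Under these assumptions, an entry that tracks one node needs only a pointer of log2(N) bits instead of an N-bit vector, which is the source of the storage saving the abstract alludes to for single-node entries.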

Grouping Cores for Chip Multiprocessors Optimization

Hierarchical Cache Directory for CMP.

Building Expressive and Area-Efficient Directories with Hybrid Representation and Adaptive Multi-Granular Tracking

Network caching for Chip Multiprocessors

L1 Collective Cache: Managing Shared Data for Chip Multiprocessors

CMP Thread Assignment Based on Group Sharing L2 Cache

An Efficient Lightweight Shared Cache Design for Chip Multiprocessors

Cluster Cache Monitor: Leveraging the Proximity Data in CMP

Bayesian Theory Based Adaptive Proximity Data Accessing for CMP Caches

CCNoC: Cache-Coherent Network on Chip for Chip Multiprocessors.

Fast Hierarchical Cache Directory: A Scalable Cache Organization for Large-Scale CMP

Load Balance Scheduling Algorithm for CMP Architecture

Network Victim Cache: Leveraging Network-on-Chip for Managing Shared Caches in Chip Multiprocessors

Optimal Placement of Cores, Caches and Memory Controllers in Network On-Chip

Cluster Cache Monitor

Minimizing Bank Conflict Delay for Real-Time Embedded Multicore Systems via Bank Mapping.

Leveraging On-Chip Networks For Data Cache Migration In Chip Multiprocessors

Bayesian Theory Oriented Optimal Data-Provider Selection for CMP

Near Data Computation for Message-Passing Chip-Multiprocessors.

PASCMP: A Novel Cache Framework for Data Mining Application

Function Units Sharing Between Neighbor Cores in CMP.