Abstract: Mainstream chip multiprocessors already include enough cores that straightforward snooping-based cache coherence is becoming less appropriate. Further increases in core count will almost certainly require more sophisticated tracking of data sharing to minimize unnecessary messages and cache snooping. Directory-based coherence has been the standard solution for large-scale shared-memory multiprocessors and is a clear candidate for on-chip coherence maintenance. A vanilla directory design, however, uses storage inefficiently when keeping coherence metadata, and the resulting storage overhead becomes high at larger scales. Reducing this overhead frees resources that can be redeployed for other purposes. In this paper, we exploit familiar characteristics of coherence metadata from new angles and propose two practical techniques to increase the expressiveness of directory entries, particularly for chip multiprocessors. First, it is well known that the vast majority of cache lines have a small number of sharers. We exploit a related fact with a subtle but important difference: a significant portion of directory entries only need to track one node. We can thus use a hybrid representation of the sharer list for the directory. Second, contiguous memory regions often share the same coherence characteristics and can be tracked by a single entry. We propose an adaptive multi-granular mechanism that does not rely on any profiling, compiler, or operating system support to identify such regions. Moreover, it allows line and region entries to coexist in the same locations, thus making regions more widely applicable. We show that both techniques improve the expressiveness of directory entries and, when combined, can reduce directory storage by more than an order of magnitude with negligible loss of precision.
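To make the storage argument behind the first technique concrete, the following is a minimal sketch of a directory entry that holds either a single node pointer or a full sharer bit vector. It is an illustration under stated assumptions, not the paper's design: the abstract does not specify the encoding, and the names `HybridEntry`, `pointer_bits`, and `full_map_bits`, as well as the 64-node configuration, are hypothetical.

```python
def full_map_bits(num_nodes: int) -> int:
    """Bits for a full bit-vector sharer list: one presence bit per node."""
    return num_nodes


def pointer_bits(num_nodes: int) -> int:
    """Bits needed to name exactly one node out of num_nodes."""
    return max(1, (num_nodes - 1).bit_length())


class HybridEntry:
    """Hypothetical entry: a single-node pointer, promoted to a bit vector
    only when a second distinct node must be tracked (assumed policy)."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.single = None  # node id while at most one node is tracked
        self.vector = None  # integer bitmask once several nodes are tracked

    def add_sharer(self, node: int) -> None:
        if not 0 <= node < self.num_nodes:
            raise ValueError("node id out of range")
        if self.vector is not None:
            self.vector |= 1 << node
        elif self.single is None or self.single == node:
            self.single = node
        else:
            # Second distinct sharer: switch to the bit-vector form.
            self.vector = (1 << self.single) | (1 << node)
            self.single = None

    def sharers(self) -> list:
        if self.vector is not None:
            return [n for n in range(self.num_nodes) if (self.vector >> n) & 1]
        return [] if self.single is None else [self.single]

    def storage_bits(self) -> int:
        """Sharer-list bits this entry needs in its current form."""
        if self.vector is not None:
            return full_map_bits(self.num_nodes)
        return pointer_bits(self.num_nodes)


# Usage: with 64 nodes, a single-node entry needs 6 bits instead of 64.
entry = HybridEntry(64)
entry.add_sharer(5)
single_cost = entry.storage_bits()   # 6
entry.add_sharer(9)
shared_cost = entry.storage_bits()   # 64
```

The sketch only counts sharer-list bits; tags, state bits, and the mechanism by which pointer-form and vector-form entries share physical storage are left out because the abstract does not describe them.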

A Comprehensive Methodology to Determine Optimal Coherence Interfaces for Many-Accelerator SoCs

Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs

Building Expressive and Area-Efficient Directories with Hybrid Representation and Adaptive Multi-Granular Tracking

An accelerator-aware microarchitecture simulator for design space exploration

Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device

Allo: A Programming Model for Composable Accelerator Design

Synchronization Coherence: A Transparent Hardware Mechanism For Cache Coherence And Fine-Grained Synchronization

Analytical Modeling the Multi-Core Shared Cache Behavior with Considerations of Data-Sharing and Coherence

Phase-Priority based Directory Coherence for Multicore Processor

gem5-NVDLA: A Simulation Framework for Compiling, Scheduling and Architecture Evaluation on AI System-on-Chips

Automated Communication and Floorplan-Aware Hardware/Software Co-Design for SoC

AHA: An Agile Approach to the Design of Coarse-Grained Reconfigurable Accelerators and Compilers

Evaluating the Performance of Software Cache Coherence

Optimal Placement of Cores, Caches and Memory Controllers in Network On-Chip

Huicore: A Generalized Hardware Accelerator for Complicated Functions

PEPCP: A Power-Efficient Parallel Coherence Protocol for Large-Scale Network-on-Chip

ECI: a Customizable Cache Coherency Stack for Hybrid FPGA-CPU Architectures

Unveiling the Advantages of Full Coherency Architecture for FPSoC Systems

Collaborative Heterogeneity-Aware OS Scheduler for Asymmetric Multicore Processors

Optimizing Offload Performance in Heterogeneous MPSoCs

Design Space Exploration of HW Accelerators and Network Infrastructure for FPGA-Based MPSoC