Abstract: The large working sets of commercial and scientific workloads favor a shared L2 cache design that maximizes aggregate cache capacity and minimizes off-chip memory requests in chip multiprocessors (CMPs). Two important hurdles restrict the scalability of these chip multiprocessors: the on-chip memory cost of the directory and long L1 miss latencies. This work presents a network caching architecture that addresses both problems. Network caching takes advantage of on-chip networks to manage shared data blocks and directory information in chip multiprocessors. The architecture removes the directory structure from the shared L2 caches and stores directory information for blocks recently cached by L1 caches in the network interface components, which decreases on-chip directory memory overhead and improves scalability. The saved memory space is used for shared data caches or victim caches embedded in the network interface components to further reduce L1 miss latencies. This paper develops three network caching designs to reduce L1 miss latencies. The proposed architecture is evaluated through simulations of a 16-core tiled CMP. First, we demonstrate that the network caching architecture provides good scalability. Second, it also provides robust performance. Third, the different network caching designs have distinct impacts on CMP performance. Compared with the traditional shared L2 cache design, the network victim cache (NVC) design improves performance by 23% on average and by up to 34%. The network shared cache (NSC) design improves performance by 6% on average and by up to 16%. The network directory cache (NDC) design improves performance by 4% on average and by up to 11%.
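
The idea of keeping directory information only for blocks recently cached by L1 caches can be illustrated with a toy model. The following is a minimal sketch, not the paper's design: the class name, the method names, the LRU replacement policy, and the per-entry sharer set are all assumptions made for illustration, since the abstract does not specify how the network interface organizes its directory entries.

```python
from collections import OrderedDict


class NetworkInterfaceDirectory:
    """Toy directory held in a network interface (hypothetical model).

    Tracks which cores' L1 caches hold each recently cached block.
    Capacity is bounded, so only recently used blocks have an entry.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # block address -> set of core ids; order is least to most recent
        self.entries = OrderedDict()

    def record_l1_fill(self, block, core):
        """Note that `core` now caches `block`; return an evicted entry or None."""
        sharers = self.entries.pop(block, set())
        sharers.add(core)
        self.entries[block] = sharers  # reinsert as most recently used
        if len(self.entries) > self.capacity:
            # Assumed policy: drop the least recently used entry. A real
            # design would also have to invalidate the L1 copies it tracked.
            return self.entries.popitem(last=False)
        return None

    def record_l1_eviction(self, block, core):
        """Note that `core` no longer caches `block`."""
        sharers = self.entries.get(block)
        if sharers is None:
            return
        sharers.discard(core)
        if not sharers:
            del self.entries[block]

    def sharers_of(self, block):
        """Return the cores known to cache `block` (empty if untracked)."""
        return frozenset(self.entries.get(block, ()))
```

In this sketch the directory storage scales with the number of blocks held in L1 caches rather than with the L2 capacity, which is the source of the memory saving the abstract describes.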

CINOC: Computing in Network-On-Chip with Tiled Many-Core Architectures for Large-Scale General Matrix Multiplications

DaDianNao: A Machine-Learning Supercomputer

Novel many-core architecture design for real-time image processing

High Performance Matrix Multiplication on Many Cores

Network Victim Cache: Leveraging Network-on-Chip for Managing Shared Caches in Chip Multiprocessors

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

WWW: What, When, Where to Compute-in-Memory

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

Network caching for Chip Multiprocessors

A Novel Scheme to Map Convolutional Networks to Network-on-Chip with Computing-In-Memory Nodes

TensorCIM: Digital Computing-in-Memory Tensor Processor with Multichip-Module-Based Architecture for Beyond-NN Acceleration

Memory and Computation Coordinated Mapping of DNNs Onto Complex Heterogeneous SoC

A Task-Adaptive In-Situ ReRAM Computing for Graph Convolutional Networks

MAICC: A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference

Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors

A Customized NoC Architecture to Enable Highly Localized Computing-On-the-Move DNN Dataflow

An Efficient Lightweight Shared Cache Design for Chip Multiprocessors

Scale up your In-Memory Accelerator: Leveraging Wireless-on-Chip Communication for AIMC-based CNN Inference

COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning

Domino: A Tailored Network-on-Chip Architecture to Enable Highly Localized Inter- and Intra-Memory DNN Computing

Optimal Placement of Cores, Caches and Memory Controllers in Network On-Chip