Abstract: Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements for existing on-chip caches or main memories in future multi-core architectures. This is due to the many attractive features these memory technologies possess: high density, low leakage, and non-volatility. However, the latency and energy overheads associated with the write operations of these emerging memories have become a major obstacle to their adoption. Previous works have proposed various circuit-level and architecture-level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in a cache architecture built with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks to hide the latency effectively. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long-latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router that request access to idle banks. Evaluations on a 3D architecture consisting of 64 cores and 64 STT-RAM cache banks show that our proposed approach provides a 14% average IPC improvement for multi-threaded benchmarks, a 19% instruction throughput benefit for multi-programmed workloads, and a 6% latency reduction compared to a recently proposed write buffering mechanism.
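The bank-aware prioritization described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Packet`, `BusyEstimator`, and `arbitrate`, the latency constants, and the tie-breaking rule (oldest packet first) are all assumptions made for illustration. The router keeps an estimate of when each bank becomes free and forwards packets headed to idle banks ahead of packets headed to a bank that is still serving a long write.

```python
from dataclasses import dataclass, field

# Hypothetical access latencies in cycles (illustrative, not from the paper).
READ_CYCLES = 3
WRITE_CYCLES = 30


@dataclass
class Packet:
    bank: int        # destination cache bank
    is_write: bool   # True for a write request, False for a read
    arrival: int     # cycle at which the packet entered the router


@dataclass
class BusyEstimator:
    """Tracks the estimated cycle at which each bank becomes idle."""
    busy_until: dict = field(default_factory=dict)

    def is_idle(self, bank: int, now: int) -> bool:
        return self.busy_until.get(bank, 0) <= now

    def record_forward(self, pkt: Packet, now: int) -> None:
        # The request starts once the bank has finished its earlier work.
        start = max(now, self.busy_until.get(pkt.bank, 0))
        cost = WRITE_CYCLES if pkt.is_write else READ_CYCLES
        self.busy_until[pkt.bank] = start + cost


def arbitrate(queue: list, est: BusyEstimator, now: int):
    """Pick the next packet: idle-bank packets first, then oldest first."""
    if not queue:
        return None
    chosen = min(queue, key=lambda p: (not est.is_idle(p.bank, now), p.arrival))
    queue.remove(chosen)
    est.record_forward(chosen, now)
    return chosen


if __name__ == "__main__":
    est = BusyEstimator()
    queue = [Packet(0, True, 0), Packet(0, False, 1), Packet(1, False, 2)]
    for cycle in range(3):
        pkt = arbitrate(queue, est, cycle)
        print(cycle, pkt)
```

In this sketch the read to bank 1 overtakes the older read to bank 0, because bank 0 is estimated to be busy with the write forwarded in the previous cycle. A real router would also need a starvation bound for delayed packets, which is omitted here.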

Architecting On-Chip Interconnects for Stacked 3D STT-RAM Caches in CMPs