Abstract:The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the internet-of-things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this work, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22nm FDX technology and evaluated performance and energy-efficiency with tunable microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41 times smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the microbenchmarks to 10% synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92% and 23% on average and energy efficiency by up to 98% and 39% on average.

Moped: Orchestrating Interprocess Message Data On Cmps

MOPED: Accelerating Data Communication on Future CMPs

DEAM：Decoupled, Expressive, Area-Efficient Metadata Cache

Accelerating Data Movement on Future Chip Multi-Processors

Hardware Support for Message-Passing in Chip Multi-Processors.

Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

Test of the MPCore cache bandwidthand considerations for efficient software execution

Direct Distributed Memory Access for CMPs

Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture

MeMPA: A Memory Mapped M-SIMD Co-Processor to Cope with the Memory Wall Issue

Near Data Computation for Message-Passing Chip-Multiprocessors.

MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

Cache coherence protocol and implementation for multiprocessors with no-write-allocate caches

Molecular Dynamics Simulation On Commodity Shared-Memory Multiprocessor Systems With Lightweight Multithreading

An 800mhz 320mw 16-Core Processor with Message-Passing and Shared-Memory Inter-Core Communication Mechanisms

PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices

Grouping Cores for Chip Multiprocessors Optimization

Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

MPU: Towards Bandwidth-abundant SIMT Processor via Near-bank Computing

A Software-Controlled Cache Coherence Optimization for Snoopy-Based SMP System

Phase-Priority based Directory Coherence for Multicore Processor