Abstract: As multicore systems continue to grow in scale and on-chip memory capacity, on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol, and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core for short, cache-line-granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core. Indirect memory accesses can be performed by chaining requests made from within the cache, rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines to process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA, and subsequent exposure to caches. We evaluate using a cycle-level, execution-driven, gem5-based simulator, using 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedups over an in-order and an out-of-order (OOO) core with a state-of-the-art prefetcher design, respectively, with 64% and 49% energy-efficiency advantages.
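The two access patterns named in the abstract, a long-lasting strided stream and an indirect access chained from it, can be illustrated with a minimal Python sketch. It only models address generation from a compact stream description, which is the kind of information the abstract proposes exposing to the cache. The class names `AffineStream` and `IndirectStream`, their fields, and all addresses are hypothetical choices for this illustration, not the paper's ISA or hardware design.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class AffineStream:
    """Hypothetical descriptor for the pattern base, base + stride, ..."""
    base: int
    stride: int
    length: int

    def addresses(self) -> Iterator[int]:
        # Every address follows from the descriptor alone, so whoever holds
        # the descriptor can generate the requests without the core.
        for i in range(self.length):
            yield self.base + i * self.stride


@dataclass(frozen=True)
class IndirectStream:
    """Hypothetical descriptor for a[b[i]], chained from the values of b."""
    target_base: int
    elem_size: int

    def addresses(self, index_values: Sequence[int]) -> Iterator[int]:
        # Each value produced by the index stream becomes the next request.
        for idx in index_values:
            yield self.target_base + idx * self.elem_size


# Index array b: 4 elements of 4 bytes at an assumed address 0x1000.
index_addrs = list(AffineStream(base=0x1000, stride=4, length=4).addresses())

# Assumed contents of b, used to chain into array a (8-byte elements at 0x8000).
b_values = [3, 0, 2, 1]
data_addrs = list(
    IndirectStream(target_base=0x8000, elem_size=8).addresses(b_values)
)
```

In this sketch the whole request sequence is determined by a few descriptor fields plus the loaded index values, which is why such a pattern could in principle be generated away from the core.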

Stream Floating: Enabling Proactive and Decentralized Cache Optimizations

Cache streamization for high performance stream processor

DEAM: Decoupled, Expressive, Area-Efficient Metadata Cache

Scalable-Grain Pipeline Parallelization Method For Multi-Core Systems

The Case of Using Multiple Streams in Streaming

Software Managed Instruction Scratchpad Memory Optimization in Stream Architecture Based on Hot Code Analysis of Kernels

Stream-Based Data Placement for Near-Data Processing with Extended Memory

Exploiting Stable Data Dependency in Stream Processing Acceleration on FPGAs

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

StreamCache: Revisiting Page Cache for File Scanning on Fast Storage Devices

Fine-Grained Multi-Query Stream Processing on Integrated Architectures

Tiled multi-core stream architecture

Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

A Hardware/Software Method for Heterogeneous Cores Cooperating on Stream Architecture

MASA Stream Architecture and Evaluating for a Fluid Computing Application

Integrated Pipelined Task Scheduling and Core Mapping for Streaming Applications on Multi-Core Systems

CoopStream: A Cooperative Cache Based Streaming Schedule Scheme for On-demand Media Services on Overlay Networks

Towards Heterogeneous Multi-core Accelerators Exploiting Fine-grained Scheduling of Layer-Fused Deep Neural Networks

Throughput Optimization for Streaming Applications on CPU-FPGA Heterogeneous Systems

StreamPIM: Streaming Matrix Computation in Racetrack Memory

Fast Parallel Stream Compaction for IA-Based Multi/many-core Processors