Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV

Chi Zhang, Paul Scheffler, Thomas Benz, Matteo Perotti, Luca Benini
2023-11-17
Abstract: Sparse matrix vector multiplication (SpMV) is central to numerous data-intensive applications, but requires streaming indirect memory accesses that severely degrade both processing and memory throughput in state-of-the-art architectures. Near-memory hardware units, decoupling indirect streams from processing elements, partially alleviate the bottleneck, but rely on low DRAM access granularity, which is highly inefficient for modern DRAM standards like HBM and LPDDR. To fully address the end-to-end challenge, we propose a low-overhead data coalescer combined with a near-memory indirect streaming unit for AXI-Pack, an extension to the widespread AXI4 protocol packing narrow irregular stream elements onto wide memory buses. Our combined solution leverages the memory-level parallelism and coalescence of streaming indirect accesses in irregular applications like SpMV to maximize the performance and bandwidth efficiency attained on wide memory interfaces. Our solution delivers an average speedup of 8x in effective indirect access, often reaching the full memory bandwidth. As a result, we achieve an average end-to-end speedup on SpMV of 3x. Moreover, our approach demonstrates remarkable on-chip efficiency, requiring merely 27kB of on-chip storage and a very compact implementation area of 0.2-0.3mm^2 in a 12nm node.
Hardware Architecture
What problem does this paper attempt to address?
This paper addresses the performance bottleneck that indirect memory accesses create for sparse matrix-vector multiplication (SpMV) on modern high-bandwidth memory interfaces such as HBM and LPDDR. SpMV is central to many data-intensive applications, but general-purpose architectures handle it inefficiently: indirect addressing and irregular, non-contiguous accesses to vector elements lead to low memory bandwidth utilization, cache pollution, and increased access latency. The paper proposes a low-overhead data coalescer combined with a near-memory indirect streaming unit for AXI-Pack, an extension to the AXI4 protocol. The combination coalesces streaming indirect accesses to maximize performance and bandwidth efficiency, especially on wide memory interfaces. This raises the effective bandwidth of indirect accesses while keeping the demand for on-chip resources low, so overall SpMV performance improves significantly without a large increase in hardware overhead.
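To make the access pattern concrete, the following is a minimal software sketch, not the paper's hardware design. It assumes the matrix is stored in the common CSR format, and it models coalescing as grouping a fixed window of consecutive indices by the wide memory block they fall into. The function names and the `elems_per_block` and `window` parameters are hypothetical and chosen only for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x[col_idx[k]] is the indirect access: the address depends
            # on data (col_idx) that is itself streamed from memory.
            y[row] += values[k] * x[col_idx[k]]
    return y


def count_wide_requests(col_idx, elems_per_block, window):
    """Return (uncoalesced, coalesced) counts of wide-block requests.

    Assumption: without coalescing, every index triggers its own wide
    request; with coalescing, indices inside the same window that fall
    into the same wide block share a single request.
    """
    uncoalesced = len(col_idx)
    coalesced = 0
    for start in range(0, len(col_idx), window):
        blocks = {c // elems_per_block for c in col_idx[start:start + window]}
        coalesced += len(blocks)
    return uncoalesced, coalesced


# A small 3x8 example matrix with 7 nonzeros.
row_ptr = [0, 3, 5, 7]
col_idx = [0, 1, 5, 2, 3, 4, 7]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0] * 8

print(spmv_csr(row_ptr, col_idx, values, x))
print(count_wide_requests(col_idx, elems_per_block=4, window=4))
```

In this toy model, the gap between the two counts indicates how much wide-bus bandwidth is wasted when each narrow element is fetched with its own wide request. The real benefit depends on the matrix's sparsity pattern and on the actual coalescer design, neither of which this sketch captures.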