CINOC: Computing in Network-On-Chip with Tiled Many-Core Architectures for Large-Scale General Matrix Multiplications

Yao Qin, Mingyu Wang, Jiahua Yan, Tao Lu, Zhiyi Yu
DOI: https://doi.org/10.1109/tcsi.2024.3466217
2024-01-01
Abstract: Large-scale general matrix multiplications (LMMs) are a key bottleneck in many computing domains, such as Transformer applications. However, performing LMMs efficiently on traditional multi/many-core processor systems is challenging because of the large volume of memory accesses and the tight dependencies in data transmission. To address these problems, we propose a computing-in-network-on-chip paradigm that performs LMMs while mitigating the performance losses caused by limited on-chip cache resources and memory bandwidth. Specifically, we propose a co-design method for a computable network-on-chip and the last-level cache in tiled many-core architectures, which reconfigures redundant cache capacity as a computable input buffer to balance the computing, storage, and communication demands of running LMM applications. Furthermore, a data-aware thread execution mechanism is proposed to maximize the computational efficiency of thread streams in the computable network. At the software level, a memory-friendly matrix partitioning strategy, a hybrid routing method, and a programming model are designed to bridge the gap between application demands and mismatched hardware/software interfaces. Experimental evaluations demonstrate that the proposed work reduces computational latency by 45% compared with the state-of-the-art GPU architecture and improves inference performance on the GPT network by 2$\times$.
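As background for the matrix partitioning idea mentioned in the abstract, the following is a minimal sketch of generic blocked (tiled) matrix multiplication. It is not the partitioning strategy of the paper, whose details the abstract does not give; the function name `blocked_matmul` and the `tile` parameter are hypothetical and chosen only for illustration. The sketch shows how splitting the operands into small tiles lets each partial product work on a limited working set, which is the general motivation behind memory-friendly partitioning.

```python
def blocked_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n) one tile at a time.

    Generic textbook tiling, assumed here for illustration only;
    `tile` is a hypothetical tile edge length.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Walk over tiles of the output (i0, j0) and of the shared dimension (k0).
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate the contribution of one tile pair into C.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(a, b, tile=2))
```

In a tiled many-core setting, each tile pair would typically be assigned to a different core or buffer; here the loops simply run sequentially to keep the sketch self-contained.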