Automatic Deep Learning Operator Fusion on Sunway SW26010 Many-Core Processor

Wei Gao, Wenxiang Zhang, Wenzhao Wu, Yanjie Zhen, Wenlai Zhao, Guangwen Yang
DOI: https://doi.org/10.1109/icpads60453.2023.00266
2023-01-01
Abstract: Deep learning networks (DNNs) have been growing rapidly in recent years, with increasing demands on computing power. Therefore, accelerating the execution of DNN models has become a research hotspot. Operator fusion is a critical optimization strategy to enhance DNN performance in Deep Learning (DL) frameworks, such as TensorFlow, PyTorch, TVM and Halide. However, these frameworks are designed for general optimization and cannot fully harness the specific features of emerging hardware. Moreover, they primarily implement operator fusion at the operator level, missing many fusion opportunities and relying heavily on extensive manual optimization of fused operators. Targeting the Sunway SW26010 Many-Core processor, the basic building block of the Sunway TaihuLight supercomputer, we introduce swAutoFuser, an end-to-end automatic operator fusion and code generation framework. swAutoFuser proposes a set of low-level primitives to leverage hardware features and employs an autofuser to achieve primitive-level fusion, which breaks operator boundaries and enables more fusion opportunities. In addition, swAutoFuser can automatically generate high-performance fused operator implementations based on a static cost model, significantly reducing the overhead of manually optimizing fused operators. Our experiments demonstrate that swAutoFuser can improve operator performance by 10% to 56%.
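As a generic illustration of what operator fusion means (not swAutoFuser's primitives, code generator, or cost model, none of which are described in the abstract), the following minimal Python sketch contrasts two separate elementwise operators with a single fused loop. The function names `bias_add`, `relu`, `unfused`, and `fused_bias_relu` are hypothetical and chosen only for this example.

```python
def bias_add(x, b):
    # Operator 1: produces a full intermediate buffer.
    return [v + b for v in x]


def relu(x):
    # Operator 2: reads the intermediate buffer back and clamps negatives to 0.
    return [v if v > 0.0 else 0.0 for v in x]


def unfused(x, b):
    # Two passes over the data, with a temporary list in between.
    tmp = bias_add(x, b)
    return relu(tmp)


def fused_bias_relu(x, b):
    # Fused version: one pass, and the intermediate value stays in a local
    # variable instead of being written to and re-read from a buffer.
    out = []
    for v in x:
        s = v + b
        out.append(s if s > 0.0 else 0.0)
    return out


x = [-2.0, -0.5, 0.0, 1.5]
# Both versions compute the same result; only the memory traffic differs.
assert unfused(x, 0.5) == fused_bias_relu(x, 0.5)
```

On real hardware the benefit comes from avoiding the round trip of the intermediate tensor through main memory; how much is saved depends on the target, which is why the abstract's framework relies on a cost model rather than fusing unconditionally.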