Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization

Lei Chen, Xiong Gao, Ruijie Xia, Bin Cheng, Zhaochuang Zhang, Zhen Geng, Renwei Zhang, Xuefeng Jin, Jie Zhao, Deshi Chen
Abstract: We study fusion for deep neural networks (DNNs) in a just-in-time (JIT) compilation framework, Apollo. It considers both memory- and compute-bound tensor operators for fusion, and closely integrates graph-level node grouping with operator-level loop fusion, widening the fusion search space. Apollo enables upward feedback from the downstream loop optimizer, forcing the graph engine to regenerate partition patterns amenable to the downstream pass and thus resolving the scalability issue. Besides data locality, Apollo also exploits the parallelism between independent tensor operators, further improving the performance of DNN workloads. Experimental results on training workloads show that Apollo outperforms TensorFlow and XLA by 1.86× and 1.37× on a single GPU, and by 1.96× and 1.18× on multiple GPUs. Apollo also improves the performance of a vendor-provided DNN framework by 19.7% on a domain-specific accelerator. In addition, results on inference workloads demonstrate the general applicability of our fusion framework.
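As a rough illustration of what operator-level loop fusion means for memory-bound operators, the sketch below compares two element-wise operators run back to back with a single fused loop. It is not Apollo's implementation: the function names `unfused` and `fused` and the choice of a scale, bias, and ReLU chain are assumptions made only for this example, and plain Python lists stand in for tensors.

```python
# Illustrative sketch only; names and operators are hypothetical, not from Apollo.

def unfused(x, scale, bias):
    # Operator 1: element-wise multiply. The full intermediate result is
    # materialized, which on real hardware means a round trip through memory.
    tmp = [v * scale for v in x]
    # Operator 2: element-wise add followed by ReLU, reading the intermediate back.
    return [max(t + bias, 0.0) for t in tmp]


def fused(x, scale, bias):
    # Fused version: one loop computes both operators per element,
    # so no intermediate tensor is stored between them.
    return [max(v * scale + bias, 0.0) for v in x]


x = [-2.0, -0.5, 0.0, 1.5]
# Both versions compute the same values; only the loop structure differs.
assert unfused(x, 2.0, 1.0) == fused(x, 2.0, 1.0)
```

The fused form trades two passes over the data for one, which is the data-locality benefit the abstract refers to; deciding which operators can be grouped this way is the harder problem the paper addresses.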
Computer Science