Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference
Yuanbo Wen, Qi Guo, Zidong Du, Jianxing Xu, Zhenxing Zhang, Xing Hu, Wei Li, Rui Zhang, Chao Wang, Xuehai Zhou, Tianshi Chen
DOI: https://doi.org/10.1109/tc.2021.3128266
IF: 3.183
2021-01-01
IEEE Transactions on Computers
Abstract: Machine Learning Computers (MLCs) with tensor functional units (e.g., NVIDIA’s Tensor Core, Google’s TPU, and Habana’s Tensor Processor Core) have proliferated in recent years. The broad diversity of MLCs makes it hard to deploy machine learning workloads with optimized performance. Although deep learning compilers (e.g., TVM) are effective at producing optimized code for different hardware back-ends, deploying to a new MLC is tedious, because implementing platform-specific compilation optimizations requires a thorough understanding of system and architectural details. To address this problem, we propose a holistic approach to achieve one-size-fits-all compilation optimization for inference across different MLCs. The key observation is that diverse MLCs share several key architectural characteristics for tensor processing (e.g., tensor primitives and on-chip scratchpad memory), which can be generalized to conduct cross-platform compilation optimizations. Concretely, we propose the Tensor Abstract Machine (TAM), which captures these common architectural characteristics, as an abstraction of a broad range of MLCs. To leverage the architectural characteristics of the TAM, we propose the Tensor Scheduling Language (TSL), which consists of a tensor computation description and tensor scheduling primitives for implementing operations with portable optimizations. Once tensor operations are implemented in TSL, optimized code for different MLCs can be generated automatically. To validate our proposal, we conduct experiments on three commodity MLCs: a GPU with Tensor Cores, VTA (on FPGA), and Cloud TPU. Experimental results demonstrate that the code generated from the same optimization schedule achieves 1.05x to 2.05x better performance than hand-tuned libraries and deep learning compilers across different platforms.
Engineering, Electrical & Electronic; Computer Science, Hardware & Architecture
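
The abstract's central idea, one optimization schedule reused across machines that share tensor primitives and an on-chip scratchpad, can be illustrated with a small sketch. This is not TSL or the authors' implementation. `AbstractMachine`, `pick_tile`, `tensor_primitive`, and `matmul` are hypothetical names introduced here, and the machine parameters are invented for illustration. The sketch assumes square integer matrices whose size is a multiple of the primitive size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractMachine:
    """Hypothetical stand-in for a TAM-like machine description (assumption)."""
    name: str
    prim: int        # edge length of the square block one tensor primitive consumes
    scratchpad: int  # on-chip scratchpad capacity, counted in elements


def pick_tile(machine, limit):
    """Largest multiple of the primitive size, at most `limit`, such that
    three square tiles (A, B, and C) fit in the scratchpad together."""
    tile = machine.prim
    while True:
        bigger = tile + machine.prim
        if bigger > limit or 3 * bigger * bigger > machine.scratchpad:
            return tile
        tile = bigger


def tensor_primitive(a, b, c, i, j, k, p):
    """Models a fixed-size p x p x p multiply-accumulate on one block."""
    for di in range(p):
        for dj in range(p):
            acc = 0
            for dk in range(p):
                acc += a[i + di][k + dk] * b[k + dk][j + dj]
            c[i + di][j + dj] += acc


def matmul(a, b, n, machine):
    """Compute C = A x B for n x n lists using one machine-independent schedule:
    outer loops walk scratchpad-sized tiles, inner loops issue primitives."""
    p = machine.prim
    if n % p != 0:
        raise ValueError("n must be a multiple of the primitive size")
    t = pick_tile(machine, n)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                # Inside one tile, every step is a primitive-sized block.
                for i in range(i0, min(i0 + t, n), p):
                    for j in range(j0, min(j0 + t, n), p):
                        for k in range(k0, min(k0 + t, n), p):
                            tensor_primitive(a, b, c, i, j, k, p)
    return c
```

Only the `AbstractMachine` values change between targets. The loop structure in `matmul` stays the same, which is the portability property the abstract claims for schedules written against the TAM.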