A Scalable Multi-TeraOPS Core for AI Training and Inference

Sunil Shukla, Bruce Fleischer, Matthew Ziegler, Joel Silberman, Jinwook Oh, Vijayalakshmi Srinivasan, Jungwook Choi, Silvia Mueller, Ankur Agrawal, Tina Babinsky, Nianzheng Cao, Chia-Yu Chen, Pierce Chuang, Thomas Fox, George Gristede, Michael Guillorn, Howard Haynie, Michael Klaiber, Dongsoo Lee, Shih-Hsien Lo, Gary Maier, Michael Scheuermann, Swagath Venkataramani, Christos Vezyrtzis, Naigang Wang, Fanchieh Yee, Ching Zhou, Pong-Fei Lu, Brian Curran, Leland Chang, Kailash Gopalakrishnan
DOI: https://doi.org/10.1109/lssc.2019.2902738
2018-12-01
Abstract: This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and a custom ISA, the engine achieves >90% sustained utilization across a range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A custom 16b floating-point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has been developed for high model accuracy in training and inference, along with 1b/2b (binary/ternary) integer representations for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14-nm CMOS.
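To make the 1/6/9 bit split and the peak-throughput figures concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract gives only the field widths, so the exponent bias of 31, the implicit leading one, the subnormal handling, and the absence of Inf/NaN encodings are all assumptions, and the names `decode_fp16_1_6_9` and `ops_per_cycle` are hypothetical. The second helper simply divides a quoted peak rate by the 1.5 GHz clock; how those operations map onto hardware units is not stated in the abstract.

```python
# Illustrative decoder for a 16-bit float with 1 sign, 6 exponent, 9 mantissa bits.
# ASSUMPTIONS (not given in the abstract): bias = 31, implicit leading one,
# exponent field 0 treated as subnormal, no Inf/NaN encodings.

EXP_BITS = 6
MAN_BITS = 9
BIAS = (1 << (EXP_BITS - 1)) - 1  # 31, an assumed IEEE-style bias


def decode_fp16_1_6_9(word: int) -> float:
    """Decode a 16-bit word laid out as [sign | 6-bit exponent | 9-bit mantissa]."""
    if not 0 <= word < (1 << 16):
        raise ValueError("word must fit in 16 bits")
    sign = -1.0 if (word >> (EXP_BITS + MAN_BITS)) & 1 else 1.0
    exp = (word >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    frac = (word & ((1 << MAN_BITS) - 1)) / (1 << MAN_BITS)
    if exp == 0:
        # Assumed subnormal range: no implicit leading one.
        return sign * frac * 2.0 ** (1 - BIAS)
    return sign * (1.0 + frac) * 2.0 ** (exp - BIAS)


def ops_per_cycle(peak_ops_per_second: int, clock_hz: int) -> float:
    """Operations per clock cycle implied by a peak rate and a clock frequency."""
    return peak_ops_per_second / clock_hz


if __name__ == "__main__":
    CLOCK_HZ = 1_500_000_000  # 1.5 GHz, from the abstract
    print(decode_fp16_1_6_9(31 << MAN_BITS))                   # exponent field == bias
    print(ops_per_cycle(1_500_000_000_000, CLOCK_HZ))          # fp16 peak
    print(ops_per_cycle(12_000_000_000_000, CLOCK_HZ))         # ternary peak
    print(ops_per_cycle(24_000_000_000_000, CLOCK_HZ))         # binary peak
```

Under these assumptions, the quoted peaks correspond to 1,000 fp16, 8,000 ternary, and 16,000 binary operations per cycle. Compared with IEEE half precision (5 exponent bits, 10 mantissa bits), the extra exponent bit trades one bit of precision for a wider dynamic range.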