A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling
Sae Kyu Lee, Ankur Agrawal, Joel Silberman, Matthew Ziegler, Mingu Kang, Swagath Venkataramani, Nianzheng Cao, Bruce Fleischer, Michael Guillorn, Matthew Cohen, Silvia M. Mueller, Jinwook Oh, Martin Lutz, Jinwook Jung, Siyu Koswatta, Ching Zhou, Vidhi Zalani, Monodeep Kar, James Bonanno, Robert Casatuta, Chia-Yu Chen, Jungwook Choi, Howard Haynie, Alyssa Herbert, Radhika Jain, Kyu-Hyoun Kim, Yulong Li, Zhibin Ren, Scot Rider, Marcel Schaal, Kerstin Schelm, Michael R. Scheuermann, Xiao Sun, Hung Tran, Naigang Wang, Wei Wang, Xin Zhang, Vinay Shah, Brian Curran, Vijayalakshmi Srinivasan, Pong-Fei Lu, Sunil Shukla, Kailash Gopalakrishnan, Leland Chang
DOI: https://doi.org/10.1109/jssc.2021.3120113
2022-01-01
Abstract: Reduced-precision computation is a key enabling factor for energy-efficient acceleration of deep learning (DL) applications. This article presents a 7-nm four-core mixed-precision artificial intelligence (AI) chip that supports four compute precisions, FP16, Hybrid-FP8 (HFP8), INT4, and INT2, to meet diverse application demands for training and inference. The chip leverages recent algorithmic advances to demonstrate leading-edge power efficiency for 8-bit floating-point (FP8) training and INT4 inference without degrading model accuracy. A new HFP8 format, combined with separate floating-point and fixed-point pipelines and aggressive circuit and architecture optimization, improves performance while maintaining high compute utilization. A high-bandwidth ring protocol enables efficient data communication, and power management based on workload-aware clock throttling maximizes performance within a given power budget. The AI chip demonstrates 3.58-TFLOPS/W peak energy efficiency and 26.2-TFLOPS peak performance for HFP8 iso-accuracy training, and 16.9-TOPS/W peak energy efficiency and 104.9-TOPS peak performance for INT4 iso-accuracy inference.
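To make the idea of a hybrid 8-bit floating-point format concrete, the following is a minimal Python sketch of rounding a value to a small sign/exponent/mantissa format. The abstract does not give the HFP8 bit layout. The sketch assumes the split commonly described for HFP8 in the literature (1 sign, 4 exponent, 3 mantissa bits for the forward pass; 1 sign, 5 exponent, 2 mantissa bits for the backward pass), an IEEE-style exponent bias, a reserved top exponent code, and saturation on overflow. The function name `quantize_fp` and all of these parameter choices are illustrative assumptions, not the chip's actual datapath.

```python
import math


def quantize_fp(x, exp_bits, man_bits, bias=None):
    """Round x to the nearest value of a (1, exp_bits, man_bits) float format.

    Assumptions (hypothetical, not taken from the paper):
    - IEEE-style bias of 2**(exp_bits - 1) - 1 unless one is supplied
    - the top exponent code is reserved, as in IEEE 754
    - values beyond the largest finite magnitude saturate
    - subnormals are supported (fixed step below the smallest normal)
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1
    if x == 0.0 or math.isnan(x):
        return x

    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)

    e_min = 1 - bias                       # exponent of the smallest normal
    e_max = (2 ** exp_bits - 2) - bias     # exponent of the largest normal
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max

    # math.frexp returns mag = m * 2**e with 0.5 <= m < 1,
    # so the exponent of the leading bit is e - 1.
    _, e = math.frexp(mag)
    exponent = max(e - 1, e_min)

    # Spacing between representable values in this binade.
    step = 2.0 ** (exponent - man_bits)

    # Python's round() rounds halves to even, matching round-to-nearest-even.
    q = round(mag / step) * step
    return sign * min(q, max_val)


# Forward-style format (assumed 1-4-3): finer mantissa, smaller range.
fwd = quantize_fp(1.1, exp_bits=4, man_bits=3)
# Backward-style format (assumed 1-5-2): coarser mantissa, larger range.
bwd = quantize_fp(1.1, exp_bits=5, man_bits=2)
```

Under these assumptions the same input lands on different grid points in the two formats, which illustrates the trade-off a hybrid scheme exploits: more mantissa precision where values are well bounded, more exponent range where values (such as gradients) span many orders of magnitude.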
engineering, electrical & electronic