SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, Kunle Olukotun
2024-05-13
Abstract: Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them.
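To make challenge (1) concrete, the following is a minimal roofline-style sketch, not taken from the paper. It estimates operational intensity (FLOPs per byte of off-chip traffic) for a matrix multiply followed by an elementwise activation. The function name, the shapes, the 2-byte element size, and the assumption that each tensor crosses off-chip memory exactly once per kernel are all illustrative assumptions.

```python
def gemm_act_intensity(m: int, k: int, n: int, fused: bool,
                       bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for act(A[m,k] @ B[k,n]) under a simplified traffic model.

    Assumption (hypothetical model): every tensor is read from or written to
    off-chip memory exactly once per kernel that touches it.
    """
    # 2*m*k*n multiply-adds for the GEMM, plus one op per output element
    # for the activation.
    flops = 2 * m * k * n + m * n
    # The GEMM reads A and B and writes its m-by-n result.
    elems_moved = m * k + k * n + m * n
    if not fused:
        # A separate activation kernel re-reads the intermediate result
        # and writes the final output.
        elems_moved += 2 * m * n
    return flops / (bytes_per_elem * elems_moved)


# Same batch of 256 rows; hidden sizes chosen only for illustration.
small_unfused = gemm_act_intensity(256, 1024, 1024, fused=False)
small_fused = gemm_act_intensity(256, 1024, 1024, fused=True)
large_unfused = gemm_act_intensity(256, 8192, 8192, fused=False)

print(f"small, unfused: {small_unfused:.1f} FLOPs/byte")
print(f"small, fused:   {small_fused:.1f} FLOPs/byte")
print(f"large, unfused: {large_unfused:.1f} FLOPs/byte")
```

Under this simplified model, the smaller layer has lower intensity than the larger one at the same batch size, and fusing the activation into the matrix multiply recovers part of the gap by removing the round trip for the intermediate result.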
Hardware Architecture, Artificial Intelligence