SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, Kunle Olukotun
2024-05-13
Abstract: Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them.
Hardware Architecture, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the cost and complexity of training and serving monolithic large language models (LLMs), together with the memory wall created by the growing compute-to-memory ratio of modern AI accelerators. It proposes scaling the memory wall by combining Composition of Experts (CoE), streaming dataflow, and a three-tier memory system. The approach is implemented on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), which hosts a CoE system with 150 expert models and 1 trillion total parameters. In the reported experiments, the system delivers benchmark speedups over baseline configurations, reduces machine footprint, speeds up switching between models, and achieves an overall speedup over DGX systems.
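The model-switching challenge can be made concrete with a toy sketch. The code below is not the paper's implementation: the `ExpertCache` class, the keyword-based `route` function, the three placeholder experts, and the least-recently-used eviction policy are all assumptions chosen for illustration. It only shows the general idea of keeping every expert in a large capacity tier while a small fast tier holds the experts currently in use, so that a request for a non-resident expert incurs a switch.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy two-level store (hypothetical): a small fast tier caches experts
    copied from a large capacity tier that holds every expert."""

    def __init__(self, capacity_tier, fast_slots):
        self.capacity_tier = capacity_tier  # expert name -> expert (all experts)
        self.fast_slots = fast_slots        # how many experts fit in the fast tier
        self.fast_tier = OrderedDict()      # resident experts, least recently used first
        self.switches = 0                   # number of copies into the fast tier

    def get(self, name):
        if name in self.fast_tier:
            # Hit: mark the expert as most recently used.
            self.fast_tier.move_to_end(name)
            return self.fast_tier[name]
        # Miss: a model switch. Evict the least recently used expert if full.
        self.switches += 1
        if len(self.fast_tier) >= self.fast_slots:
            self.fast_tier.popitem(last=False)
        self.fast_tier[name] = self.capacity_tier[name]
        return self.fast_tier[name]


def route(prompt):
    """Hypothetical keyword router standing in for a learned router model."""
    text = prompt.lower()
    if "code" in text:
        return "code"
    if any(ch.isdigit() for ch in text):
        return "math"
    return "general"


# Placeholder experts: each just tags the prompt with its own name.
EXPERTS = {
    "code": lambda p: f"[code] {p}",
    "math": lambda p: f"[math] {p}",
    "general": lambda p: f"[general] {p}",
}


def serve(prompt, cache):
    """Route the prompt to one expert and run it from the fast tier."""
    expert = cache.get(route(prompt))
    return expert(prompt)


if __name__ == "__main__":
    cache = ExpertCache(EXPERTS, fast_slots=2)
    for prompt in ["write code", "2+2", "write code again", "hello"]:
        print(serve(prompt, cache), "| switches so far:", cache.switches)
```

In this sketch, the cost of a miss is just a counter increment. On real hardware that cost is the time to copy an expert's weights between memory tiers, which is why the speed of model switching matters for a CoE deployment.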