SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, Kunle Olukotun

2024-05-13

Abstract: Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them.
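Challenge (1) can be made concrete with a roofline-style estimate. The sketch below is illustrative only and is not taken from the paper: the function names, the matmul-plus-activation workload, and the hardware numbers in the comments are assumptions chosen for the example. It computes operational intensity (FLOPs per byte moved to or from off-chip memory) for a matrix multiply followed by an elementwise activation, with and without fusing the two operations.

```python
def chain_intensity(m, k, n, fused, bytes_per_elem=2):
    """FLOPs per off-chip byte for act(A[m,k] @ B[k,n]).

    Assumed traffic model: each input is read once and the final
    output is written once. Without fusion, the matmul result is
    also written out and read back before the activation runs.
    """
    flops = 2 * m * k * n + m * n          # matmul plus one op per output element
    elems = m * k + k * n + m * n          # read A, read B, write final output
    if not fused:
        elems += 2 * m * n                 # extra round trip for the intermediate
    return flops / (bytes_per_elem * elems)


def attainable_tflops(intensity, peak_tflops, bandwidth_tb_per_s):
    """Roofline bound: limited by either compute or memory bandwidth."""
    return min(peak_tflops, bandwidth_tb_per_s * intensity)


if __name__ == "__main__":
    # Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
    for batch in (1, 64, 4096):
        for fused in (False, True):
            oi = chain_intensity(batch, 4096, 4096, fused)
            bound = attainable_tflops(oi, 100.0, 2.0)
            print(f"batch={batch:5d} fused={fused!s:5} "
                  f"intensity={oi:9.2f} FLOP/B bound={bound:7.2f} TFLOP/s")
```

Under this simplified model, small batches leave the computation bandwidth-bound regardless of fusion, while fusion removes the intermediate round trip and raises intensity as the output tensor grows. A real accelerator's behavior depends on details this model ignores, such as on-chip buffering and tiling.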

Hardware Architecture, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the cost and complexity of training and deploying large language models (LLMs), together with the memory wall created by the imbalance between compute and memory in modern AI accelerators. It proposes scaling the AI memory wall by combining Composition of Experts (CoE) models, streaming dataflow, and a three-tier memory system. The approach is implemented on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), which hosts a CoE system with 150 expert models and 1 trillion total parameters. In the reported experiments, the system runs faster than the baselines, requires a smaller machine footprint, switches between models more quickly, and achieves an overall speedup over DGX systems.
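The model-switching cost mentioned above can be illustrated with a toy cache model. This is a minimal sketch under stated assumptions, not the SN40L's actual mechanism: the `ExpertCache` class, the least-recently-used eviction policy, and all capacities and bandwidths are hypothetical. It treats a small fast memory tier as a cache over a larger, slower tier that holds every expert, and charges a copy time whenever a requested expert is not already resident.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model of a fast memory tier caching experts from a slower tier."""

    def __init__(self, fast_capacity_gb, copy_bandwidth_gb_per_s):
        self.capacity = fast_capacity_gb
        self.bandwidth = copy_bandwidth_gb_per_s
        self.resident = OrderedDict()   # expert name -> size in GB, LRU first

    def activate(self, name, size_gb):
        """Make an expert resident and return the modeled switch time in seconds."""
        if name in self.resident:
            self.resident.move_to_end(name)     # mark as most recently used
            return 0.0
        if size_gb > self.capacity:
            raise ValueError("expert does not fit in the fast tier")
        # Evict least recently used experts until the new one fits.
        while sum(self.resident.values()) + size_gb > self.capacity:
            self.resident.popitem(last=False)
        self.resident[name] = size_gb
        return size_gb / self.bandwidth         # time to copy weights in


if __name__ == "__main__":
    # Hypothetical numbers: 40 GB fast tier, 64 GB/s copy bandwidth, 16 GB experts.
    cache = ExpertCache(fast_capacity_gb=40, copy_bandwidth_gb_per_s=64)
    for expert in ("code", "math", "code", "legal", "math"):
        seconds = cache.activate(expert, 16)
        print(f"{expert:6s} switch time {seconds:.2f} s, "
              f"resident: {list(cache.resident)}")
```

In this model the switch time scales inversely with the copy bandwidth between tiers, which is why a high-bandwidth path from the capacity tier to the fast tier matters when many experts are hosted and only a few fit in fast memory at once.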
