Hardware-middleware system co-design for flexible training of foundation models in the cloud

Seetharami R. Seelam
DOI: https://doi.org/10.1145/3568161.3568317
2022-11-07
Abstract:Foundation models are a new class of AI models that are trained on broad data (typically via self-supervision) and that can be used in different downstream tasks. Due to self-supervision and the ability to train on massive amounts of unlabeled data, these models grew to have hundreds of billions of parameters, and they take many months on hundreds of GPU to train and generate a foundation model. So, AI Systems and middleware are critical to train these foundation models in scalable, cost-effective manner. In this talk, I will discuss the architecture of a new cloud-based AI System to train large scale foundation models. The system is built entirely out of open source software stack from hypervisor to guest operating systems, from container platforms to AI frameworks and libraries. It is natively built into IBM Cloud platform and the hardware and software stack is optimized for training of foundation models on hundreds of GPUs. We trained various foundation models with state-of-the-art accuracy in the shortest time on this platform. I will discuss the architecture, operational experience, and thoughts on the directions for the co-design of hardware and middleware for future AI Systems.
Computer Science,Engineering
What problem does this paper attempt to address?