The infrastructure powering IBM's Gen AI model development

Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla, Lan Hoang, Danny Barnett, I-Hsin Chung, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Marc Dombrowa, Yoonho Park, Apo Kayi, Liran Schour, Alim Alim, Ali Sydney, Pavlos Maniotis, Laurent Schares, Bernard Metzler, Bengi Karacali-Akyamac, Sophia Wen, Tatsuhiro Chiba, Sunyanan Choochotkaew, Takeshi Yoshimura, Claudia Misale, Tonia Elengikal, Kevin O Connor, Zhuoran Liu, Richard Molina, Lars Schneidenbach, James Caden, Christopher Laibinis, Carlos Fonseca, Vasily Tarasov, Swaminathan Sundararaman, Frank Schmuck, Scott Guthridge, Jeremy Cohn, Marc Eshel, Paul Muench, Runyu Liu, William Pointer, Drew Wyskida, Bob Krull, Ray Rose, Brent Wolfe, William Cornejo, John Walter, Colm Malone, Clifford Perucci, Frank Franco, Nigel Hinds, Bob Calio, Pavel Druyan, Robert Kilduff, John Kienle, Connor McStay, Andrew Figueroa, Matthew Connolly, Edie Fost, Gina Roma, Jake Fonseca, Ido Levy, Michele Payne, Ryan Schenkel, Amir Malki, Lion Schneider, Aniruddha Narkhede, Shekeba Moshref, Alexandra Kisin, Olga Dodin, Bill Rippon, Henry Wrieth, John Ganci, Johnny Colino, Donna Habeger-Rose, Rakesh Pandey, Aditya Gidh, Aditya Gaur, Dennis Patterson, Samsuddin Salmani, Rambilas Varma, Rumana Rumana, Shubham Sharma, Aditya Gaur, Mayank Mishra, Rameswar Panda, Aditya Prasad, Matt Stallone, Gaoyuan Zhang, Yikang Shen, David Cox, Ruchir Puri, Dakshi Agrawal, Drew Thorstensen, Joel Belog, Brent Tang, et al. (46 additional authors not shown)
2024-07-08
Abstract: AI infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software, and holistic telemetry to serve multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela, an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant, and geographically distributed infrastructure for large-scale model training and other AI workflow steps, and (2) Blue Vela, a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use and the flexibility to adapt to an evolving commercial landscape. Blue Vela lets us rapidly develop our largest and most ambitious models and future-proofs us against the evolving model landscape in the industry. Taken together, they give IBM the ability to innovate rapidly in the development of both AI models and commercial offerings.
Subjects: Distributed, Parallel, and Cluster Computing; Artificial Intelligence