Deming Chen, Alaa Youssef, Ruchi Pendse, André Schleife, Bryan K. Clark, Hendrik Hamann, Jingrui He, Teodoro Laino, Lav Varshney, Yuxiong Wang, Avirup Sil, Reyhaneh Jabbarvand, Tianyin Xu, Volodymyr Kindratenko, Carlos Costa, Sarita Adve, Charith Mendis, Minjia Zhang, Santiago Núñez-Corrales, Raghu Ganti, Mudhakar Srivatsa, Nam Sung Kim, Josep Torrellas, Jian Huang, Seetharami Seelam, Klara Nahrstedt, Tarek Abdelzaher, Tamar Eilam, Huimin Zhao, Matteo Manica, Ravishankar Iyer, Martin Hirzel, Vikram Adve, Darko Marinov, Hubertus Franke, Hanghang Tong, Elizabeth Ainsworth, Han Zhao, Deepak Vasisht, Minh Do, Fabio Oliveira, Giovanni Pacifici, Ruchir Puri, Priya Nagpurkar

Abstract: This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, a unified control plane, and a composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming models, middleware, and platforms, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.

The infrastructure powering IBM's Gen AI model development

Transforming the Hybrid Cloud for Emerging AI Workloads

HPC AI500: A Benchmark Suite for HPC AI Systems

Hardware-middleware system co-design for flexible training of foundation models in the cloud

Apple Intelligence Foundation Language Models

Accelerated Cloud for Artificial Intelligence (ACAI)

AI Tax: The Hidden Cost of AI Data Center Applications

Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native

Scalable Deployment of AI Time-series Models for IoT

Using the IBM Analog In-Memory Hardware Acceleration Kit for Neural Network Training and Inference

Intelligent Computing, Computational Power, Computational Power Networks and Technology Ecosystems

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Optimizing Enterprise AI Adoption with Converged Infrastructure: The Role of NVIDIA AI Enterprise and VMware in Streamlining IT Stack and Enhancing Resource Allocation

Optimizing Cloud Infrastructure for Real-time AI Processing: Challenges and Solutions

Empirical Measurements of AI Training Power Demand on a GPU-Accelerated Node

Comprehensive Performance Modeling and System Design Insights for Foundation Models

NeuNetS: An Automated Synthesis Engine for Neural Network Design

Convergence of Artificial Intelligence and High Performance Computing on NSF-supported Cyberinfrastructure

The Unseen AI Disruptions for Power Grids: LLM-Induced Transients

Power Hungry Processing: Watts Driving the Cost of AI Deployment?